Try some QD8-BF16 Experiments #11466
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11466
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit d4845a0 with merge base fff7b3c. This comment was automatically generated by Dr. CI and updates every 15 minutes.
can we check against bf16 llm w/ and w/o xnnpack qd8-bf16-qb4w (running q/dq if possible or even non quantized)?
We've prototyped some new QD8-BF16-QB4W kernels in XNNPACK. Let's try leveraging them in ExecuTorch and see what our performance looks like:
Exports:
There is no change in model size, since this only affects activations. To make the comparison fairer, we removed delegation of all operators other than qd8-bf16-qb4w, because XNNPACK still lacks the other bf16 operators.
A few things to notice here. The BF16 model uses 1/3 of the memory of the fp32 model. We also see some performance drops in BF16. This is likely because the GEMM kernel contains an extra shift to produce bf16 (values are still calculated in f32, then right-shifted before storing). Additionally, the Quantize kernel for bf16 --> qd8 is still a naive implementation, so it is a bit slower. Finally, the results seem to be nonsensical.
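To make the two slow paths above concrete, here is a minimal Python sketch. It is not the XNNPACK kernel code: the function names are hypothetical, the f32 to bf16 step is modeled as plain truncation (a 16-bit right shift with no rounding), and the bf16 --> qd8 step is assumed to be a per-row asymmetric int8 quantization with a dynamically computed scale and zero point.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Truncate an f32 value to bf16 by keeping only its top 16 bits.

    Assumption: models the "right shift before storing" described above,
    with no rounding applied.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_f32(b: int) -> float:
    """Widen bf16 bits back to f32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def quantize_row_qd8(row):
    """Naive dynamic int8 quantization of one activation row.

    Assumption: per-row asymmetric scheme; the range is widened to
    include 0.0 so that zero is exactly representable.
    """
    lo = min(min(row), 0.0)
    hi = max(max(row), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-128 - lo / scale))
    q = [max(-128, min(127, int(round(v / scale)) + zero_point)) for v in row]
    return q, scale, zero_point


if __name__ == "__main__":
    b = f32_to_bf16_bits(3.14159)
    print(hex(b), bf16_bits_to_f32(b))
    print(quantize_row_qd8([0.0, 255.0]))
```

The truncation drops the low 16 mantissa bits, so a value such as 3.14159 comes back as 3.140625 after a round trip. The quantization sketch scans each row twice (once for the range, once to quantize), which is the kind of straightforward implementation that leaves room for optimization.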