[not for land] float8 blockwise scaling training prototype using deep_gemm #2386
Since this is a common community request, I did a test drive of how we could integrate deep_gemm (https://github.com/deepseek-ai/DeepGEMM), which provides fp8 gemms with blockwise scaling, into an e2e training workflow.

What I saw: something is funky with how we are wrapping the 128_1_128_1 gemm used to calculate grad_weight: https://gist.github.com/vkuzo/6e9cacb226593f7e5f27ac5cd5e79fb1. For now, work around this by leaving the gemm that calculates grad_weight in bf16 (see the sketch after the list below).

If we were to integrate this, here is the path forward:
* make deep_gemm's 128_1_128_1 gemm work properly, or write our own, or just leave this matmul in bf16
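
For illustration only, here is a minimal sketch of what the bf16 grad_weight workaround could look like in a custom autograd function. This is not the PR's actual code; `fp8_blockwise_mm` is a hypothetical stand-in for a wrapper around deep_gemm's blockwise-scaled fp8 gemm, shown as a plain matmul so the sketch runs anywhere.

```python
import torch

def fp8_blockwise_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in: in the real prototype this would quantize a and b
    # to float8 with blockwise scales and dispatch to deep_gemm; here it is a
    # plain matmul so the sketch is self-contained.
    return a @ b

class Float8BlockwiseMM(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        # out = input @ weight.t(), via the fp8 blockwise-scaled gemm
        return fp8_blockwise_mm(input, weight.t())

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        # grad_input can also go through the fp8 blockwise-scaled gemm
        grad_input = fp8_blockwise_mm(grad_output, weight)
        # workaround: keep the grad_weight matmul in bf16 instead of the
        # 128_1_128_1 scaled fp8 gemm
        grad_weight = (grad_output.t().to(torch.bfloat16)
                       @ input.to(torch.bfloat16)).to(weight.dtype)
        return grad_input, grad_weight

# usage: y = Float8BlockwiseMM.apply(x, w), with x of shape (M, K) and w of shape (N, K)
```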