GroupedBlockQuantizeOp PR1: Adding codegen support #5776
base: jj/grouped_block_quantize_op_0
Conversation

Review updated until commit fc79a9c

Description

Relevant files:
- Enhancement: 25 files
- Tests: 1 file
PR Reviewer Guide

Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Runtime Function Interface

Greptile Summary

Confidence score: 4/5
26 files reviewed, 6 comments
GroupedBlockQuantizationOp::GroupedBlockQuantizationOp(
    IrBuilderPasskey passkey,
    Val* output_scales,
    Val* output,
    Val* input,
    Val* input_offsets,
    Val* output_offsets,
    BlockScalingFactorLayout layout,
    Val* k,
    Val* g,
    Val* global_scale,
    int64_t block_size,
    Val* row_idx,
    Val* col_idx)
    : Expr(passkey) {
style: Consider adding parameter validation similar to the parent operations (BlockQuantizationOp and PreprocessGroupedMatmulInputSf) to ensure inputs meet expected constraints
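A minimal sketch of the kind of checks this comment has in mind, assuming the parent ops guard against null operands and a non-positive block size; the specific constraints below are illustrative, not taken from this PR:

```cpp
// Illustrative validation only; the actual constraints enforced by
// BlockQuantizationOp / PreprocessGroupedMatmulInputSf are assumed here.
NVF_ERROR(input != nullptr, "input tensor is required");
NVF_ERROR(
    output != nullptr && output_scales != nullptr,
    "both quantized output and output_scales are required");
NVF_ERROR(
    input_offsets != nullptr && output_offsets != nullptr,
    "input_offsets and output_offsets are required for group indexing");
NVF_ERROR(block_size > 0, "block_size must be positive");
```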
NVF_ERROR(
    output_dtype != DataType::Float8_e4m3fn,
    "output of Float8_e4m3fn is not yet implemented");
logic: Different error handling compared to BlockQuantizationOp - this throws an error for Float8_e4m3fn output while BlockQuantizationOp only restricts global scale usage for this type. Is this difference in Float8_e4m3fn handling intentional, or should GroupedBlockQuantizationOp support this data type with the same restrictions as BlockQuantizationOp?
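For contrast, the restriction the comment attributes to BlockQuantizationOp might look roughly like the following; this is an illustrative sketch, since the exact condition BlockQuantizationOp uses is not shown in this PR:

```cpp
// Hypothetical: accept Float8_e4m3fn output but disallow a global scale with
// it, mirroring the behavior the comment describes for BlockQuantizationOp.
NVF_ERROR(
    output_dtype != DataType::Float8_e4m3fn || global_scale == nullptr,
    "global scale is not supported with Float8_e4m3fn output");
```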
} else if (
    tv->definition() && tv->definition()->isA<GroupedBlockQuantizationOp>()) {
  auto bqop = tv->definition()->as<GroupedBlockQuantizationOp>();
  if (tv == bqop->blockScales()) {
    skip_validation = true;
  }
}
style: Missing swizzle condition check for consistency with BlockQuantizationOp. The existing logic checks bqop->isSwizzledScales() but this version only checks if the tensor is blockScales. Should this also check for a swizzled condition like the BlockQuantizationOp case, or are all GroupedBlockQuantizationOp blockScales always meant to skip validation?
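For reference, the suggested check could look like the sketch below if the grouped op mirrored BlockQuantizationOp; it assumes an isSwizzledScales() accessor on GroupedBlockQuantizationOp, which this PR may not provide:

```cpp
// Hypothetical: only skip validation when the scales are actually swizzled,
// mirroring the BlockQuantizationOp path described in the comment.
if (tv == bqop->blockScales() && bqop->isSwizzledScales()) {
  skip_validation = true;
}
```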
auto has_block_quantization_ops =
    HeuristicDataCacheEntry<HeuristicCompileTime::HasBlockQuantizationOps>(
        data_cache,
        [fusion]() {
          return std::make_unique<bool>(
              !ir_utils::getOpsOfType<BlockQuantizationOp>(fusion).empty() ||
              !ir_utils::getOpsOfType<GroupedBlockQuantizationOp>(fusion)
                   .empty());
        })
        .get();
style: Duplicate logic - the same check for block quantization ops is performed twice in this function. Consider extracting this into a helper function or reusing the cache entry from canScheduleRunTime.
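One possible shape for the suggested extraction; the helper's name, signature, and placement are illustrative, not from this PR:

```cpp
// Hypothetical helper that both call sites could reuse instead of repeating
// the two getOpsOfType checks.
bool hasBlockQuantizationOps(Fusion* fusion) {
  return !ir_utils::getOpsOfType<BlockQuantizationOp>(fusion).empty() ||
      !ir_utils::getOpsOfType<GroupedBlockQuantizationOp>(fusion).empty();
}
```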
bool hasGlobalScale() const {
  if (inputs().size() > 5) {
    return true;
  }
  return false;
}
style: Inconsistent global scale detection logic compared to BlockQuantizationOp which checks inputs().size() > 1. Is the different threshold (5 vs 1) intentional based on the different input structure?
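As a side note, the predicate could be written more compactly; this sketch assumes, as the quoted code implies, that the global scale is registered as the sixth input when present:

```cpp
// Equivalent, more concise form of the quoted predicate.
bool hasGlobalScale() const {
  return inputs().size() > 5;
}
```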
auto block_scales_dtype = (out_dtype == DataType::Float4_e2m1fn)
    ? DataType::Float8_e4m3fn
    : DataType::Float8_e8m0fnu;
NVF_ERROR_EQ(inp_domain.size(), 2);
style: This constraint requires exactly 2D input tensors. Should this validation be moved earlier in the function so it fails fast?
Context
This series of PRs aims to enable a single kernel for quantization and block scaling factor layout handling on grouped tensors.
The existing solution for nvfp4 quantization of the activation tensor for grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor then needs to be processed by PreprocessGroupedMatmulInputSf to satisfy the swizzle layout required by grouped_mm kernels.
This series of PRs merges the two operations into a single one.
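Roughly, where the existing path builds a BlockQuantizationOp and then a PreprocessGroupedMatmulInputSf on its block scaling factor, the fused node is built in one step. Below is a sketch using the constructor shown in the diff above; the IrBuilder::create call, the placeholder tensor/value names, and the block size of 16 are assumptions, not code from this PR:

```cpp
// Hypothetical single-step construction replacing the two-op sequence.
// All names are placeholders; `layout` is whatever BlockScalingFactorLayout
// value the downstream grouped_mm kernel expects.
IrBuilder::create<GroupedBlockQuantizationOp>(
    /*output_scales=*/block_scaling_factor_tv,
    /*output=*/quantized_tv,
    /*input=*/activation_tv,
    /*input_offsets=*/input_offsets_tv,
    /*output_offsets=*/output_offsets_tv,
    /*layout=*/layout,
    /*k=*/k_val,
    /*g=*/g_val,
    /*global_scale=*/global_scale_tv,
    /*block_size=*/16,
    /*row_idx=*/row_idx_val,
    /*col_idx=*/col_idx_val);
```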
Stacked PRs
#5775 GroupedBlockQuantizationOp PR0: Adding runtime function
#5776 GroupedBlockQuantizationOp PR1: Adding codegen support
#5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark
What's in this PR
Adds the Fusion IR node GroupedBlockQuantizationOp. The operation is a combination of BlockQuantizationOp and PreprocessGroupedMatmulInputSf, and it inherits all of the validation/checks from those two operations.
The operation is similar to BlockQuantizationOp, except that:
i. The block scaling factor output doesn't have the swizzle logic represented as allocation domain transformations;
ii. It takes additional inputs (input_offsets and output_offsets) to facilitate group indexing, similar to PreprocessGroupedMatmulInputSf.
Adds a C++ test case for GroupedBlockQuantizationOp.