support dynamic shapes in warp specialized inner outer persistent scheduler #5765
base: main
Conversation
Review updated until commit 7397013

Description

| Relevant files | |
|---|---|
| Enhancement | 8 files |
| Tests | |
PR Reviewer Guide

Here are some key observations to aid the review process:

- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Warp reduction logic complexity
Test failures

- (Medium, 26) nvFuser validation mismatches in CombinedSchedulerTest and Gpu1Test suites

  | Test Name | A100 | GB200 | H100 | Source |
  |---|---|---|---|---|
  | CombinedSchedulerTest.IllegalSizeToUseTMA | | ❌ | ❌ | Link |
  | CombinedSchedulerTest.LayerNormBackward/dtype___half_batch_216_hidden_1024 | ❌ | ❌ | ❌ | Link |
  | CombinedSchedulerTest.LayerNormBackward/dtype___half_batch_216_hidden_768 | ❌ | ❌ | ❌ | Link |
  | CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_1024 | ❌ | ❌ | ❌ | Link |
  | CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_576 | ❌ | ❌ | ❌ | Link |
  | CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_768 | ❌ | ❌ | ❌ | Link |
  | CombinedSchedulerTest.LayerNormBackward/dtype_float_batch_216_hidden_1024 | ❌ | ❌ | ❌ | Link |
  | CombinedSchedulerTest.LayerNormBackward/dtype_float_batch_216_hidden_768 | ❌ | ❌ | ❌ | Link |
  | Gpu1Test.FusionMagicSchedulerRMSNormBackward_CUDA | ❌ | ❌ | ❌ | Link |

- (Medium, 10) DistributedTransformer backward fp16/bf16 numerical mismatches across runners

  | Test Name | A100 | A100 (dist.) | GB200 | GB200 (dist.) | H100 | Source |
  |---|---|---|---|---|---|---|
  | DistributedTransformerTest.Backward/__bfloat | ❌ | ❌ | ❌ | ❌ | ❌ | Link |
  | DistributedTransformerTest.Backward/__half | ❌ | ❌ | ❌ | ❌ | ❌ | Link |

- (Medium, 3) Gradient mismatch in nanoGPT CUDAGraphs nvFuser test (thunder/tests/test_networks)

  | Test Name | A100 | GB200 | H100 | Source |
  |---|---|---|---|---|
  | thunder.tests.test_networks.test_nanogpt_complete_cudagraphs_autograd_nvfuser_cuda_thunder.dtypes.float32 | ❌ | ❌ | ❌ | |

- (Medium, 3) Thunder NVFuser gradient mismatches in test_grad::test_populate_grads_block

  | Test Name | A100 | GB200 | H100 | Source |
  |---|---|---|---|---|
  | thunder.tests.test_grad.test_populate_grads_block_nvfuser_cuda_thunder.dtypes.float32 | ❌ | ❌ | ❌ | |
Greptile Summary

This PR enables dynamic shape support in the warp specialized inner-outer persistent scheduler by threading LaunchParams through the compilation pipeline to GpuLower. The approach maintains correctness by computing launch params early during compilation (before lowering) and making them available throughout the lowering pipeline.

Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant KernelExecutor
participant CompiledKernel
participant GpuLower
participant Scheduler
participant Lowering
User->>KernelExecutor: compile(launch_constraints)
KernelExecutor->>CompiledKernel: new CompiledKernel(launch_params)
CompiledKernel->>GpuLower: new GpuLower(cparams, lparams)
Note over GpuLower: Stores lparams_ as member
GpuLower->>Scheduler: Schedule tensors
Note over Scheduler: Uses batches_per_block_inner_reduction<br/>for dynamic split instead of static bdimx
GpuLower->>Lowering: Lower to device code
Lowering->>GpuLower: launchParams().getDim(TIDx)
Note over Lowering: parallel_dimension_map and<br/>fused_reduction check GpuLower::current()<br/>->launchParams() for dynamic dimensions
Lowering-->>GpuLower: Use launch param value
GpuLower-->>CompiledKernel: Lowered kernel
CompiledKernel-->>KernelExecutor: Compiled kernel
Note over KernelExecutor: Runtime execution uses<br/>actual launch params for<br/>warp reduction optimization
```
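The fused_reduction step in the diagram above is where the static warp reduction decision happens. Below is a minimal sketch of that kind of check, assuming simplified stand-in types; the real nvFuser `LaunchParams` and analysis entry points have different signatures.

```cpp
// Sketch only: decide whether a warp reduction can be emitted when bdimx is
// not a compile-time constant, by falling back to the launch params that were
// threaded into GpuLower. All names here are simplified stand-ins.
#include <cstdint>
#include <optional>

struct LaunchParamsSketch {
  std::optional<int64_t> bdimx;  // TIDx extent, if bound at compile time
};

bool canUseWarpReduction(
    std::optional<int64_t> static_bdimx,  // extent if statically known in IR
    const LaunchParamsSketch& lparams,
    int64_t warp_size = 32) {
  // Prefer the statically known extent; otherwise consult the launch params
  // stored on GpuLower (this fallback is what the PR enables).
  std::optional<int64_t> bdimx = static_bdimx ? static_bdimx : lparams.bdimx;
  if (!bdimx.has_value()) {
    // Nothing known about TIDx: stay conservative, no warp reduction.
    return false;
  }
  // A TIDx extent that is a multiple of the warp size keeps the
  // warp-shuffle reduction path valid.
  return *bdimx % warp_size == 0;
}
```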
12 files reviewed, 1 comment
```cpp
EXPECT_EQ(numRuntimes(), 1)
    << "Same dimensions should reuse the existing kernel";

FusionKernelRuntime* second_runtime =
    executor_cache.getMostRecentKernelRuntime();
EXPECT_EQ(first_runtime, second_runtime)
    << "Should reuse the same runtime for identical shapes";
```
style: the comments are misleading here. Line 2370 says "Same dimensions should reuse the existing kernel" but the shapes are different ({2048, 4096} vs {2048 + 8, 4096}). Line 2376 says "Should reuse the same runtime for identical shapes" but the shapes are not identical. Consider clarifying that the outer dimension changes but should still reuse the kernel due to dynamic shape support.
Suggested change:

```diff
-EXPECT_EQ(numRuntimes(), 1)
-    << "Same dimensions should reuse the existing kernel";
+EXPECT_EQ(numRuntimes(), 1)
+    << "Different outer dimension should reuse the existing kernel due to dynamic shape support";
 FusionKernelRuntime* second_runtime =
     executor_cache.getMostRecentKernelRuntime();
-EXPECT_EQ(first_runtime, second_runtime)
-    << "Should reuse the same runtime for identical shapes";
+EXPECT_EQ(first_runtime, second_runtime)
+    << "Should reuse the same runtime despite different outer dimension";
```
…alized persistent kernel: force-pushed from 3993211 to 3b68cf3 (Compare)
!test
Greptile Overview
Greptile Summary
This PR enables dynamic shape support in the warp specialized inner-outer persistent scheduler by threading LaunchParams through the compilation pipeline to GpuLower.
- Core mechanism: The PR passes launch parameters from `KernelExecutor::compile()` through `CompiledKernel` to `GpuLower`, making them accessible during lowering analysis (see the sketch after this list)
- Dynamic shape detection: `FusedReduction` analysis now checks `GpuLower::launchParams()` to determine if `bdimx` is statically known, enabling warp reduction optimizations even when tensor dimensions are symbolic
- Register sharing: `ParallelDimensionMap::getThreadCountInDim()` now consults launch parameters for dynamic dimensions, allowing register sharing when thread counts are known at compile time
- Scheduler change: Replaced the static `inner_parallel_static()` split with a dynamic split + `padToMultipleOfWarp()` for TIDx in the TMA warp specialized path
- Test updates: Converted tests from concrete to symbolic tensors to validate dynamic shape support, added a kernel reuse test
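A minimal sketch of the plumbing in the first bullet, with heavily simplified constructor signatures; the real `KernelExecutor`, `CompiledKernel`, and `GpuLower` take many more arguments, and only the `LaunchParams` threading is shown.

```cpp
// Sketch only: thread launch params from the executor down to lowering so
// that lowering passes can query them. Class and parameter names are
// simplified stand-ins for the real nvFuser types.
#include <utility>

struct LaunchParamsSketch {
  long bdimx = -1;  // -1 means "not bound"
};

class GpuLowerSketch {
 public:
  explicit GpuLowerSketch(LaunchParamsSketch lparams)
      : lparams_(std::move(lparams)) {}

  // Lowering analyses (fused reduction, parallel dimension map, ...) call
  // this to see the launch params that were known at compile time.
  const LaunchParamsSketch& launchParams() const { return lparams_; }

 private:
  LaunchParamsSketch lparams_;  // stored for the whole lowering pipeline
};

class CompiledKernelSketch {
 public:
  // CompiledKernel receives the launch params and forwards them to GpuLower.
  explicit CompiledKernelSketch(LaunchParamsSketch lparams)
      : lower_(std::move(lparams)) {}

 private:
  GpuLowerSketch lower_;
};

// KernelExecutor::compile(launch_constraints) would construct the compiled
// kernel with the launch params it has already inferred for this compilation.
CompiledKernelSketch compileSketch(LaunchParamsSketch launch_constraints) {
  return CompiledKernelSketch(std::move(launch_constraints));
}
```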
Issues found:
- Comment formatting broken in `csrc/scheduler/reduction_utils.cpp` lines 137-142
- Typo "uisng" → "using" in test comment
- Misleading test comment claiming "same dimensions" when dimensions actually differ
Confidence Score: 4/5
- This PR is safe to merge with minor formatting fixes needed
- The implementation follows a clean architectural pattern by threading launch params through the compilation stack. The logic changes are sound and enable an important optimization. However, formatting issues in comments (syntax errors) require fixing before merge, preventing a score of 5.
- Fix syntax errors in `csrc/scheduler/reduction_utils.cpp` lines 137-142 before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| csrc/device_lower/analysis/fused_reduction.cpp | 4/5 | Enhanced warp reduction detection to support dynamic shapes by checking launch params when bdimx is not statically known |
| csrc/parallel_dimension_map.cpp | 4/5 | Updated getThreadCountInDim to use launch params for dynamic dimensions when available, enabling register sharing |
| csrc/runtime/compiled_kernel.cpp | 5/5 | Added LaunchParams parameter to constructors and passed it to GpuLower for dynamic shape support |
| csrc/scheduler/reduction_utils.cpp | 3/5 | Replaced static split with dynamic split and padToMultipleOfWarp() for TIDx parallelization, has formatting issues in comments |
| tests/cpp/test_persistent_buffer.cpp | 3/5 | Mixed changes: switched tensors between concrete/symbolic, added kernel reuse test with misleading comment on line 2371 |
Sequence Diagram
```mermaid
sequenceDiagram
participant Executor as KernelExecutor
participant CK as CompiledKernel
participant GL as GpuLower
participant FR as FusedReduction
participant PDM as ParallelDimensionMap
participant Sched as ScheduleReductionTV
Note over Executor: compile() called with launch_constraints
Executor->>CK: new CompiledKernel(fusion, compile_params, launch_constraints)
CK->>GL: new GpuLower(fusion, compile_params, launch_params)
Note over GL: Store launch_params in lparams_ member
GL->>FR: Analyze fused reductions
FR->>GL: Check launchParams().hasDim(TIDx)
Note over FR: If bdimx available in launch params,<br/>use it to validate warp reduction
FR->>FR: Determine if static warp reduction is valid
GL->>Sched: scheduleReductionTV() for TMA warp specialized
Note over Sched: Use dynamic split + padToMultipleOfWarp()<br/>instead of static split
GL->>PDM: getThreadCountInDim(ParallelType)
PDM->>GL: Check launchParams().hasDim(pt)
Note over PDM: Return actual launch param value<br/>for dynamic dimensions when available
PDM-->>GL: Return thread count or -1
Note over GL: Lowering complete with dynamic shape support
```
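The ParallelDimensionMap leg of the diagram, returning the launch-param value for a dynamic dimension or -1, reduces to a lookup along these lines. This is a sketch with simplified stand-in types; the real `getThreadCountInDim()` handles more cases.

```cpp
// Sketch only: resolve the thread count of a parallel dimension. If the
// extent is not a compile-time constant, fall back to the launch params
// stored on GpuLower; return -1 when nothing is known.
#include <cstdint>
#include <optional>

enum class ParallelTypeSketch { TIDx, TIDy, TIDz };

struct LaunchParamsSketch {
  std::optional<int64_t> tidx;  // only TIDx is modeled in this sketch
  bool hasDim(ParallelTypeSketch pt) const {
    return pt == ParallelTypeSketch::TIDx && tidx.has_value();
  }
  int64_t getDim(ParallelTypeSketch pt) const {
    return pt == ParallelTypeSketch::TIDx ? tidx.value() : -1;
  }
};

int64_t getThreadCountInDimSketch(
    ParallelTypeSketch pt,
    std::optional<int64_t> constant_extent,  // known from the IR, if any
    const LaunchParamsSketch& lparams) {
  if (constant_extent.has_value()) {
    return *constant_extent;  // static case, unchanged by this PR
  }
  if (lparams.hasDim(pt)) {
    // Dynamic dimension, but its value was bound at compile time: this is
    // what enables register sharing for warp specialization.
    return lparams.getDim(pt);
  }
  return -1;  // genuinely unknown; callers must stay conservative
}
```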
```cpp
// // static bdimx is required for TMA warp specialization
// int64_t compute_bdimx = getComputeBdimx(option,
// rparams->lparams.bdimx()); inner_parallel_static(inner_reduce_axis,
// ParallelType::TIDx, compute_bdimx);
```
Comment formatting is broken - line breaks should be after the comment prefix.
Suggested change:

```diff
-// // static bdimx is required for TMA warp specialization
-// int64_t compute_bdimx = getComputeBdimx(option,
-// rparams->lparams.bdimx()); inner_parallel_static(inner_reduce_axis,
-// ParallelType::TIDx, compute_bdimx);
+// static bdimx is required for TMA warp specialization
+// int64_t compute_bdimx = getComputeBdimx(option,
+// rparams->lparams.bdimx());
+// inner_parallel_static(inner_reduce_axis, ParallelType::TIDx, compute_bdimx);
```
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
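For context, the dynamic path that replaces this commented-out static split looks roughly like the sketch below. It uses nvFuser's `TensorView` scheduling calls (assuming the usual nvFuser scheduler declarations are in scope), but the choice of axis and the use of `batches_per_block_inner_reduction` as the split factor are placeholders inferred from the review summary, not the scheduler's exact code.

```cpp
// Sketch only: instead of splitting by a statically known compute_bdimx,
// split by the per-block persistent batch and pad the TIDx domain to a warp
// multiple, so its extent may stay symbolic.
void scheduleInnerReductionSketch(
    TensorView* reduction_tv,
    int64_t inner_reduce_axis,
    int64_t batches_per_block_inner_reduction) {
  // [R] -> [R / batches_per_block, batches_per_block]
  reduction_tv->split(inner_reduce_axis, batches_per_block_inner_reduction);
  // The outer (dynamic) part of the split becomes TIDx ...
  reduction_tv->axis(inner_reduce_axis)->parallelize(ParallelType::TIDx);
  // ... and is padded so warp-level reductions stay legal even when the
  // extent is only known at launch time.
  reduction_tv->axis(inner_reduce_axis)->padToMultipleOfWarp();
}
```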
```cpp
// rparams->lparams.bdimx()); inner_parallel_static(inner_reduce_axis,
// ParallelType::TIDx, compute_bdimx);

/ Iteration: [I/Unroll/BIDy, BIDy, Unroll]
```
Wrong comment prefix - should be // not /.
Suggested change:

```diff
-/ Iteration: [I/Unroll/BIDy, BIDy, Unroll]
+// Iteration: [I/Unroll/BIDy, BIDy, Unroll]
```
```cpp
auto tv0 = makeContigConcreteTensor({dim0, dim1}, dtype);
auto tv1 = makeContigConcreteTensor({dim0, dim1}, dtype);
// For case contig_1_dtype_float_batch_2048_hidden_8192
// the performance is 59.7% SOL uisng concrete inputs
```
Typo: 'uisng' should be 'using'.
Suggested change:

```diff
-// the performance is 59.7% SOL uisng concrete inputs
+// the performance is 59.7% SOL using concrete inputs
```
```cpp
EXPECT_EQ(numRuntimes(), 1)
    << "Same dimensions should reuse the existing kernel";
```
Comment says "Same dimensions should reuse" but the test uses different outer dimension (2048 + 8 vs 2048) - comment is misleading.
!test
No description provided.