Collaborator (Author): !test
Collaborator (Author): !test
Contributor (Greptile Summary)

This PR implements an automatic TMA (Tensor Memory Accelerator) transpose scheduler that optimizes transpose operations by choosing between the input and output shared-memory transpose paths based on the number of inputs versus outputs. Key changes:
Minor issues found:
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start([TMA Transpose Scheduler Entry]) --> CheckEnabled{TmaTranspose<br/>enabled?}
    CheckEnabled -->|No| Fallback[Use non-TMA<br/>transpose scheduler]
    CheckEnabled -->|Yes| CountIO[Count inputs<br/>and outputs]
    CountIO --> CompareIO{n_input > n_output?}
    CompareIO -->|Yes| OutputPath[Output Smem Transpose Path]
    CompareIO -->|No| InputPath[Input Smem Transpose Path]
    OutputPath --> OutputConfig[use_tma_load = true<br/>use_tma_store = true<br/>is_output_smem_transpose = true]
    OutputConfig --> OutputFlow[TMA load inputs<br/>without swizzle<br/>→ Register ops<br/>→ TMA store to<br/>swizzled output smem]
    InputPath --> InputConfig[use_tma_load = true<br/>use_tma_store = false<br/>is_output_smem_transpose = false]
    InputConfig --> InputFlow[TMA load to<br/>swizzled input smem<br/>→ Register ops<br/>→ Regular store to output]
    OutputFlow --> Schedule[Schedule: tile dimensions,<br/>parallelize, vectorize]
    InputFlow --> Schedule
    Schedule --> End([Execute Kernel])
    Fallback --> End
```
Last reviewed commit: 6963ca5
```cpp
NVF_ERROR(grouped_inputs_outputs.size() >= 2);

// When there are more inputs than outputs, output smem transpose should be
// used, however, if it is not, then input smem tranpose will be used, to
```
Contributor: "tranpose" should be "transpose".
```cpp
const int64_t cta_per_sm =
    dev_props->maxThreadsPerMultiProcessor / threads_per_cta;
const int64_t bytes_per_cta = bytes_per_sm / cta_per_sm;
const int64_t bytes_per_tile = bytes_per_cta / n_input;
```
Contributor: Add a check that n_input > 0 before this division. While the scheduler validation should prevent this, defensive programming would make the code more robust.
Suggested change:

```diff
-const int64_t bytes_per_tile = bytes_per_cta / n_input;
+NVF_ERROR(n_input > 0, "Expected at least one TensorView input for transpose");
+const int64_t bytes_per_tile = bytes_per_cta / n_input;
```
To reduce the number of transpose ops, is_output_smem_transpose is added to control input/output transpose:

1. When there are more inputs than outputs, is_output_smem_transpose = true: TMA load without swizzle, TMA store with swizzle, transpose at regs --> output cached smem.
2. When there are fewer inputs than outputs, is_output_smem_transpose = false: TMA load with swizzle, register store, transpose at input cached smem --> regs.

Current performance is in this doc.