
add auto tma transpose scheduler#6018

Open
liqiangxl wants to merge 2 commits into main from llu/transpose_output_smem_auto

Conversation

@liqiangxl
Collaborator

To reduce the number of transpose ops, is_output_smem_transpose is added to control whether the transpose happens on the input or the output side:

1. When there are more inputs than outputs, is_output_smem_transpose = True:
   TMA load without swizzle, TMA store with swizzle, transpose at regs --> output cached smem

2. When there are fewer inputs than outputs, is_output_smem_transpose = False:
   TMA load with swizzle, register store, transpose at input cached smem --> regs
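The selection rule above can be sketched as follows (illustrative only; `isOutputSmemTranspose` is a hypothetical helper name, not the actual function in this PR):

```cpp
#include <cstdint>

// Transpose on the side with fewer tensors, so fewer smem buffers
// need swizzled TMA access.
bool isOutputSmemTranspose(int64_t n_input, int64_t n_output) {
  // More inputs than outputs: load inputs unswizzled, swizzle only the
  // (fewer) output smem buffers, transpose regs --> output cached smem.
  // Otherwise: swizzle the input smem buffers and transpose
  // input cached smem --> regs.
  return n_input > n_output;
}
```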

Current performance is in this doc.

@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl marked this pull request as ready for review February 27, 2026 15:40
@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR implements an automatic TMA (Tensor Memory Accelerator) transpose scheduler that optimizes transpose operations by choosing between input and output shared memory transpose paths based on the number of inputs vs outputs.

Key changes:

  • Added auto-selection logic: when n_input > n_output, use output smem transpose (fewer outputs to swizzle); otherwise use input smem transpose (fewer inputs to swizzle)
  • Implemented complete TMA load/store scheduling with proper swizzling, tiling, and parallelization
  • Added new heuristic parameters: use_tma_store, is_output_smem_transpose, chunks_per_thread, elements_per_chunk
  • Gated feature behind EnableOption::TmaTranspose for controlled rollout
  • Comprehensive test coverage for both transpose paths with multiple data types and dimensions
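A minimal sketch of how the new heuristic parameters might hang together (the field names are taken from the summary above; the struct shape and defaults are assumptions, not the PR's actual `TransposeParams` definition):

```cpp
#include <cstdint>

// Assumed container for the TMA-transpose heuristic parameters
// listed above, with the equality check the heuristic header adds.
struct TmaTransposeParams {
  bool use_tma_load = true;
  bool use_tma_store = false;          // true on the output-smem path
  bool is_output_smem_transpose = false;
  int64_t chunks_per_thread = 1;
  int64_t elements_per_chunk = 1;

  bool operator==(const TmaTransposeParams& o) const {
    return use_tma_load == o.use_tma_load &&
           use_tma_store == o.use_tma_store &&
           is_output_smem_transpose == o.is_output_smem_transpose &&
           chunks_per_thread == o.chunks_per_thread &&
           elements_per_chunk == o.elements_per_chunk;
  }
};
```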

Minor issues found:

  • Typo in comment: "tranpose" → "transpose" at csrc/scheduler/transpose_tma.cpp:185
  • Consider adding defensive check for division by zero at line 94 (though scheduler validation should prevent this)

Confidence Score: 4/5

  • This PR is safe to merge with only minor issues found
  • The implementation is well-structured with comprehensive test coverage, proper error handling, and sound transpose optimization logic. Only found one typo and one defensive programming suggestion. The feature is properly gated behind an enable option for controlled rollout.
  • Pay attention to csrc/scheduler/transpose_tma.cpp for the typo fix

Important Files Changed

Filename Overview
csrc/scheduler/transpose_tma.cpp Implements auto TMA transpose scheduler with heuristics and scheduling logic for both input and output shared memory transpose paths
csrc/scheduler/transpose_heuristic.h Added new parameters for TMA transpose: use_tma_store, is_output_smem_transpose, chunks_per_thread, elements_per_chunk with proper equality checks and hashing
csrc/scheduler/transpose.cpp Added TmaTranspose option gate and updated scheduling condition to check both use_tma_load and use_tma_store

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start([TMA Transpose Scheduler Entry]) --> CheckEnabled{TmaTranspose<br/>enabled?}
    CheckEnabled -->|No| Fallback[Use non-TMA<br/>transpose scheduler]
    CheckEnabled -->|Yes| CountIO[Count inputs<br/>and outputs]
    CountIO --> CompareIO{n_input > n_output?}
    CompareIO -->|Yes| OutputPath[Output Smem Transpose Path]
    CompareIO -->|No| InputPath[Input Smem Transpose Path]
    
    OutputPath --> OutputConfig[use_tma_load = true<br/>use_tma_store = true<br/>is_output_smem_transpose = true]
    OutputConfig --> OutputFlow[TMA load inputs<br/>without swizzle<br/>→ Register ops<br/>→ TMA store to<br/>swizzled output smem]
    
    InputPath --> InputConfig[use_tma_load = true<br/>use_tma_store = false<br/>is_output_smem_transpose = false]
    InputConfig --> InputFlow[TMA load to<br/>swizzled input smem<br/>→ Register ops<br/>→ Regular store to output]
    
    OutputFlow --> Schedule[Schedule: tile dimensions,<br/>parallelize, vectorize]
    InputFlow --> Schedule
    Schedule --> End([Execute Kernel])
    Fallback --> End

Last reviewed commit: 6963ca5


@greptile-apps bot left a comment


7 files reviewed, 2 comments


NVF_ERROR(grouped_inputs_outputs.size() >= 2);

// When there are more inputs than outputs, output smem transpose should be
// used, however, if it is not, then input smem tranpose will be used, to

tranpose should be transpose

const int64_t cta_per_sm =
dev_props->maxThreadsPerMultiProcessor / threads_per_cta;
const int64_t bytes_per_cta = bytes_per_sm / cta_per_sm;
const int64_t bytes_per_tile = bytes_per_cta / n_input;

Add check that n_input > 0 before this division. While the scheduler validation should prevent this, defensive programming would make the code more robust.

Suggested change
const int64_t bytes_per_tile = bytes_per_cta / n_input;
NVF_ERROR(n_input > 0, "Expected at least one TensorView input for transpose");
const int64_t bytes_per_tile = bytes_per_cta / n_input;

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
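To make the quoted occupancy math concrete, here is a worked sketch of that computation with the suggested guard folded in (function name and the example numbers are assumptions for illustration, not values from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the quoted tile-sizing arithmetic: divide the SM's smem budget
// across resident CTAs, then across the input tensors.
int64_t bytesPerTile(int64_t max_threads_per_sm, int64_t threads_per_cta,
                     int64_t bytes_per_sm, int64_t n_input) {
  // Defensive checks, per the review suggestion above.
  assert(threads_per_cta > 0 && n_input > 0);
  const int64_t cta_per_sm = max_threads_per_sm / threads_per_cta;
  const int64_t bytes_per_cta = bytes_per_sm / cta_per_sm;
  return bytes_per_cta / n_input;
}
```

For example, with 2048 max threads per SM, 256 threads per CTA, and a 228 KiB smem budget, one input tensor gets a 29184-byte tile; two inputs halve that.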

@liqiangxl liqiangxl requested a review from rdspring1 February 27, 2026 17:24