Collaborator (Author): !test
Collaborator (Author): !test
Contributor (Greptile Summary)

This PR implements an automatic TMA (Tensor Memory Accelerator) transpose scheduler that optimizes transpose operations by choosing between the input and output shared-memory transpose paths based on the number of inputs versus outputs. Key changes:
Minor issues found:
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start([TMA Transpose Scheduler Entry]) --> CheckEnabled{TmaTranspose<br/>enabled?}
    CheckEnabled -->|No| Fallback[Use non-TMA<br/>transpose scheduler]
    CheckEnabled -->|Yes| CountIO[Count inputs<br/>and outputs]
    CountIO --> CompareIO{n_input > n_output?}
    CompareIO -->|Yes| OutputPath[Output Smem Transpose Path]
    CompareIO -->|No| InputPath[Input Smem Transpose Path]
    OutputPath --> OutputConfig[use_tma_load = true<br/>use_tma_store = true<br/>is_output_smem_transpose = true]
    OutputConfig --> OutputFlow[TMA load inputs<br/>without swizzle<br/>→ Register ops<br/>→ TMA store to<br/>swizzled output smem]
    InputPath --> InputConfig[use_tma_load = true<br/>use_tma_store = false<br/>is_output_smem_transpose = false]
    InputConfig --> InputFlow[TMA load to<br/>swizzled input smem<br/>→ Register ops<br/>→ Regular store to output]
    OutputFlow --> Schedule[Schedule: tile dimensions,<br/>parallelize, vectorize]
    InputFlow --> Schedule
    Schedule --> End([Execute Kernel])
    Fallback --> End
```
Last reviewed commit: 6963ca5
```cpp
NVF_ERROR(grouped_inputs_outputs.size() >= 2);

// When there are more inputs than outputs, output smem transpose should be
// used, however, if it is not, then input smem tranpose will be used, to
```
Contributor: "tranpose" should be "transpose".
```cpp
const int64_t cta_per_sm =
    dev_props->maxThreadsPerMultiProcessor / threads_per_cta;
const int64_t bytes_per_cta = bytes_per_sm / cta_per_sm;
const int64_t bytes_per_tile = bytes_per_cta / n_input;
```
Contributor: Add a check that n_input > 0 before this division. While the scheduler validation should prevent this, defensive programming would make the code more robust.
Suggested change:

```diff
-const int64_t bytes_per_tile = bytes_per_cta / n_input;
+NVF_ERROR(n_input > 0, "Expected at least one TensorView input for transpose");
+const int64_t bytes_per_tile = bytes_per_cta / n_input;
```
To reduce the number of transpose ops, is_output_smem_transpose is added to control input/output transpose:

1. When there are more inputs than outputs, is_output_smem_transpose = true: TMA load without swizzle, TMA store with swizzle, transpose at regs --> output cached smem.
2. When there are fewer inputs than outputs, is_output_smem_transpose = false: TMA load with swizzle, register store, transpose at input cached smem --> regs.

Current performance is in this doc.