[Multidevice] TMA bulk copy p2p runtime examples #6011
samnordmann wants to merge 5 commits into main
Review updated until commit ae0c760
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Kernel Robustness
Greptile Summary: Added Hopper TMA (
Confidence Score: 5/5
Sequence Diagram:

```mermaid
sequenceDiagram
    participant T0 as Thread 0
    participant SMEM as Shared Memory
    participant MBAR as MBarrier
    participant GMEM_SRC as GMEM(src)
    participant GMEM_DST as GMEM(dst)
    T0->>MBAR: mbarrier.init(arrival_count=1)
    T0->>T0: fence.mbarrier_init.release.cluster
    T0->>T0: __syncwarp()
    Note over T0,GMEM_SRC: TMA Load Phase
    T0->>MBAR: mbarrier.arrive.expect_tx(num_bytes)
    T0->>GMEM_SRC: cp.async.bulk (TMA load)
    GMEM_SRC-->>SMEM: async data transfer
    SMEM->>MBAR: complete_tx notification
    T0->>MBAR: mbarrier.try_wait.parity (spin until done)
    MBAR-->>T0: phase flip (load complete)
    Note over T0,GMEM_DST: TMA Store Phase
    T0->>SMEM: cp.async.bulk.global.shared
    SMEM-->>GMEM_DST: async data transfer
    T0->>T0: cp.async.bulk.commit_group
    T0->>T0: cp.async.bulk.wait_group.read 0
    T0->>MBAR: mbarrier.inval
```
Last reviewed commit: ae0c760
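The two-phase flow in the sequence diagram above can be sketched as a CUDA kernel with inline PTX. This is a minimal sketch, assuming Hopper (sm_90a) and PTX ISA 8.0+; the kernel name, buffer name, and fixed size (`tma_copy_sketch`, `smem_buf`, `kSmemBytes`) are illustrative and are not taken from the PR:

```cuda
// Sketch of a single-warp, two-phase TMA bulk copy:
// GMEM(src) --[TMA load]--> SMEM --[TMA store]--> GMEM(dst).
// Thread 0 issues all TMA operations; an mbarrier tracks load completion.
#include <cstdint>

constexpr uint32_t kSmemBytes = 4096;  // illustrative staging-buffer size

__global__ void tma_copy_sketch(void* dst, const void* src) {
  __shared__ alignas(128) char smem_buf[kSmemBytes];
  __shared__ uint64_t mbar;

  const uint32_t smem_addr =
      static_cast<uint32_t>(__cvta_generic_to_shared(smem_buf));
  const uint32_t mbar_addr =
      static_cast<uint32_t>(__cvta_generic_to_shared(&mbar));

  if (threadIdx.x == 0) {
    // Initialize the mbarrier with an arrival count of 1.
    asm volatile("mbarrier.init.shared::cta.b64 [%0], 1;" ::"r"(mbar_addr));
    asm volatile("fence.mbarrier_init.release.cluster;" ::: "memory");
  }
  __syncwarp();

  if (threadIdx.x == 0) {
    // Arrive and declare the expected transaction byte count.
    uint64_t state;
    asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 %0, [%1], %2;"
                 : "=l"(state)
                 : "r"(mbar_addr), "r"(kSmemBytes));
    // TMA load: GMEM -> SMEM; completion is signaled on the mbarrier.
    asm volatile(
        "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes "
        "[%0], [%1], %2, [%3];" ::"r"(smem_addr),
        "l"(reinterpret_cast<uint64_t>(src)), "r"(kSmemBytes), "r"(mbar_addr)
        : "memory");
    // Spin on the mbarrier phase (parity 0) until the load completes.
    uint32_t done = 0;
    while (!done) {
      asm volatile(
          "{\n"
          ".reg .pred p;\n"
          "mbarrier.try_wait.parity.shared::cta.b64 p, [%1], 0;\n"
          "selp.b32 %0, 1, 0, p;\n"
          "}"
          : "=r"(done)
          : "r"(mbar_addr));
    }
    // TMA store: SMEM -> GMEM, tracked by a bulk async group.
    asm volatile(
        "cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;"
        ::"l"(reinterpret_cast<uint64_t>(dst)), "r"(smem_addr), "r"(kSmemBytes)
        : "memory");
    asm volatile("cp.async.bulk.commit_group;");
    asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
    // Invalidate the mbarrier before the kernel exits.
    asm volatile("mbarrier.inval.shared::cta.b64 [%0];" ::"r"(mbar_addr));
  }
}
```

Launched as `<<<1, 32>>>`, only thread 0 does real work; the rest of the warp exists so `__syncwarp()` orders the mbarrier init before the load phase.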
!test
They are mostly just wrappers around some PTX instructions. We could add IR nodes to the Kernel IR and still use them for simpler final codegen. The overall design philosophy is to generate Kernel IR that explicitly represents the final CUDA kernel and to minimize the logic necessary in codegen.
OK regarding codegen; however, this PR is not about codegen. The present TMA kernel is used as a "host op" to perform inter-GPU comms, similarly to a `cudaMemcpyAsync`. This PR provides a reference implementation, and the next one adds this transport as a possible p2p backend. I am not sure I understand -- are you OK with the PR's current implementation, or do you suggest something else?
What
Add a Hopper TMA (`cp.async.bulk`) copy kernel in `csrc/multidevice/tma_copy.cu` and validate it across three memory source/destination types. These behaviors are demonstrated through three unit tests in `tests/cpp/test_multidevice_tma.cpp`. The tests reuse the `SymmetricTensor` abstraction for VMM allocation, IPC handle exchange, and multicast setup, keeping the test bodies focused on the TMA transfer itself.

Why
The CUDA backend for multi-device communication (`csrc/multidevice/cuda_p2p.cpp`) currently uses SM-based copies (regular threads load/store or `multimem.st`) and copy-engine copies (`cudaMemcpyAsync`/`cudaMemcpyBatchAsync`). TMA offers a third transport option that is GPU-initiated, lightweight (single-thread issue), fully asynchronous, and frees SM resources for overlapping compute. This transport is leveraged by DeepEP for intra-node MoE dispatch. This PR validates that TMA works correctly on the memory types used by nvFuser's multi-device infrastructure.

This lays the groundwork for a follow-up PR that integrates TMA as a transport option for P2P and multicast communications, alongside the existing SM-based and copy-engine transports.
How
The kernel is implemented in `csrc/multidevice/tma_copy.cu`. It is a single-warp kernel where thread 0 performs a two-phase TMA transfer through shared memory (`GMEM(src) --[TMA load]--> SMEM --[TMA store]--> GMEM(dst)`), using `mbarrier` for async completion tracking. TMA is a GMEM-SMEM engine -- there is no GMEM-to-GMEM variant, so shared memory staging is inherent to the hardware.

The kernel source sits alongside the other runtime kernels (the `alltoallv.cu` and `multicast.cu` kernels used by `cuda_p2p.cpp`, and other kernels in `runtime/`) and is stringified at build time through the existing `NVFUSER_RUNTIME_FILES` pipeline.
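As a "host op" transport, such a kernel would be enqueued on a stream much like `cudaMemcpyAsync`. A minimal usage sketch with hypothetical names (`tma_copy_kernel`, `tmaCopyAsync` are stand-ins, not the PR's actual API):

```cuda
// Host-side sketch: issuing a single-warp TMA copy kernel as a stream-ordered
// "host op", analogous to cudaMemcpyAsync. `tma_copy_kernel` is a hypothetical
// name standing in for the kernel built from csrc/multidevice/tma_copy.cu.
#include <cuda_runtime.h>

__global__ void tma_copy_kernel(void* dst, const void* src);  // defined elsewhere

// Enqueue a GPU-initiated p2p copy on `stream`. `dst` may be a peer GPU's
// buffer mapped into this device's address space (e.g. via IPC or VMM).
void tmaCopyAsync(void* dst, const void* src, cudaStream_t stream) {
  // A single warp suffices: thread 0 issues the TMA load/store; the other
  // threads only participate in __syncwarp().
  tma_copy_kernel<<<1, 32, 0, stream>>>(dst, src);
}
```

Because the transfer is issued by one thread and carried out by the TMA engine, it occupies almost no SM resources, which is what makes it attractive for overlapping communication with compute.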