Let's make Iris' All-reduce the best! #204

neoblizz · 2025-10-09T05:10:03Z

This pull request introduces a new distributed GEMM benchmark example using all-reduce atomics, refactors and simplifies the matmul wrapper, and adds a Triton kernel for persistent GEMM with all-reduce atomics. The changes improve clarity, usability, and maintainability of the distributed GEMM benchmarking code.

New distributed GEMM benchmark and kernel:

Added a new distributed GEMM benchmark script benchmark.py for the all-reduce atomics example, supporting validation, benchmarking, and tile tracing, with comprehensive argument parsing and logging.
Added a new Triton kernel persistent_gemm_all_reduce in gemm_all_reduce_atomics.py implementing persistent GEMM with all-reduce atomics across ranks, including timestamp tracing and atomic operations for distributed accumulation.

Refactoring and simplification of matmul wrapper:

Refactored matmul_wrapper.py to remove unused parameters and logic related to streamK and synchronization primitives, simplifying function signatures and internal calculations. [1] [2] [3] [4] [5] [6]
Replaced references to the previous kernel import (gemm_atomics_all_reduce) with the new gemm_all_reduce_atomics kernel import in matmul_wrapper.py.
Moved the matmul wrapper and related files from the old directory to examples/08_gemm_all_reduce_atomics/, updating the file structure for clarity. [1] [2]

These changes collectively provide a cleaner, more maintainable, and feature-rich distributed GEMM benchmarking example using Triton and atomics.

Copilot

Pull Request Overview

This PR refactors the Iris all-reduce implementation by introducing a ring-based all-reduce algorithm alongside the existing atomic-based approach. The changes reorganize existing examples into separate directories and implement a cleaner ring-based communication pattern.

Splits the atomic-based all-reduce into its own directory (08_gemm_all_reduce_atomics)
Creates a new ring-based all-reduce implementation (15_gemm_all_reduce_ring_based)
Simplifies the API by removing many configuration parameters and hardcoding architecture-specific values

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`examples/15_gemm_all_reduce_ring_based/matmul_wrapper.py`	Updates wrapper to use ring-based kernel with simplified parameter interface
`examples/15_gemm_all_reduce_ring_based/gemm_all_reduce_ring_based.py`	New ring-based all-reduce kernel implementation using lock-synchronized communication
`examples/15_gemm_all_reduce_ring_based/benchmark.py`	Benchmark script for the new ring-based implementation
`examples/08_gemm_atomics_all_reduce/gemm_atomics_all_reduce.py`	Removes the original atomic-based implementation
`examples/08_gemm_all_reduce_atomics/matmul_wrapper.py`	New wrapper for the refactored atomic-based implementation
`examples/08_gemm_all_reduce_atomics/gemm_all_reduce_atomics.py`	Simplified atomic-based all-reduce implementation
`examples/08_gemm_all_reduce_atomics/benchmark.py`	Updated benchmark script for atomic-based approach

Comments suppressed due to low confidence (1)

examples/15_gemm_all_reduce_ring_based/gemm_all_reduce_ring_based.py:1

Hardcoded architecture-specific values should be replaced with a proper architecture detection and configuration system. Consider creating a lookup table or configuration class based on the arch parameter.

# SPDX-License-Identifier: MIT

Copilot · 2025-10-09T06:03:08Z

examples/15_gemm_all_reduce_ring_based/gemm_all_reduce_ring_based.py

+            # 2) Wait for PREV rank to signal our local flag for this tile
+            #    Spin; single-lane uniform load is fine here.
+            while tl.load(locks + tile_id, cache_modifier=".cv", volatile=True) != 1:
+                pass


This busy-wait loop could consume significant CPU cycles. Consider adding a small delay or using a more efficient synchronization mechanism to reduce CPU usage during the wait period.

Suggested change

pass

tl.device_sleep(100) # Sleep for 100 nanoseconds to reduce busy-wait CPU usage

Copilot · 2025-10-09T06:03:08Z

examples/15_gemm_all_reduce_ring_based/matmul_wrapper.py

+        # TODO: Use arch-specific values.
+        num_stages = 2
+        num_warps = 8
+        waves_per_eu = 0
+        mfma = 16
+        kpack = 1


This TODO comment indicates incomplete implementation. The hardcoded values should be replaced with architecture-specific configuration based on the arch parameter to ensure optimal performance across different GPU architectures.

neoblizz · 2025-10-09T06:05:48Z

@copilot Fix the following issue:

[Iris] [1/2] Validating...
[Iris] [0/2] Validating...
[Iris] [1/2] Max absolute difference: 646.0
[Iris] [0/2] Max absolute difference: 605.5
[Iris] [0/2] Mismatch at index (4641, 1056): C=45.90625, expected=55.25
[Iris] [0/2] Final C validation failed.
[Iris] [1/2] Mismatch at index (21, 3616): C=44.0, expected=22.28125
[Iris] [1/2] Final C validation failed.
[Iris] [1/2] Validation completed
[Iris] [1/2] Benchmarking...
[Iris] [0/2] Validation completed
[Iris] [0/2] Benchmarking...

Copilot

Pull Request Overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

examples/15_gemm_all_reduce_ring_based/benchmark.py

examples/08_gemm_all_reduce_atomics/benchmark.py

Copilot

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

examples/08_gemm_all_reduce_atomics/matmul_wrapper.py:1

The attribute names streamk_registers and streamk_spills are inconsistent with the new API that uses get_matmul_registers() and get_matmul_spills() methods. These should be updated to use the getter methods or the attributes should be renamed to match the new pattern.

# SPDX-License-Identifier: MIT

Copilot · 2025-10-09T07:01:59Z

examples/15_gemm_all_reduce_ring_based/benchmark.py

-        exit(1)
-
-    tile_completed = shmem.zeros((total_tiles,), device="cuda", dtype=torch.int32)
+    locks = shmem.zeros((total_tiles,), device="cuda", dtype=torch.int32)


The locks tensor is allocated with size total_tiles but should be allocated based on the number of tiles that will actually be processed by each rank. This could lead to memory waste or insufficient locks for proper synchronization.

Suggested change

locks = shmem.zeros((total_tiles,), device="cuda", dtype=torch.int32)

local_blocks_M = total_blocks_M

local_blocks_N = total_blocks_N

local_tiles = local_blocks_M * local_blocks_N

locks = shmem.zeros((local_tiles,), device="cuda", dtype=torch.int32)

Copilot · 2025-10-09T07:02:00Z

examples/15_gemm_all_reduce_ring_based/gemm_all_reduce_ring_based.py

+            tl.debug_barrier()
+
+            # Signal "ready" by setting NEXT rank's flag for this tile to 1
+            iris.atomic_xchg(locks + tile_id, 1, cur_rank, next_rank, heap_bases, sem="release", scope="sys")


The atomic exchange operation uses locks + tile_id but the locks array may not have sufficient elements for all possible tile_id values across different ranks, potentially causing out-of-bounds access.

…r with local data (#217) Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]>

…m for all_reduce (#225) Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]>

Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]> Co-authored-by: Muhammad Osama <[email protected]>

…dation failures (#228) Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]>

neoblizz and others added 2 commits October 9, 2025 04:56

Cleaned up All Reduce.

54ee909

Apply Ruff auto-fixes

25e204e

github-actions bot added in-progress We are working on it iris Iris project issue labels Oct 9, 2025

neoblizz and others added 4 commits October 9, 2025 05:46

Naming convention follows all_scatter

c4e03a0

Init impl.

425bc33

Initial impl.

00de58c

Apply Ruff auto-fixes

de4315c

neoblizz marked this pull request as ready for review October 9, 2025 06:02

neoblizz requested a review from mawad-amd as a code owner October 9, 2025 06:02

Copilot AI review requested due to automatic review settings October 9, 2025 06:02

neoblizz requested a review from BKP as a code owner October 9, 2025 06:02

Copilot AI reviewed Oct 9, 2025

View reviewed changes

neoblizz and others added 4 commits October 9, 2025 06:27

Proper acq/rel semantics.

a8acd4b

Merge.

7def2eb

Apply Ruff auto-fixes

d8386a7

Hoist num_xcds query (#205)

6a1f4e2

neoblizz requested a review from Copilot October 9, 2025 06:44

Copilot AI reviewed Oct 9, 2025

View reviewed changes

examples/15_gemm_all_reduce_ring_based/benchmark.py Outdated Show resolved Hide resolved

examples/08_gemm_all_reduce_atomics/benchmark.py Show resolved Hide resolved

neoblizz and others added 2 commits October 9, 2025 06:48

Fix reg/reg spill reporting.

236edec

Apply Ruff auto-fixes

1756006

neoblizz requested a review from Copilot October 9, 2025 07:01

Copilot AI reviewed Oct 9, 2025

View reviewed changes

neoblizz and others added 6 commits October 10, 2025 00:46

Functional.

53f25d8

Apply Ruff auto-fixes

d8ead1e

Merge.

7e05db7

Apply Ruff auto-fixes

5f5ae50

Fix correctness issue in persistent_all_reduce: initialize accumulato…

d7c24a5

…r with local data (#217) Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]>

2-kernel working.

ded7317

github-actions bot and others added 8 commits October 10, 2025 16:28

Apply Ruff auto-fixes

e1d3419

Separate all-reduce.

e6c7ecf

[Performance]: Implement optimized ring-based reduce-scatter algorith…

0178b16

…m for all_reduce (#225) Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]>

Implement all_reduce-only validation for example 16 (#221)

3288c2d

Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]> Co-authored-by: Muhammad Osama <[email protected]>

Apply Ruff auto-fixes

d3eb56a

Correct algo, almost functional.

3820e62

Apply Ruff auto-fixes

6797056

Fix race condition in ring-based all_reduce causing intermittent vali…

875838a

…dation failures (#228) Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: neoblizz <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Let's make Iris' All-reduce the best! #204

Let's make Iris' All-reduce the best! #204

neoblizz commented Oct 9, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 9, 2025

Uh oh!

Copilot AI Oct 9, 2025

Uh oh!

neoblizz commented Oct 9, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 9, 2025

Uh oh!

Copilot AI Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	pass
	tl.device_sleep(100) # Sleep for 100 nanoseconds to reduce busy-wait CPU usage

-    locks = shmem.zeros((total_tiles,), device="cuda", dtype=torch.int32)
+    local_blocks_M = total_blocks_M
+    local_blocks_N = total_blocks_N
+    local_tiles = local_blocks_M * local_blocks_N
+    locks = shmem.zeros((local_tiles,), device="cuda", dtype=torch.int32)

Let's make Iris' All-reduce the best! #204

Are you sure you want to change the base?

Let's make Iris' All-reduce the best! #204

Conversation

neoblizz commented Oct 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

neoblizz commented Oct 9, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

neoblizz commented Oct 9, 2025 •

edited

Loading