Skip to content

Conversation

neoblizz
Copy link
Member

@neoblizz neoblizz commented Oct 9, 2025

This pull request introduces a new distributed GEMM benchmark example using all-reduce atomics, refactors and simplifies the matmul wrapper, and adds a Triton kernel for persistent GEMM with all-reduce atomics. The changes improve clarity, usability, and maintainability of the distributed GEMM benchmarking code.

New distributed GEMM benchmark and kernel:

  • Added a new distributed GEMM benchmark script benchmark.py for the all-reduce atomics example, supporting validation, benchmarking, and tile tracing, with comprehensive argument parsing and logging.
  • Added a new Triton kernel persistent_gemm_all_reduce in gemm_all_reduce_atomics.py implementing persistent GEMM with all-reduce atomics across ranks, including timestamp tracing and atomic operations for distributed accumulation.

Refactoring and simplification of matmul wrapper:

  • Refactored matmul_wrapper.py to remove unused parameters and logic related to streamK and synchronization primitives, simplifying function signatures and internal calculations. [1] [2] [3] [4] [5] [6]
  • Replaced references to the previous kernel import (gemm_atomics_all_reduce) with the new gemm_all_reduce_atomics kernel import in matmul_wrapper.py.
  • Moved the matmul wrapper and related files from the old directory to examples/08_gemm_all_reduce_atomics/, updating the file structure for clarity. [1] [2]

These changes collectively provide a cleaner, more maintainable, and feature-rich distributed GEMM benchmarking example using Triton and atomics.

@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Oct 9, 2025
@neoblizz neoblizz marked this pull request as ready for review October 9, 2025 06:02
@neoblizz neoblizz requested a review from mawad-amd as a code owner October 9, 2025 06:02
@Copilot Copilot AI review requested due to automatic review settings October 9, 2025 06:02
@neoblizz neoblizz requested a review from BKP as a code owner October 9, 2025 06:02
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the Iris all-reduce implementation by introducing a ring-based all-reduce algorithm alongside the existing atomic-based approach. The changes reorganize existing examples into separate directories and implement a cleaner ring-based communication pattern.

  • Splits the atomic-based all-reduce into its own directory (08_gemm_all_reduce_atomics)
  • Creates a new ring-based all-reduce implementation (15_gemm_all_reduce_ring_based)
  • Simplifies the API by removing many configuration parameters and hardcoding architecture-specific values

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
examples/15_gemm_all_reduce_ring_based/matmul_wrapper.py Updates wrapper to use ring-based kernel with simplified parameter interface
examples/15_gemm_all_reduce_ring_based/gemm_all_reduce_ring_based.py New ring-based all-reduce kernel implementation using lock-synchronized communication
examples/15_gemm_all_reduce_ring_based/benchmark.py Benchmark script for the new ring-based implementation
examples/08_gemm_atomics_all_reduce/gemm_atomics_all_reduce.py Removes the original atomic-based implementation
examples/08_gemm_all_reduce_atomics/matmul_wrapper.py New wrapper for the refactored atomic-based implementation
examples/08_gemm_all_reduce_atomics/gemm_all_reduce_atomics.py Simplified atomic-based all-reduce implementation
examples/08_gemm_all_reduce_atomics/benchmark.py Updated benchmark script for atomic-based approach
Comments suppressed due to low confidence (1)

examples/15_gemm_all_reduce_ring_based/gemm_all_reduce_ring_based.py:1

  • Hardcoded architecture-specific values should be replaced with a proper architecture detection and configuration system. Consider creating a lookup table or configuration class based on the arch parameter.
# SPDX-License-Identifier: MIT

# 2) Wait for PREV rank to signal our local flag for this tile
# Spin; single-lane uniform load is fine here.
while tl.load(locks + tile_id, cache_modifier=".cv", volatile=True) != 1:
pass
Copy link

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This busy-wait loop could consume significant CPU cycles. Consider adding a small delay or using a more efficient synchronization mechanism to reduce CPU usage during the wait period.

Suggested change
pass
tl.device_sleep(100) # Sleep for 100 nanoseconds to reduce busy-wait CPU usage

Copilot uses AI. Check for mistakes.

Comment on lines +59 to +64
# TODO: Use arch-specific values.
num_stages = 2
num_warps = 8
waves_per_eu = 0
mfma = 16
kpack = 1
Copy link

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO comment indicates incomplete implementation. The hardcoded values should be replaced with architecture-specific configuration based on the arch parameter to ensure optimal performance across different GPU architectures.

Copilot uses AI. Check for mistakes.

@neoblizz
Copy link
Member Author

neoblizz commented Oct 9, 2025

@copilot Fix the following issue:

[Iris] [1/2] Validating...
[Iris] [0/2] Validating...
[Iris] [1/2] Max absolute difference: 646.0
[Iris] [0/2] Max absolute difference: 605.5
[Iris] [0/2] Mismatch at index (4641, 1056): C=45.90625, expected=55.25
[Iris] [0/2] Final C validation failed.
[Iris] [1/2] Mismatch at index (21, 3616): C=44.0, expected=22.28125
[Iris] [1/2] Final C validation failed.
[Iris] [1/2] Validation completed
[Iris] [1/2] Benchmarking...
[Iris] [0/2] Validation completed
[Iris] [0/2] Benchmarking...

@neoblizz neoblizz requested a review from Copilot October 9, 2025 06:44
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

@neoblizz neoblizz requested a review from Copilot October 9, 2025 07:01
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

examples/08_gemm_all_reduce_atomics/matmul_wrapper.py:1

  • The attribute names streamk_registers and streamk_spills are inconsistent with the new API that uses get_matmul_registers() and get_matmul_spills() methods. These should be updated to use the getter methods or the attributes should be renamed to match the new pattern.
# SPDX-License-Identifier: MIT

exit(1)

tile_completed = shmem.zeros((total_tiles,), device="cuda", dtype=torch.int32)
locks = shmem.zeros((total_tiles,), device="cuda", dtype=torch.int32)
Copy link

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The locks tensor is allocated with size total_tiles but should be allocated based on the number of tiles that will actually be processed by each rank. This could lead to memory waste or insufficient locks for proper synchronization.

Suggested change
locks = shmem.zeros((total_tiles,), device="cuda", dtype=torch.int32)
local_blocks_M = total_blocks_M
local_blocks_N = total_blocks_N
local_tiles = local_blocks_M * local_blocks_N
locks = shmem.zeros((local_tiles,), device="cuda", dtype=torch.int32)

Copilot uses AI. Check for mistakes.

tl.debug_barrier()

# Signal "ready" by setting NEXT rank's flag for this tile to 1
iris.atomic_xchg(locks + tile_id, 1, cur_rank, next_rank, heap_bases, sem="release", scope="sys")
Copy link

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomic exchange operation uses locks + tile_id but the locks array may not have sufficient elements for all possible tile_id values across different ranks, potentially causing out-of-bounds access.

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

in-progress We are working on it iris Iris project issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants