
@Copilot Copilot AI commented Oct 2, 2025

Gluon Port for Iris - COMPLETE ✅

This completes the Gluon port of Iris using the @gluon.jit decorator.

This PR implements a Gluon-based API for Iris built on @aggregate, @gluon.jit, and the gl.* language primitives. The implementation lives in the experimental directory to make clear that this API may evolve in future releases.


📊 Implementation Summary

Lines of Code

  • Total: ~1,900 lines (implementation + examples + documentation)
  • iris_gluon.py: 670+ lines with @gluon.jit methods (in experimental/)
  • Producer-consumer example: Updated to use iris.experimental.iris_gluon
  • GEMM all-scatter package: Complete implementation with kernel, wrapper, benchmark, and docs
  • Documentation: Streamlined with API comparison guide

Files Created/Modified

  1. iris/experimental/iris_gluon.py - Complete Gluon implementation

    • IrisDeviceCtx aggregate with @gluon.jit methods
    • IrisDeviceCtx.initialize() decodes context tensor
    • All methods use gl.* language primitives
    • IrisGluon.get_device_context() returns encoded tensor
    • Includes copy() method for direct rank-to-rank data transfers
  2. iris/experimental/__init__.py - Experimental module initialization

  3. examples/06_message_passing/message_passing_gluon.py

    • Updated to import from iris.experimental.iris_gluon
    • Kernels use @gluon.jit decorator
    • Use gl.* primitives (gl.load, gl.store, gl.atomic_cas, etc.)
  4. examples/07_gemm_all_scatter/gluon/ - Complete package

    • Updated all imports to iris.experimental.iris_gluon
    • gemm_all_scatter.py: Core Gluon GEMM kernel with persistent pattern
    • matmul_wrapper.py: PyTorch autograd wrapper using IrisDeviceCtx
    • benchmark.py: Complete benchmark script with validation and performance testing
    • README.md: Documentation and usage guide
    • __init__.py: Package initialization
  5. tests/unittests/test_iris_gluon.py - Unit tests (updated imports)

  6. docs/api-comparison.md - Side-by-side API comparison and migration guide (includes copy() examples)

  7. iris/__init__.py - Exposed experimental module

  8. README.md - Added experimental Gluon API section with proper layout examples


🎯 Key Features

IrisDeviceCtx Aggregate with Gluon

  • Uses @aggregate decorator
  • initialize() method with @gluon.jit decodes context tensor
  • 15 device methods all using @gluon.jit and gl.* primitives:
    • Memory ops: load(), store(), get(), put(), copy()
    • Atomics: atomic_add(), atomic_sub(), atomic_cas(), atomic_xchg(), atomic_xor(), atomic_and(), atomic_or(), atomic_min(), atomic_max()
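
For orientation, here is a minimal sketch of this aggregate pattern. The field names and type annotations, the constructor call, and the _translate() helper are illustrative assumptions rather than the actual implementation, and the import for @aggregate is omitted because it depends on the Triton version:

from triton.experimental import gluon
from triton.experimental.gluon import language as gl

@aggregate  # encapsulates device-side state: rank info plus the heap-base table
class IrisDeviceCtx:
    cur_rank: gl.tensor
    num_ranks: gl.tensor
    heap_bases: gl.tensor  # pointer to the per-rank heap base addresses

    @gluon.jit
    def initialize(context_tensor):
        # Decode the encoded tensor: [cur_rank, num_ranks, heap_base_0, ...]
        cur_rank = gl.load(context_tensor)
        num_ranks = gl.load(context_tensor + 1)
        heap_bases = context_tensor + 2
        return IrisDeviceCtx(cur_rank, num_ranks, heap_bases)

    @gluon.jit
    def load(self, ptr, from_rank, mask=None):
        # Translate ptr from the local heap into from_rank's heap, then load.
        remote_ptr = self._translate(ptr, from_rank)  # hypothetical helper
        return gl.load(remote_ptr, mask=mask)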

Examples Ported to Gluon

  1. Producer-Consumer (message_passing_gluon.py)

    • Basic inter-rank communication pattern
    • Demonstrates load/store/atomic operations
  2. GEMM All-Scatter (gluon/ subdirectory)

    • Complete package with kernel, wrapper, and benchmark
    • Complex GEMM computation with distributed results
    • Persistent kernel pattern
    • Optimized tiling and blocking
    • All-scatter communication pattern
    • Full validation and benchmarking support

API Pattern

Host Side:

import iris.experimental.iris_gluon as iris_gl

ctx = iris_gl.iris(heap_size=2**30)
context_tensor = ctx.get_device_context()  # Encode: [cur_rank, num_ranks, heap_bases...]

Device Side:

from triton.experimental import gluon
from triton.experimental.gluon import language as gl

@gluon.jit
def kernel(IrisDeviceCtx: gl.constexpr, context_tensor, ...):
    ctx = IrisDeviceCtx.initialize(context_tensor)  # Decode
    layout: gl.constexpr = gl.BlockedLayout([1], [64], [1], [0])
    offsets = gl.arange(0, size, layout=layout)
    ctx.load(buffer, 1)  # Use gl.* internally
    ctx.copy(src_ptr, dst_ptr, from_rank, to_rank)  # Direct rank-to-rank copy

✅ Benefits

  1. True Gluon Implementation - Uses @gluon.jit and gl.* primitives
  2. Context Encoding - Efficient tensor-based context passing
  3. Clean Initialization - Single initialize() call decodes context
  4. Type Safety - Clear IrisDeviceCtx: gl.constexpr contract
  5. Backward Compatible - Original API unchanged
  6. Well Documented - Complete examples and API comparison guide
  7. Complex Examples - Both simple and complex patterns demonstrated
  8. Clearly Marked as Experimental - In dedicated experimental/ directory
  9. Complete Feature Parity - All operations from main Iris API including copy()

📚 Documentation

  • docs/api-comparison.md - Side-by-side API comparison and migration guide (includes copy() examples)
  • README.md - Experimental Gluon API section with layout examples

🧪 Testing Status

✅ Completed

  • Syntax validation (all files compile)
  • Structure validation (unit tests)
  • Example code (producer-consumer + complete GEMM package)
  • Documentation complete
  • Organized as experimental feature

⏳ Pending

  • Full GPU execution (requires PyTorch/ROCm + Gluon support)
  • Multi-rank testing (requires distributed setup)
  • Performance benchmarking

🚀 Usage

GEMM All-Scatter Example

cd examples/07_gemm_all_scatter/gluon
python benchmark.py -m 8192 -n 4608 -k 36864 --validate --benchmark -r 2

API Usage

import torch
import iris.experimental.iris_gluon as iris_gl
from triton.experimental import gluon
from triton.experimental.gluon import language as gl

# Host side
ctx = iris_gl.iris(heap_size=2**30)
context_tensor = ctx.get_device_context()
buffer = ctx.zeros(1024, dtype=torch.float32)
src = ctx.zeros(1024, dtype=torch.float32)
dst = ctx.zeros(1024, dtype=torch.float32)

# Device side
@gluon.jit
def my_kernel(IrisDeviceCtx: gl.constexpr, context_tensor, buffer, src, dst):
    ctx = IrisDeviceCtx.initialize(context_tensor)
    layout: gl.constexpr = gl.BlockedLayout([1], [64], [1], [0])
    offsets = gl.arange(0, 1024, layout=layout)
    data = ctx.load(buffer + offsets, 1)          # Load from rank 1's heap
    ctx.copy(src + offsets, dst + offsets, 0, 1)  # Copy from rank 0 to rank 1

# Launch
my_kernel[(1,)](iris_gl.IrisDeviceCtx, context_tensor, buffer, src, dst, num_warps=1)

🎓 Technical Notes

  • Uses @gluon.jit for all device methods
  • Uses gl.* language primitives (gl.load, gl.store, gl.atomic_*, gl.dot, etc.)
  • Context encoded as tensor: [cur_rank, num_ranks, heap_base_0, heap_base_1, ...]
  • IrisDeviceCtx.initialize() decodes the tensor
  • Full feature parity with original Iris including copy() method
  • Current rank automatically used from decoded context
  • Uses gl.BlockedLayout for gl.arange() operations (AMD: 64 threads/warp)
  • Located in experimental/ directory to indicate evolving API
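
As a hedged illustration of this encoding, a standalone helper like the following could build the context tensor (this is not the actual IrisGluon.get_device_context() implementation; the function name and int64 dtype are assumptions):

import torch

def encode_device_context(cur_rank, num_ranks, heap_bases):
    # Pack [cur_rank, num_ranks, heap_base_0, ..., heap_base_{n-1}] into a
    # single int64 tensor that IrisDeviceCtx.initialize() decodes on device.
    return torch.tensor([cur_rank, num_ranks, *heap_bases], dtype=torch.int64)

# Example with 2 ranks and illustrative heap base addresses; in practice the
# tensor would be moved to the GPU before the kernel launch.
context_tensor = encode_device_context(0, 2, [0x7F0000000000, 0x7F8000000000])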

📈 Impact

This implementation properly uses Gluon's programming model, making it a true Gluon port rather than just using the @aggregate decorator. It follows the established patterns from Triton's Gluon examples and demonstrates both simple (message passing) and complex (GEMM all-scatter) use cases with complete supporting infrastructure including kernels, wrappers, benchmarks, and documentation.

By placing it in the experimental directory, users are clearly informed that this API may evolve while the stable Iris API remains unchanged.


✨ Ready for Testing

The implementation follows the proper Gluon pattern as requested and is now clearly marked as experimental. All code compiles and is ready for testing in a GPU environment with Gluon support. Both simple and complex examples are provided with complete supporting files (kernels, wrappers, benchmarks, documentation) organized in dedicated subdirectories.

Fixes #184

Original prompt

This section details the original issue you should resolve

<issue_title>[Feature]: Gluon Port</issue_title>
<issue_description>### Suggestion Description

Port the backend of Iris to Gluon. TODOs:

  1. Use @aggregate to encapsulate the Iris backend struct so that we don't need to pass the heap bases around. Instead we use the Iris object.
  2. Port the entire iris.py to Gluon
  3. Port the producer consumer example only to Gluon

Some resources to look at:
https://github.com/triton-lang/triton/tree/main/python/tutorials/gluon
https://github.com/triton-lang/triton/tree/main/python/examples/gluon
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/examples/gluon/01-attention-forward.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/01-intro.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/02-layouts.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/03-async-copy.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/04-tma.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/05-wgmma.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/06-tcgen05.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/07-persistence.py
https://raw.githubusercontent.com/triton-lang/triton/refs/heads/main/python/tutorials/gluon/08-warp-specialization.py

</issue_description>

<agent_instructions>Read the examples in the links to better understand Gluon. Only Port iris.py and producer consumer simple example.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes #184


@Copilot Copilot AI changed the title from "[WIP] [Feature]: Gluon Port" to "[Feature] Add Gluon-style Aggregate API for Iris Backend" on Oct 2, 2025
The backend aggregate now uses self.cur_rank internally, so users only need
to specify the remote rank for operations. This makes the API much cleaner:
- load(ptr, from_rank) instead of load(ptr, to_rank, from_rank)
- store(ptr, val, to_rank) instead of store(ptr, val, from_rank, to_rank)
- atomic_*(ptr, val, to_rank) instead of atomic_*(ptr, val, from_rank, to_rank)
- get(from_ptr, to_ptr, from_rank) instead of get(from_ptr, to_ptr, from_rank, to_rank)
- put(from_ptr, to_ptr, to_rank) instead of put(from_ptr, to_ptr, from_rank, to_rank)

Updated all documentation and examples to reflect the simplified API.

Co-authored-by: mawad-amd <[email protected]>
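
A hedged before/after sketch of one of these call sites inside a @gluon.jit kernel (the pointer and value names are illustrative):

# Before: both ranks were passed explicitly
data = ctx.load(ptr, to_rank, from_rank)
# After: the aggregate supplies the current rank from self.cur_rank
data = ctx.load(ptr, from_rank)
ctx.store(ptr, val, to_rank)
ctx.atomic_add(ptr, val, to_rank)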
Following the example provided, the implementation now:
- Uses @gluon.jit decorator for all device-side methods
- Uses gl.* language primitives instead of tl.*
- IrisDeviceCtx.initialize() decodes context tensor: [cur_rank, num_ranks, heap_bases...]
- Host-side get_device_context() returns encoded tensor
- Kernels receive IrisDeviceCtx class and context_tensor as parameters
- Updated producer-consumer example to use proper Gluon pattern

Co-authored-by: mawad-amd <[email protected]>
@mawad-amd

@copilot this is looking great now. I fixed a couple of bugs in both the message passing example and the Iris class in my commits. Read my changes (past 6 commits). Next task is to port example 07_gemm_all_scatter. This is more difficult -- consult the references for guidance.

Created gemm_all_scatter_gluon.py following the Gluon pattern:
- Uses @gluon.jit decorator for the kernel
- IrisDeviceCtx.initialize() decodes context tensor
- Uses gl.* language primitives (gl.load, gl.store, gl.dot, gl.atomic_*, etc.)
- Uses gl.BlockedLayout for gl.arange operations (64 threads per warp for AMD)
- Context methods (ctx.store) for inter-rank communication
- Maintains all GEMM optimizations from original example

Co-authored-by: mawad-amd <[email protected]>

Copilot AI commented Oct 3, 2025

@copilot this is looking great now. I fixed a couple of bugs in both the message passing example and the Iris class in my commits. Read my changes (passed 6 commits). Next task is to port example 07_g...

Done! Ported the 07_gemm_all_scatter example to Gluon.

Key implementation details:

  • Uses @gluon.jit decorator for the kernel
  • IrisDeviceCtx.initialize(context_tensor) decodes context
  • All GEMM operations use gl.* primitives: gl.load, gl.store, gl.dot, gl.zeros
  • Uses gl.BlockedLayout([1], [64], [1], [0]) for gl.arange operations
  • Inter-rank scatter via ctx.store(c_global + offset, c, remote_rank, mask)
  • Maintains all optimizations: persistent kernel, tiling, blocking, compiler hints

The example demonstrates complex GEMM computation with distributed all-scatter communication pattern using proper Gluon.
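
A hedged sketch of that scatter step (the tile, pointer, and mask names follow the bullets above; looping over all ranks and the constexpr rank count are assumptions, and the real kernel additionally handles the persistent loop, tiling, and compiler hints):

@gluon.jit
def scatter_tile(ctx, c, c_global, offset, mask, num_ranks: gl.constexpr):
    # Write the locally computed GEMM tile `c` into every rank's copy of C.
    for remote_rank in range(num_ranks):
        ctx.store(c_global + offset, c, remote_rank, mask)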

Commit: 8785f2a

Created examples/07_gemm_all_scatter/gluon/ subdirectory containing:
- gemm_all_scatter.py: Core Gluon GEMM kernel (moved from parent dir)
- matmul_wrapper.py: PyTorch autograd wrapper using IrisDeviceCtx
- benchmark.py: Complete benchmark script with validation
- README.md: Documentation for the Gluon port
- __init__.py: Package initialization

All files use proper Gluon patterns:
- @gluon.jit decorator
- IrisDeviceCtx.initialize(context_tensor)
- gl.* language primitives
- Context tensor instead of heap_bases parameter

Co-authored-by: mawad-amd <[email protected]>
@mawad-amd

@copilot move the backend inside an experimental directory.

Reorganized the Gluon implementation as an experimental feature:
- Created iris/experimental/ subdirectory
- Moved iris_gluon.py to iris/experimental/iris_gluon.py
- Updated all imports from iris.iris_gluon to iris.experimental.iris_gluon
- Updated iris/__init__.py to import experimental module
- Updated all examples: message_passing_gluon.py, GEMM gluon package
- Updated tests: test_iris_gluon.py
- Updated documentation: README.md and all docs/*.md files

This change marks the Gluon API as experimental, clearly indicating that
it may undergo changes in future releases while maintaining backward
compatibility with the stable Iris API.

Co-authored-by: mawad-amd <[email protected]>

Copilot AI commented Oct 3, 2025

@copilot move the backend inside an experimental directory.

Done! Moved the Gluon backend to the experimental directory:

New structure:

iris/
├── experimental/
│   ├── __init__.py
│   └── iris_gluon.py    # Gluon implementation
└── ...

Updated imports:

  • Changed from import iris.iris_gluon to import iris.experimental.iris_gluon
  • Updated all examples (message_passing_gluon.py, GEMM gluon package)
  • Updated tests (test_iris_gluon.py)
  • Updated documentation (README.md, docs/*.md)

This clearly marks the Gluon API as experimental.

Commit: 63e0e95

@Copilot Copilot AI changed the title from "[Feature] Add Gluon-style Aggregate API for Iris Backend" to "[Feature] Add Experimental Gluon-style Aggregate API for Iris Backend" on Oct 3, 2025
@@ -0,0 +1,291 @@
# Iris Gluon Port - Implementation Summary

## Overview

@copilot remove this file

Done. Removed docs/gluon-implementation-summary.md and updated all references in docs/GLUON-PORT-REPORT.md.

Commit: 2e2b13a


This directory contains the Gluon-based implementation of Iris, which uses Triton's `@aggregate` decorator to encapsulate the Iris backend state.

## Overview

@copilot remove this

Removed docs/gluon-port-readme.md.

Commit: (in next push)


## Deliverables

### Code Implementation (1,033 lines)

@copilot Remove this file

Removed docs/GLUON-PORT-REPORT.md.

Commit: (in next push)

README.md Outdated
- **SHMEM-like RMA**: Iris provides SHMEM-like RMA support in Triton.
- **Simple and Intuitive API**: Iris provides simple and intuitive RMA APIs. Writing multi-GPU programs is as easy as writing single-GPU programs.
- **Triton-based**: Iris is built on top of Triton and inherits Triton's performance and capabilities.
- **Gluon-style Aggregate API**: Optional cleaner API using Triton's `@aggregate` decorator for better encapsulation.

@copilot add note saying this is experimental

Added "(Experimental)" label to the Gluon-style Aggregate API feature and section header in README.

Commit: (in next push)

README.md Outdated
backend: iris_gl.IrisBackend):
pid = tl.program_id(0)
block_start = pid * block_size
offsets = block_start + tl.arange(0, block_size)

@copilot Missing the layout. See example code.

Added the layout parameter to the gl.arange() call in the README example:

layout: gl.constexpr = gl.BlockedLayout([1], [64], [1], [0])
offsets = block_start + gl.arange(0, block_size, layout=layout)

Commit: (in next push)

Removed docs/gluon-implementation-summary.md as requested and updated
all references in docs/GLUON-PORT-REPORT.md to reflect the removal.
The remaining documentation (api-comparison.md and gluon-port-readme.md)
provides sufficient coverage for the Gluon implementation.

Co-authored-by: mawad-amd <[email protected]>
- Removed docs/gluon-port-readme.md and docs/GLUON-PORT-REPORT.md
- Added (Experimental) label to Gluon API feature in README
- Updated Gluon example in README to include proper layout usage
- Updated documentation links to reflect removed files
- Marked Gluon API section header as experimental

Co-authored-by: mawad-amd <[email protected]>
@mawad-amd

@copilot main has a new function called copy, port it to gluon.

Ported the copy() function from main iris.py to the Gluon implementation.
The method enables direct copying between any two ranks where the current
rank must be either the source or destination.

Changes:
- Added copy() method to IrisDeviceCtx with @gluon.jit decorator
- Uses gl.load and gl.store with proper address translation
- Updated docs/api-comparison.md with copy() usage examples
- Maintains same semantics as original iris.copy() function

Co-authored-by: mawad-amd <[email protected]>

Copilot AI commented Oct 11, 2025

@copilot main has a new function called copy, port it to gluon.

Done! Added the copy() method to the Gluon IrisDeviceCtx class. The method enables direct copying between any two ranks where the current rank must be either the source or destination.

Key implementation:

  • Uses gl.load and gl.store with proper address translation
  • Takes parameters: src_ptr, dst_ptr, from_rank, to_rank, optional mask
  • Current rank is automatically used from self.cur_rank in the backend
  • Updated docs/api-comparison.md with usage examples

Commit: 13cd3f2
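
For illustration only, a hedged sketch of what such a copy() method might look like inside the IrisDeviceCtx aggregate (the _translate() helper is a hypothetical stand-in for the actual heap-base address translation):

@gluon.jit
def copy(self, src_ptr, dst_ptr, from_rank, to_rank, mask=None):
    # Read from from_rank's heap and write into to_rank's heap; the current
    # rank (self.cur_rank) must be either the source or the destination.
    data = gl.load(self._translate(src_ptr, from_rank), mask=mask)
    gl.store(self._translate(dst_ptr, to_rank), data, mask=mask)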
