[Feature] Add Experimental Gluon-style Aggregate API for Iris Backend #185
Co-authored-by: mawad-amd <[email protected]>
The backend aggregate now uses self.cur_rank internally, so users only need to specify the remote rank for operations. This makes the API much cleaner:
- load(ptr, from_rank) instead of load(ptr, to_rank, from_rank)
- store(ptr, val, to_rank) instead of store(ptr, val, from_rank, to_rank)
- atomic_*(ptr, val, to_rank) instead of atomic_*(ptr, val, from_rank, to_rank)
- get(from_ptr, to_ptr, from_rank) instead of get(from_ptr, to_ptr, from_rank, to_rank)
- put(from_ptr, to_ptr, to_rank) instead of put(from_ptr, to_ptr, from_rank, to_rank)
Updated all documentation and examples to reflect the simplified API.
Co-authored-by: mawad-amd <[email protected]>
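To illustrate the simplified signatures, here is a minimal host-side mock in plain Python. It is not the Gluon aggregate from this PR: `MockBackend`, the list-based heaps, and the use of integer offsets in place of pointers are all hypothetical. It only shows how keeping the current rank inside the object removes one rank argument from every call.

```python
class MockBackend:
    """Hypothetical stand-in for the backend aggregate (plain Python, no GPU)."""

    def __init__(self, cur_rank, heaps):
        self.cur_rank = cur_rank  # stored once, so callers never pass it again
        self.heaps = heaps        # heaps[r] is a list modelling rank r's memory

    def load(self, offset, from_rank):
        # Read one value from the remote rank's heap.
        return self.heaps[from_rank][offset]

    def store(self, offset, value, to_rank):
        # Write one value into the remote rank's heap.
        self.heaps[to_rank][offset] = value

    def atomic_add(self, offset, value, to_rank):
        # Models an atomic add: returns the old value, stores old + value.
        old = self.heaps[to_rank][offset]
        self.heaps[to_rank][offset] = old + value
        return old

    def get(self, from_offset, to_offset, from_rank):
        # Pull: remote (from_rank) -> local (self.cur_rank).
        self.heaps[self.cur_rank][to_offset] = self.heaps[from_rank][from_offset]

    def put(self, from_offset, to_offset, to_rank):
        # Push: local (self.cur_rank) -> remote (to_rank).
        self.heaps[to_rank][to_offset] = self.heaps[self.cur_rank][from_offset]


heaps = [[0] * 4 for _ in range(2)]       # two ranks, four slots each
rank0 = MockBackend(cur_rank=0, heaps=heaps)
rank0.store(1, 42, to_rank=1)             # only the remote rank is named
```

In this mock the local side of `get` and `put` is always `self.cur_rank`, which is why the second rank argument becomes redundant.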
Following the example provided, the implementation now:
- Uses @gluon.jit decorator for all device-side methods
- Uses gl.* language primitives instead of tl.*
- IrisDeviceCtx.initialize() decodes context tensor: [cur_rank, num_ranks, heap_bases...]
- Host-side get_device_context() returns encoded tensor
- Kernels receive IrisDeviceCtx class and context_tensor as parameters
- Updated producer-consumer example to use proper Gluon pattern
Co-authored-by: mawad-amd <[email protected]>
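The context layout `[cur_rank, num_ranks, heap_bases...]` named above can be sketched on the host with plain Python lists. The helper names `encode_context` and `decode_context` and the `DecodedCtx` type are hypothetical; the PR's own `get_device_context()` returns a tensor, and the device-side decoding lives in `IrisDeviceCtx.initialize()`, neither of which is reproduced here.

```python
from typing import List, NamedTuple


class DecodedCtx(NamedTuple):
    cur_rank: int
    num_ranks: int
    heap_bases: List[int]


def encode_context(cur_rank: int, heap_bases: List[int]) -> List[int]:
    """Pack state as [cur_rank, num_ranks, heap_base_0, heap_base_1, ...]."""
    return [cur_rank, len(heap_bases), *heap_bases]


def decode_context(ctx: List[int]) -> DecodedCtx:
    """Inverse of encode_context; checks that every heap base is present."""
    cur_rank, num_ranks = ctx[0], ctx[1]
    heap_bases = list(ctx[2:2 + num_ranks])
    if len(heap_bases) != num_ranks:
        raise ValueError("context is shorter than num_ranks implies")
    return DecodedCtx(cur_rank, num_ranks, heap_bases)


# Round trip for rank 1 of 3, with made-up heap base addresses.
ctx = encode_context(1, [0x1000, 0x2000, 0x3000])
decoded = decode_context(ctx)
```

Packing everything into one flat sequence means a kernel needs only a single extra argument, matching the commit's note that kernels receive `context_tensor` rather than separate rank and heap parameters.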
@copilot this is looking great now. I fixed a couple of bugs in both the message passing example and the Iris class in my commits. Read my changes (past 6 commits). Next task is to port example 07_gemm_all_scatter. This is more difficult -- consult the references for guidance.
Created gemm_all_scatter_gluon.py following the Gluon pattern:
- Uses @gluon.jit decorator for the kernel
- IrisDeviceCtx.initialize() decodes context tensor
- Uses gl.* language primitives (gl.load, gl.store, gl.dot, gl.atomic_*, etc.)
- Uses gl.BlockedLayout for gl.arange operations (64 threads per warp for AMD)
- Context methods (ctx.store) for inter-rank communication
- Maintains all GEMM optimizations from original example
Co-authored-by: mawad-amd <[email protected]>
Done! Ported the Key implementation details:
The example demonstrates a complex GEMM computation with a distributed all-scatter communication pattern using proper Gluon.
Commit: 8785f2a
Created examples/07_gemm_all_scatter/gluon/ subdirectory containing:
- gemm_all_scatter.py: Core Gluon GEMM kernel (moved from parent dir)
- matmul_wrapper.py: PyTorch autograd wrapper using IrisDeviceCtx
- benchmark.py: Complete benchmark script with validation
- README.md: Documentation for the Gluon port
- __init__.py: Package initialization
All files use proper Gluon patterns:
- @gluon.jit decorator
- IrisDeviceCtx.initialize(context_tensor)
- gl.* language primitives
- Context tensor instead of heap_bases parameter
Co-authored-by: mawad-amd <[email protected]>
@copilot move the backend inside an experimental directory.
Reorganized the Gluon implementation as an experimental feature:
- Created iris/experimental/ subdirectory
- Moved iris_gluon.py to iris/experimental/iris_gluon.py
- Updated all imports from iris.iris_gluon to iris.experimental.iris_gluon
- Updated iris/__init__.py to import experimental module
- Updated all examples: message_passing_gluon.py, GEMM gluon package
- Updated tests: test_iris_gluon.py
- Updated documentation: README.md and all docs/*.md files
This change marks the Gluon API as experimental, clearly indicating that it may undergo changes in future releases while maintaining backward compatibility with the stable Iris API.
Co-authored-by: mawad-amd <[email protected]>
Done! Moved the Gluon backend to the experimental directory:
New structure:
Updated imports:
This clearly marks the Gluon API as experimental.
Commit: 63e0e95
docs/gluon-implementation-summary.md
Outdated
@@ -0,0 +1,291 @@
# Iris Gluon Port - Implementation Summary
## Overview
@copilot remove this file
Done. Removed docs/gluon-implementation-summary.md and updated all references in docs/GLUON-PORT-REPORT.md.
Commit: 2e2b13a
docs/gluon-port-readme.md
Outdated
This directory contains the Gluon-based implementation of Iris, which uses Triton's `@aggregate` decorator to encapsulate the Iris backend state.
## Overview
@copilot remove this
Removed docs/gluon-port-readme.md.
Commit: (in next push)
docs/GLUON-PORT-REPORT.md
Outdated
## Deliverables
### Code Implementation (1,033 lines)
@copilot Remove this file
Removed docs/GLUON-PORT-REPORT.md.
Commit: (in next push)
README.md
Outdated
- **SHMEM-like RMA**: Iris provides SHMEM-like RMA support in Triton.
- **Simple and Intuitive API**: Iris provides simple and intuitive RMA APIs. Writing multi-GPU programs is as easy as writing single-GPU programs.
- **Triton-based**: Iris is built on top of Triton and inherits Triton's performance and capabilities.
- **Gluon-style Aggregate API**: Optional cleaner API using Triton's `@aggregate` decorator for better encapsulation.
@copilot add note saying this is experimental
Added "(Experimental)" label to the Gluon-style Aggregate API feature and section header in README.
Commit: (in next push)
README.md
Outdated
        backend: iris_gl.IrisBackend):
    pid = tl.program_id(0)
    block_start = pid * block_size
    offsets = block_start + tl.arange(0, block_size)
@copilot Missing the layout. See example code.
Added the layout parameter to the gl.arange() call in the README example:
layout: gl.constexpr = gl.BlockedLayout([1], [64], [1], [0])
offsets = block_start + gl.arange(0, block_size, layout=layout)
Commit: (in next push)
Removed docs/gluon-implementation-summary.md as requested and updated all references in docs/GLUON-PORT-REPORT.md to reflect the removal. The remaining documentation (api-comparison.md and gluon-port-readme.md) provides sufficient coverage for the Gluon implementation.
Co-authored-by: mawad-amd <[email protected]>
- Removed docs/gluon-port-readme.md and docs/GLUON-PORT-REPORT.md
- Added (Experimental) label to Gluon API feature in README
- Updated Gluon example in README to include proper layout usage
- Updated documentation links to reflect removed files
- Marked Gluon API section header as experimental
Co-authored-by: mawad-amd <[email protected]>
@copilot main has a new function called copy; port it to Gluon.
Ported the copy() function from main iris.py to the Gluon implementation. The method enables direct copying between any two ranks where the current rank must be either the source or destination.
Changes:
- Added copy() method to IrisDeviceCtx with @gluon.jit decorator
- Uses gl.load and gl.store with proper address translation
- Updated docs/api-comparison.md with copy() usage examples
- Maintains same semantics as original iris.copy() function
Co-authored-by: mawad-amd <[email protected]>
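The copy semantics described above can be modelled in plain Python. This is a sketch under stated assumptions, not the PR's `copy()` method: the functions `translate` and `copy_between_ranks`, the dict standing in for memory, and the example addresses are hypothetical. It also assumes that address translation keeps the offset within the heap (subtract one rank's heap base, add the other's) and that both pointers are given in the current rank's address space.

```python
def translate(ptr, from_rank, to_rank, heap_bases):
    """Assumed translation: same heap offset, different rank's base address."""
    return ptr - heap_bases[from_rank] + heap_bases[to_rank]


def copy_between_ranks(memory, src_ptr, dst_ptr, from_rank, to_rank,
                       cur_rank, heap_bases):
    """Copy one value from from_rank's heap to to_rank's heap.

    memory maps absolute addresses to values (a stand-in for device memory).
    """
    # The commit message requires the caller to be one end of the transfer.
    if cur_rank not in (from_rank, to_rank):
        raise ValueError("cur_rank must be either from_rank or to_rank")
    src = translate(src_ptr, cur_rank, from_rank, heap_bases)
    dst = translate(dst_ptr, cur_rank, to_rank, heap_bases)
    memory[dst] = memory[src]  # models a load followed by a store


heap_bases = [1000, 2000]            # made-up base addresses for two ranks
memory = {1004: 7, 2008: 0}
# Rank 0 pushes its value at offset 4 to offset 8 on rank 1.
copy_between_ranks(memory, 1004, 1008, from_rank=0, to_rank=1,
                   cur_rank=0, heap_bases=heap_bases)
```

When the current rank is the source, the source translation is the identity and only the destination address moves into the remote heap; the reverse holds when the current rank is the destination.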
Done! Added the Key implementation:
Commit: 13cd3f2
Gluon Port for Iris - COMPLETE ✅
Successfully completed the Gluon port of Iris using proper Gluon with the `@gluon.jit` decorator!
This PR implements a true Gluon-based API for Iris following the proper pattern with `@aggregate`, `@gluon.jit`, and `gl.*` language primitives. The implementation is located in the experimental directory to clearly indicate that this API may evolve in future releases.
📊 Implementation Summary
Lines of Code
Files Created/Modified
✅ iris/experimental/iris_gluon.py - Complete Gluon implementation
- `IrisDeviceCtx` aggregate with @gluon.jit methods
- `IrisDeviceCtx.initialize()` decodes context tensor
- `gl.*` language primitives
- `IrisGluon.get_device_context()` returns encoded tensor
- `copy()` method for direct rank-to-rank data transfers
✅ iris/experimental/__init__.py - Experimental module initialization
✅ examples/06_message_passing/message_passing_gluon.py
- `@gluon.jit` decorator
- `gl.*` primitives (gl.load, gl.store, gl.atomic_cas, etc.)
✅ examples/07_gemm_all_scatter/gluon/ - Complete package
✅ tests/unittests/test_iris_gluon.py - Unit tests (updated imports)
✅ docs/api-comparison.md - Side-by-side API comparison and migration guide (includes copy() examples)
✅ iris/__init__.py - Exposed experimental module
✅ README.md - Added experimental Gluon API section with proper layout examples
🎯 Key Features
IrisDeviceCtx Aggregate with Gluon
- `@aggregate` decorator
- `initialize()` method with `@gluon.jit` decodes context tensor
- `@gluon.jit` and `gl.*` primitives:
  - `load()`, `store()`, `get()`, `put()`, `copy()`
  - `atomic_add()`, `atomic_sub()`, `atomic_cas()`, `atomic_xchg()`, `atomic_xor()`, `atomic_and()`, `atomic_or()`, `atomic_min()`, `atomic_max()`
Examples Ported to Gluon
Producer-Consumer (message_passing_gluon.py)
GEMM All-Scatter (gluon/ subdirectory)
API Pattern
Host Side:
Device Side:
✅ Benefits
- `initialize()` call decodes context
- `IrisDeviceCtx: gl.constexpr` contract
📚 Documentation
🧪 Testing Status
✅ Completed
⏳ Pending
🚀 Usage
GEMM All-Scatter Example
    cd examples/07_gemm_all_scatter/gluon
    python benchmark.py -m 8192 -n 4608 -k 36864 --validate --benchmark -r 2
API Usage
🎓 Technical Notes
- `@gluon.jit` for all device methods
- `gl.*` language primitives (gl.load, gl.store, gl.atomic_*, gl.dot, etc.)
- Context tensor: `[cur_rank, num_ranks, heap_base_0, heap_base_1, ...]`
- `IrisDeviceCtx.initialize()` decodes the tensor
- `gl.BlockedLayout` for `gl.arange()` operations (AMD: 64 threads/warp)
📈 Impact
This implementation properly uses Gluon's programming model, making it a true Gluon port rather than just using the `@aggregate` decorator. It follows the established patterns from Triton's Gluon examples and demonstrates both simple (message passing) and complex (GEMM all-scatter) use cases with complete supporting infrastructure including kernels, wrappers, benchmarks, and documentation.
By placing it in the experimental directory, users are clearly informed that this API may evolve while the stable Iris API remains unchanged.
✨ Ready for Testing
The implementation follows the proper Gluon pattern as requested and is now clearly marked as experimental. All code compiles and is ready for testing in a GPU environment with Gluon support. Both simple and complex examples are provided with complete supporting files (kernels, wrappers, benchmarks, documentation) organized in dedicated subdirectories.
Fixes #184