Draft pull request: changes from all commits (23 commits)
0053b8b Initial plan (Copilot, Oct 2, 2025)
eb5df91 Add Gluon-based Iris implementation and producer-consumer example (Copilot, Oct 2, 2025)
77a22b1 Fix Gluon implementation to use Triton language primitives correctly (Copilot, Oct 2, 2025)
479eec8 Add Gluon API to main init and create unit test (Copilot, Oct 2, 2025)
9bf431a Add comprehensive documentation for Gluon port (Copilot, Oct 2, 2025)
e2f7b2b Update README with Gluon API documentation and example (Copilot, Oct 2, 2025)
0e45b9f Add final implementation report for Gluon port (Copilot, Oct 2, 2025)
7f287a6 Simplify API by removing redundant cur_rank parameter from methods (Copilot, Oct 2, 2025)
68bb37a Rewrite Gluon implementation to properly use @gluon.jit and gl language (Copilot, Oct 3, 2025)
44cf831 Fix runtime bugs (mawad-amd, Oct 3, 2025)
eb1f434 Fix linter errors (mawad-amd, Oct 3, 2025)
1c1eae0 Fix linter errors (mawad-amd, Oct 3, 2025)
49be5a6 Fix linter errors (mawad-amd, Oct 3, 2025)
e63453a Apply Ruff auto-fixes (github-actions[bot], Oct 3, 2025)
d941d67 Merge branch 'main' into copilot/fix-c90b7802-bcbe-46b5-a825-1b55a3d1… (mawad-amd, Oct 3, 2025)
8785f2a Port 07_gemm_all_scatter example to Gluon (Copilot, Oct 3, 2025)
e8f1cf9 Organize Gluon GEMM example in subdirectory with all supporting files (Copilot, Oct 3, 2025)
63e0e95 Move Gluon backend to experimental directory (Copilot, Oct 3, 2025)
1d61b87 Apply Ruff auto-fixes (github-actions[bot], Oct 3, 2025)
2e2b13a Remove gluon-implementation-summary.md documentation file (Copilot, Oct 3, 2025)
5eba090 Clean up documentation and update README with experimental notes (Copilot, Oct 3, 2025)
d0f9952 Merge branch 'main' into copilot/fix-c90b7802-bcbe-46b5-a825-1b55a3d1… (mawad-amd, Oct 11, 2025)
13cd3f2 Add copy method to Gluon IrisDeviceCtx (Copilot, Oct 11, 2025)
README.md: 42 additions, 0 deletions
@@ -18,6 +18,7 @@ Iris is a Triton-based framework for Remote Memory Access (RMA) operations devel
- **SHMEM-like RMA**: Iris provides SHMEM-like RMA support in Triton.
- **Simple and Intuitive API**: Iris provides simple and intuitive RMA APIs. Writing multi-GPU programs is as easy as writing single-GPU programs.
- **Triton-based**: Iris is built on top of Triton and inherits Triton's performance and capabilities.
- **Gluon-style Aggregate API (Experimental)**: An optional, cleaner API that uses Triton's `@aggregate` decorator to encapsulate device-side state such as `heap_bases`.

## Documentation

@@ -26,6 +27,7 @@ Iris is a Triton-based framework for Remote Memory Access (RMA) operations devel
- [Examples](https://rocm.github.io/iris/reference/examples.html)
- [Fine-grained GEMM & Communication Overlap](https://rocm.github.io/iris/conceptual/finegrained-overlap.html)
- [Setup Alternatives](https://rocm.github.io/iris/getting-started/installation.html)
- [API Comparison](docs/api-comparison.md) - Original vs Gluon API comparison

## API Example

@@ -98,6 +100,46 @@ if __name__ == "__main__":
mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```

### Alternative: Gluon-style Aggregate API (Experimental)

Iris also provides an experimental, cleaner API built on Triton's Gluon dialect and its `@gluon.jit` decorator:

```python
import torch
import iris.experimental.iris_gluon as iris_gl
from triton.experimental import gluon
from triton.experimental.gluon import language as gl

# Device-side API - the context encapsulates heap_bases
@gluon.jit
def kernel(IrisDeviceCtx: gl.constexpr, context_tensor,
           buffer, buffer_size: gl.constexpr, block_size: gl.constexpr):
    # Reconstruct the device context from the encoded tensor
    ctx = IrisDeviceCtx.initialize(context_tensor)

    pid = gl.program_id(0)
    block_start = pid * block_size
    layout: gl.constexpr = gl.BlockedLayout([1], [64], [1], [0])
    offsets = block_start + gl.arange(0, block_size, layout=layout)
    mask = offsets < buffer_size

    # Store 1 in the target rank's buffer - no need to pass heap_bases separately!
    target_rank = 1
    ctx.store(buffer + offsets, 1, target_rank, mask=mask)

def _worker(rank, world_size):
    # Illustrative sizes; initialize as before...
    heap_size = 1 << 30      # 1 GiB symmetric heap
    buffer_size = 4096       # number of elements
    block_size = 1024
    source_rank = 0
    cur_rank = rank          # rank is provided by the launcher (e.g. mp.spawn)

    iris_ctx = iris_gl.iris(heap_size)
    context_tensor = iris_ctx.get_device_context()  # Get the encoded context

    buffer = iris_ctx.zeros(buffer_size, device="cuda", dtype=torch.float32)

    grid = (buffer_size + block_size - 1) // block_size
    if cur_rank == source_rank:
        kernel[(grid,)](iris_gl.IrisDeviceCtx, context_tensor,
                        buffer, buffer_size, block_size, num_warps=1)
```
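
For completeness, a minimal launcher sketch following the `mp.spawn` pattern from the original-API example above; the `world_size` value is an illustrative assumption:

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    world_size = 2  # illustrative: rank 0 writes into rank 1's buffer
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```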

See [docs/api-comparison.md](docs/api-comparison.md) for a complete comparison.

## Quick Start Guide

### Quick Installation