Supporting New Packet Kernel Operation at Executor #677
Pull Request Overview
This PR adds support for two new packet-based reduce operations: REDUCE_COPY_PACKETS and REDUCE_COPY_SEND_PACKETS. These operations combine reduction, copying, and optionally sending data in packet format for distributed GPU communication.
Key Changes:
- Introduces two new operation types for fused reduce-copy and reduce-copy-send operations with packet format
- Implements the kernel handlers for these operations in the C++ execution layer
- Adds Python DSL support with automatic operation fusion logic
- Includes unit tests demonstrating the new operations
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/include/execution_common.hpp | Adds enum values for REDUCE_COPY_PACKETS and REDUCE_COPY_SEND_PACKETS |
| src/include/execution_kernel.hpp | Implements handleReduceCopySendPackets template function and integrates it into executeDeviceFunction |
| src/executor/execution_plan.cc | Maps string opcodes "recpkt" and "recspkt" to the new operation types |
| python/mscclpp/language/internal/types.py | Adds Instruction enum values for reduce_copy_packet and reduce_copy_send_packet |
| python/mscclpp/language/internal/operations.py | Extends ReduceOperation to support operation fusion with copy and put operations |
| python/mscclpp/language/channel.py | Adds put_packets method to MemoryChannel class |
| python/mscclpp/language/tests/unit_tests/reduce_copy_packet_test.py | Unit test demonstrating REDUCE_COPY_PACKETS operation |
| python/mscclpp/language/tests/unit_tests/reduce_copy_send_packet_test.py | Unit test demonstrating REDUCE_COPY_SEND_PACKETS operation |
| tools/npkit/npkit_trace_generator.py | Updates event names list to include new operation types |
| include/mscclpp/npkit/npkit_event.hpp | Updates NPKIT_EVENT_EXECUTOR_OP_BASE_EXIT offset to account for new operations |
This PR introduces three new executor operations to improve flexibility and performance.
One operation is invoked directly through the DSL API. The other two are created by fusing existing operations, which avoids the overhead of running them as separate steps.
Port Channel Put Packet (Direct DSL API Call): Sends data that is already in packet format to the remote side, still in packet format, over the port channel. Both the source and destination buffers must be scratch buffers.
Reduce Copy Packet (Fusion):
- Reduce Packet + Copy Packet = Reduce Copy Packet
- Triggered when the destination buffer of Reduce Packet matches the source buffer of Copy Packet.
- Purpose: combine reduction and copy into a single step for better performance.
Reduce Copy Send Packet (Fusion):
- Reduce Copy Packet + Put Packet = Reduce Copy Send Packet (when the destination buffer of Reduce Copy Packet matches the source buffer of Put Packet)
- Reduce Copy Packet + Read Put Packet = Reduce Copy Send Packet (when the destination packet buffer of Reduce Copy Packet matches the source buffer of Read Put Packet)
- Purpose: combine reduction, copy, and send into one fused operation.
Fusion Diagram
Reduce Packet + Copy Packet → Reduce Copy Packet
Reduce Copy Packet + Put Packet → Reduce Copy Send Packet
Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet
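The fusion rules above can be sketched as a small standalone Python model. This is only an illustration of the matching conditions described in this PR, not the code in `operations.py`: the `Op`, `Chunk`, and `Operation` types, the `pkt_dst` and `remote_dst` fields, and the `fuse` and `fuse_sequence` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Op(Enum):
    # Hypothetical opcodes for illustration only.
    REDUCE_PACKET = "reduce_packet"
    COPY_PACKET = "copy_packet"
    PUT_PACKET = "put_packet"
    READ_PUT_PACKET = "read_put_packet"
    REDUCE_COPY_PACKET = "reduce_copy_packet"
    REDUCE_COPY_SEND_PACKET = "reduce_copy_send_packet"


@dataclass(frozen=True)
class Chunk:
    buffer: str  # e.g. "input", "output", "scratch"
    index: int


@dataclass
class Operation:
    op: Op
    src: Chunk
    dst: Chunk
    pkt_dst: Optional[Chunk] = None     # packet-format copy target (assumed field)
    remote_dst: Optional[Chunk] = None  # remote target of the send (assumed field)


def fuse(first: Operation, second: Operation) -> Optional[Operation]:
    """Return the fused operation, or None if the pair does not match a rule."""
    # Reduce Packet + Copy Packet -> Reduce Copy Packet
    if first.op is Op.REDUCE_PACKET and second.op is Op.COPY_PACKET:
        if first.dst == second.src:
            return Operation(Op.REDUCE_COPY_PACKET, first.src, first.dst,
                             pkt_dst=second.dst)
    if first.op is Op.REDUCE_COPY_PACKET:
        # + Put Packet: the reduced data buffer feeds the put.
        put_match = second.op is Op.PUT_PACKET and first.dst == second.src
        # + Read Put Packet: the packet buffer feeds the put.
        read_put_match = (second.op is Op.READ_PUT_PACKET
                          and first.pkt_dst == second.src)
        if put_match or read_put_match:
            return Operation(Op.REDUCE_COPY_SEND_PACKET, first.src, first.dst,
                             pkt_dst=first.pkt_dst, remote_dst=second.dst)
    return None


def fuse_sequence(ops: List[Operation]) -> List[Operation]:
    """Greedily fuse each operation into the previous one when a rule applies."""
    fused: List[Operation] = []
    for op in ops:
        merged = fuse(fused[-1], op) if fused else None
        if merged is not None:
            fused[-1] = merged
        else:
            fused.append(op)
    return fused


if __name__ == "__main__":
    ops = [
        Operation(Op.REDUCE_PACKET, Chunk("scratch", 0), Chunk("output", 0)),
        Operation(Op.COPY_PACKET, Chunk("output", 0), Chunk("scratch", 4)),
        Operation(Op.PUT_PACKET, Chunk("output", 0), Chunk("scratch", 8)),
    ]
    for op in fuse_sequence(ops):
        print(op.op.value)
```

In this sketch a pair that matches no rule is left untouched, so an unrelated operation between a Reduce Packet and a Copy Packet would prevent fusion. Whether the actual DSL behaves that way is not stated in this PR.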
Beyond this, this PR adjusts the 2-node AllReduce algorithm:
| Message Size | Latency (µs) |
|---|---|
| 1K | 15.34 |
| 2K | 15.88 |
| 4K | 15.71 |
| 8K | 16.01 |
| 16K | 15.88 |
| 32K | 16.21 |
| 64K | 16.90 |
| 128K | 18.24 |
| 256K | 20.39 |
| 512K | 25.26 |
| 1M | 32.74 |
| 2M | 53.64 |