TL/UCP: Add all-reduce ring algorithm #1082
Open
What
This PR adds a new ring-based Allreduce algorithm (named "ring") to the UCP transport layer within UCC. It introduces:
- A new file, allreduce_ring.c, implementing the ring-based method.
- A build update (Makefile.am) to include the new file.
- Changes to the existing Allreduce sources (allreduce.[ch]), including a new enum value UCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.
- Logic in allreduce_ring.c that manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.
Why ?
A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements the existing Allreduce algorithms (e.g., knomial, sliding window, DBT).
How ?
The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:
File Additions/Modifications:
- allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions via ucc_dt_reduce.
- Makefile.am: Includes the new file in the build.
- allreduce.c/allreduce.h: Adds the new "ring" algorithm ID and associated function prototypes.
Implementation Details:
- The input data is split into num_chunks chunks, typically equal to the number of ranks. Each chunk is passed around the ring (sendto/recvfrom) and reduced in a scratch buffer.
- A scratch buffer is allocated per rank to hold incoming chunk data before reduction.
Code Flow:
- ucc_dt_reduce is applied on each incoming portion.
By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.