TL/UCP: Add all-reduce ring algorithm #1082

Open
wants to merge 1 commit into
base: master

armratner

What

This PR adds a new ring-based Allreduce algorithm (named "ring") to the UCP transport layer within UCC. It introduces:

  • A new source file allreduce_ring.c implementing the ring-based method.
  • Modifications to the build system (Makefile.am) to include the new file.
  • Updates to the Allreduce interface (allreduce.[ch]), including a new enum value UCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.
  • The ring-based algorithm’s logic (init, start, progress, and finalize) in allreduce_ring.c that manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.

Why?

A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements existing Allreduce algorithms (e.g., knomial, sliding window, DBT) by providing:

  • Improved scalability for large message sizes, where bandwidth rather than latency dominates.
  • A straightforward method for ring-style communication patterns common in distributed HPC and AI workloads.
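
For context, the large-message argument can be made concrete with the usual textbook cost model. This is a rough sketch and not a measurement of this PR: it assumes the ring runs as a reduce-scatter followed by an allgather (which may differ from what allreduce_ring.c does), with p ranks, a message of n bytes, per-message latency α, per-byte transfer time β, and per-byte reduction time γ. Recursive doubling is shown only as a familiar reference point.

$$
T_{\text{ring}} \approx 2(p-1)\,\alpha + 2\,\frac{p-1}{p}\,n\,\beta + \frac{p-1}{p}\,n\,\gamma
$$

$$
T_{\text{recursive doubling}} \approx \log_2(p)\,\bigl(\alpha + n\,\beta + n\,\gamma\bigr)
$$

Under these assumptions the ring's bandwidth term stays below 2nβ regardless of p, while its latency term grows linearly with p, so the ring is expected to pay off mainly when n is large.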

How?

The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:

  1. File Additions/Modifications:

    • allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions via ucc_dt_reduce.
    • Makefile.am: Includes the new file in the build.
    • allreduce.c/allreduce.h: Adds the new "ring" algorithm ID and associated function prototypes.
  2. Implementation Details:

    • Data is divided into num_chunks chunks, typically one per rank. Each chunk is passed around the ring (sendto/recvfrom) and reduced via a scratch buffer.
    • A scratch buffer is allocated per rank to hold incoming chunk data before reduction.
    • The algorithm ensures all chunks complete one round in the ring, then finalizes once the data is fully reduced on every rank.
  3. Code Flow:

    • Init: Sets up the ring task, scratch buffer, and references to the team’s executor.
    • Start: Posts initial sends/receives and enqueues the progress function.
    • Progress: Drives the ring of sends/receives chunk by chunk, calling ucc_dt_reduce on each incoming portion.
    • Finalize: Cleans up (frees scratch space and finishes the task).
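
The chunking and ring steps described above can be sketched in plain C. This is a single-process simulation under stated assumptions, not the code in allreduce_ring.c: it uses the common reduce-scatter plus allgather form of the ring, a memcpy into a per-rank scratch slot stands in for the send/receive pair, and a plain integer `+=` stands in for ucc_dt_reduce. The names chunk_bounds and ring_allreduce_sum are hypothetical and no UCC or UCX calls are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Offset and length (in elements) of chunk `idx` when `count` elements are
 * split into `nchunks` nearly equal chunks. The first `count % nchunks`
 * chunks get one extra element. */
static void chunk_bounds(size_t count, size_t nchunks, size_t idx,
                         size_t *offset, size_t *len)
{
    size_t base = count / nchunks;
    size_t rem  = count % nchunks;

    assert(nchunks > 0 && idx < nchunks);
    *len    = base + (idx < rem ? 1 : 0);
    *offset = idx * base + (idx < rem ? idx : rem);
}

/* Simulated sum-allreduce over `nranks` ranks. bufs[r] is rank r's buffer of
 * `count` ints, reduced in place. Returns 0 on success, -1 if the scratch
 * allocation fails. */
static int ring_allreduce_sum(int **bufs, size_t nranks, size_t count)
{
    size_t max_chunk, step, r, i, c, off, len;
    int   *scratch; /* one slot of max_chunk ints per simulated rank */

    assert(nranks > 0);
    if (nranks == 1) {
        return 0; /* a single rank already holds the result */
    }
    max_chunk = count / nranks + 1;
    scratch   = malloc(nranks * max_chunk * sizeof(*scratch));
    if (scratch == NULL) {
        return -1;
    }

    /* Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod P
     * to rank r + 1, which adds it to its own copy of that chunk. */
    for (step = 0; step < nranks - 1; step++) {
        for (r = 0; r < nranks; r++) {        /* "send" into receiver scratch */
            size_t dst = (r + 1) % nranks;
            c = (r + nranks - step) % nranks;
            chunk_bounds(count, nranks, c, &off, &len);
            memcpy(scratch + dst * max_chunk, bufs[r] + off,
                   len * sizeof(*scratch));
        }
        for (r = 0; r < nranks; r++) {        /* "reduce" the received chunk */
            size_t src = (r + nranks - 1) % nranks;
            c = (src + nranks - step) % nranks;
            chunk_bounds(count, nranks, c, &off, &len);
            for (i = 0; i < len; i++) {
                bufs[r][off + i] += scratch[r * max_chunk + i];
            }
        }
    }

    /* Phase 2, allgather: rank r now owns the fully reduced chunk
     * (r + 1) mod P. At step s it forwards chunk (r + 1 - s) mod P to rank
     * r + 1. The receiver never sends that same chunk in the same step, so
     * a direct copy is safe in this simulation. */
    for (step = 0; step < nranks - 1; step++) {
        for (r = 0; r < nranks; r++) {
            size_t dst = (r + 1) % nranks;
            c = (r + 1 + nranks - step) % nranks;
            chunk_bounds(count, nranks, c, &off, &len);
            memcpy(bufs[dst] + off, bufs[r] + off, len * sizeof(**bufs));
        }
    }

    free(scratch);
    return 0;
}
```

To try it, fill one array per simulated rank, pass an array of pointers to ring_allreduce_sum, and check that every rank ends up with the element-wise sum. Splitting the buffer into one chunk per rank keeps each step's transfer at roughly count / nranks elements, which is what keeps per-rank traffic nearly independent of the ring size.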

By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.

@swx-jenkins3

Can one of the admins verify this patch?

@armratner
Author

Working on Gtest
