Intra-node shared memory (SHM) optimizations for CPU primitives #458

gaopengff · 2025-07-31T07:57:42Z

This PR implements the shared-memory (SHM) allreduce proposed in RFC #455.

meta-cla · 2025-07-31T07:57:49Z

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

d4l3k · 2025-08-11T17:14:13Z

gloo/allreduce.cc

+  if (is_intra_node(context->size)) {
+    shm(opts);
+    return;
+  }


Can we move this into the allreduce() function above? It fits a bit better there, since that is the level where we pick the algorithm.

It may also be nice to add a RING_LOCAL or RING_SHMEM algorithm selector.
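A minimal sketch of what selecting the SHM path at the algorithm-selection level could look like. Everything here is hypothetical and for illustration only: the `Algorithm` enum, the `Options` struct, the `allRanksOnSameHost` flag, and `selectAlgorithm` are stand-ins, not Gloo's actual types or the PR's code. The idea is that an explicit selector wins, and the SHM path is picked automatically only when the caller has already established that all ranks share a host.

```cpp
#include <stdexcept>

// Hypothetical selector values; RING_SHMEM is the value suggested in review.
enum class Algorithm { UNSPECIFIED, RING, BCUBE, RING_SHMEM };

// Hypothetical stand-in for the options handed to allreduce().
struct Options {
  Algorithm algorithm = Algorithm::UNSPECIFIED;
  // Assumed to be determined elsewhere (for example, by comparing hostnames).
  bool allRanksOnSameHost = false;
};

// Decide which implementation the top-level allreduce() would dispatch to.
Algorithm selectAlgorithm(const Options& opts) {
  if (opts.algorithm == Algorithm::RING_SHMEM && !opts.allRanksOnSameHost) {
    // An explicit SHM request cannot be honored across hosts.
    throw std::invalid_argument("RING_SHMEM requires all ranks on one host");
  }
  if (opts.algorithm != Algorithm::UNSPECIFIED) {
    return opts.algorithm;  // an explicit choice always wins
  }
  // No explicit choice: prefer shared memory only when it is known to be safe.
  return opts.allRanksOnSameHost ? Algorithm::RING_SHMEM : Algorithm::RING;
}
```

With this shape, ring() itself no longer needs an early return into the SHM code, and callers that want to opt out can simply request RING explicitly.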

d4l3k · 2025-08-11T17:15:08Z

gloo/allreduce.cc

@@ -153,6 +154,15 @@ void ring(
  const auto slot = Slot::build(kAllreduceSlotPrefix, opts.tag);
  const size_t totalBytes = opts.elements * opts.elementSize;

+
+  if (is_intra_node(context->size)) {


We should also check whether the tensor is a CUDA tensor and bypass the SHM path if it is.

d4l3k · 2025-08-11T17:16:28Z

gloo/allreduce_shm.cc

+
+bool is_intra_node(const int size) {
+    // must launch with torchrun
+  auto local_size_string = std::getenv("LOCAL_WORLD_SIZE");


I don't think this check is safe. With torchft, for instance, we often run Gloo only across hosts, and in an 8x8 configuration this would trigger the SHM logic for cross-host comms.
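To make the concern concrete, here is a small sketch. It assumes the check compares `LOCAL_WORLD_SIZE` against the context size, which the excerpt suggests but does not show in full. Both functions are hypothetical illustrations: `envHeuristic` models that assumed check, and `allOnSameHost` is one possible alternative that compares hostnames gathered from every rank (how they are gathered is left out).

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Models the assumed environment-based check: intra-node if
// LOCAL_WORLD_SIZE equals the number of ranks in the context.
bool envHeuristic(int contextSize, const char* localWorldSize) {
  return localWorldSize != nullptr &&
         std::atoi(localWorldSize) == contextSize;
}

// Hypothetical alternative: each rank contributes its hostname, and the
// SHM path is allowed only if every rank reports the same one.
bool allOnSameHost(const std::vector<std::string>& hostnames) {
  if (hostnames.empty()) {
    return false;
  }
  for (const auto& host : hostnames) {
    if (host != hostnames.front()) {
      return false;
    }
  }
  return true;
}
```

In the 8x8 case, a cross-host context has 8 ranks (one per host) while each host sets `LOCAL_WORLD_SIZE=8`, so `envHeuristic(8, "8")` returns true even though no two ranks share memory. The hostname comparison returns false for the same context. Matching hostnames are still only a heuristic, since processes in separate containers may not share a shared-memory namespace.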

gaopengff added 6 commits July 11, 2025 01:46

add shm allreduce

1738e8e

add bf16 and half support

76d1114

remove bf16 support

2d152a3

add bf16 support

8c29eeb

use reduce function to do reduce job

554d317

refine format

0fdde35

jianan-gu mentioned this pull request Jul 31, 2025

[RFC] Intra-node shared memory (SHM) optimizations for communication operators on CPUs #455

Open

3 tasks

fix accuracy issue

be7da7c

d4l3k requested changes Aug 11, 2025
