Commit e46d5a7

Rename to MSCCL (Azure#30)

Parent: 18f3e69

File tree: 97 files changed (+517, -512 lines)


.github/workflows/tests.yaml (+1, -1)

@@ -21,7 +21,7 @@ jobs:
       uses: actions/setup-python@v2
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Install sccl and dependencies
+    - name: Install msccl and dependencies
       run: |
         pip install --upgrade pip
         pip install -r requirements.txt

.gitignore (+3, -3)

@@ -1,6 +1,6 @@
-# SCCL specific
-*.sccl.json
-*.sccl.xml
+# MSCCL specific
+*.msccl.json
+*.msccl.xml
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
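As a quick sanity check on the renamed ignore patterns (an illustrative sketch, not part of the commit; Python's `fnmatch` only approximates gitignore globbing), the new `*.msccl.json` pattern matches the files MSCCL now writes and no longer matches the old extension:

```python
from fnmatch import fnmatch

# The renamed ignore patterns from the new .gitignore.
patterns = ["*.msccl.json", "*.msccl.xml"]

# Output file names taken from the SYNTHESIS.md examples.
new_file = "Allgather.n8-DGX1-steps2.msccl.json"
old_file = "Allgather.n8-DGX1-steps2.sccl.json"

print(fnmatch(new_file, patterns[0]))  # True
print(fnmatch(old_file, patterns[0]))  # False
```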

README.md (+42, -37)

@@ -1,13 +1,16 @@
-# SCCL
+# MSCCL-tools
 
-SCCL is a tool stack for programmable communication on GPUs. Algorithms created with SCCL can:
+This repo contains the developer tool stack of the [Microsoft Collective Communication Library
+(MSCCL)](https://github.com/microsoft/msccl), a platform for programmable communication on GPUs. Algorithms created with
+MSCCL can:
 - Implement either MPI-style collectives like Allreduce, or any application specific communication pattern.
 - Target specific hardware and interconnect topologies, unlocking their full potential.
 - Optimize for the data sizes in your application, making the best tradeoff between latency and bandwidth utilization.
 
-SCCL ships with algorithms targeting various Azure multi-GPU VM types. See the [Available Algorithms section](#available-algorithms) to find out what is currently available.
+MSCCL-tools also contains pre-made algorithms targeting various Azure multi-GPU VM types. See the [Available Algorithms
+section](#available-algorithms) to find out what is currently available.
 
-SCCL has two ways of creating new algorithms:
+MSCCL has two ways of creating new algorithms:
 1. MSCCLang, a high-level DSL that talks about communication in an intuitive chunk-oriented form. See the [MSCCLang
    section](#mscclang) for how to get started.
 2. Synthesis, which automatically solves optimal algorithms for a given hardware topology. Making synthesis general
@@ -16,26 +19,26 @@ introduction.
 
 ## Usage
 
-The SCCL Python package ships with a registry of synthesis strategies and hand optimized algorithms. These can be loaded
-into [the runtime](https://github.com/parasailteam/msccl) through the `sccl.init` function, which must be called before
-the application creates its NCCL communicator. For PyTorch this means before `torch.distributed` is initialized.
+The MSCCL Python package ships with a registry of synthesis strategies and hand-optimized algorithms. These can be
+loaded into [the runtime](https://github.com/parasailteam/msccl) through the `msccl.init` function, which must be called
+before the application creates its NCCL communicator. For PyTorch this means before `torch.distributed` is initialized.
 
-The following snippet requests `sccl.init` to provide an Alltoall algorithm in a configuration of 2 Azure NDv2 machines:
+The following snippet requests `msccl.init` to provide an Alltoall algorithm in a configuration of 2 Azure NDv2 machines:
 ```
-import sccl
-sccl.init('ndv2', 2, (sccl.Collective.alltoall, ('1MB')))
+import msccl
+msccl.init('ndv2', 2, (msccl.Collective.alltoall, ('1MB')))
 ```
 This will find an algorithm provider that can create an Alltoall algorithm that is expected to be good with 1MB of data.
-That will call a synthesis routine that writes the algorithm to disk. `sccl.init` will then pass a configuration file
+That will call a synthesis routine that writes the algorithm to disk. `msccl.init` will then pass a configuration file
 pointing to this algorithm to the runtime through environment variables.
 
-See [the examples](examples/sccl_init.py) for more on `sccl.init` usage.
+See [the examples](examples/msccl_init.py) for more on `msccl.init` usage.
 
 ## Available Algorithms
 
-SCCL's built-in algorithms are registered for combinations of hardware configuration and size of input data where we
-have benchmarked them to provide speedup over NCCL. To list the algorithms currently in SCCL's built-in registry, run
-`sccl plans list` on the command line. This will print out the following table (on 4/22/2022):
+MSCCL's built-in algorithms are registered for combinations of hardware configuration and size of input data where we
+have benchmarked them to provide speedup over NCCL. To list the algorithms currently in MSCCL's built-in registry, run
+`msccl plans list` on the command line. This will print out the following table (on 4/22/2022):
 
 | Machine | Collective | # machines | From | To | Protocol | Priority | Plan name |
 |-----------|--------------|--------------|--------|----------|------------|------------|-------------------------------------|
@@ -51,49 +54,51 @@ Each line lists an algorithm registration and the conditions under which it is t
 - there are 8, 16, 32 or 64 Azure NDv4 machines, and
 - the data size is from 1 MB to 32 MB.
 
-The repository [parasailteam/sccl-presynth](https://github.com/parasailteam/sccl-presynth) repository offers additional algorithms that have been
-pre-synthesized for fixed configurations. To enable them install the package and import it before the call to
-`sccl.init`.
+The [parasailteam/msccl-presynth](https://github.com/parasailteam/msccl-presynth) repository offers additional
+algorithms that have been pre-synthesized for fixed configurations. To enable them, install the package and
+import it before the call to `msccl.init`.
 
 ## MSCCLang
 
-MSCCLang is a high-level language for specifying collective communication algorithms in an intuitive chunk-oriented form. The language is available as a Python-integrated DSL.
+MSCCLang is a high-level language for specifying collective communication algorithms in an intuitive chunk-oriented
+form. The language is available as a Python-integrated DSL.
 
-The language is still under development and lacks comprehensive documentation. For now, please refer to [the pre-print of our upcoming paper](https://arxiv.org/pdf/2201.11840.pdf) and the examples in [examples/scclang](examples/scclang/).
+The language is still under development and lacks comprehensive documentation. For now, please refer to [the pre-print
+of our upcoming paper](https://arxiv.org/pdf/2201.11840.pdf) and the examples in
+[examples/mscclang](examples/mscclang/).
 
 ## Synthesis
 
-SCCL started out as a synthesizer for collective algorithms, and general synthesis of collective algorithms is an
-on-going research project. See [this readme](SYNTHESIS.md) for using SCCL as a synthesizer.
+MSCCL started out as a synthesizer for collective algorithms, and general synthesis of collective algorithms is an
+ongoing research project. See [this readme](SYNTHESIS.md) for using MSCCL as a synthesizer.
 
 ## Installation
 
 ### Python Package Installation
 
 To install, either clone this repo and run `pip install .` or run:
 ```
-pip install git+https://github.com/microsoft/sccl.git
+pip install git+https://github.com/microsoft/msccl.git
 ```
 
-Installing the SCCL Python package also installs the `sccl` command line tool. To enable Bash completion for the `sccl`
-tool:
+Installing the MSCCL Python package also installs the `msccl` command line tool. To enable Bash completion for the
+`msccl` tool:
 ```
-echo 'eval "$(register-python-argcomplete sccl)"' >> ~/.bashrc
+echo 'eval "$(register-python-argcomplete msccl)"' >> ~/.bashrc
 ```
 
 ### Runtime Installation
 
-SCCL's algorithms are executed by the [Microsoft Collective Communication Library
-(MSCCL)](https://github.com/microsoft/msccl), which is API compatible with NCCL. See https://github.com/microsoft/msccl
-for instructions.
+Algorithms are executed by the [Microsoft Collective Communication Library (MSCCL)](https://github.com/microsoft/msccl),
+which is API compatible with NCCL. See https://github.com/microsoft/msccl for instructions.
 
-To use SCCL with PyTorch, the built in NCCL submodule has to be replaced with SCCL's version. Additionally, to expose
-the new native Alltoall support that SCCL adds, PyTorch's `torch.distributed` package can optionally be patched. The
-following commands perform these steps and install PyTorch with SCCL:
+To use MSCCL with PyTorch, the built-in NCCL submodule has to be replaced with MSCCL's version. Additionally, to expose
+the new native Alltoall support that MSCCL adds, PyTorch's `torch.distributed` package can optionally be patched. The
+following commands perform these steps and install PyTorch with MSCCL:
 ```
 git clone https://github.com/pytorch/pytorch.git
 cd pytorch
-git checkout tags/v1.9.0 -b v1.9.0_sccl
+git checkout tags/v1.9.0 -b v1.9.0_msccl
 perl -p -i -e 's/url = https:\/\/github\.com\/NVIDIA\/nccl/url = https:\/\/github\.com\/microsoft\/msccl/g' .gitmodules
 git submodule sync third_party/nccl
 git submodule update --init --recursive
@@ -105,10 +110,10 @@ python setup.py install
 ### Note on Azure NDv2
 
 Azure NDv2 does not expose the true PCIe topology of the machines to the VM and worse, does not assign PCIe devices
-consistently to the virtual paths in the VM. As SCCL is generating topology-aware algorithms, this device ordering must
-be fixed. The [sccl_ndv2_launcher.sh](sccl/autosynth/sccl_ndv2_launcher.sh) script can be used to fix this problem. The
-script solves the automorphisms from the local VM's NVLink topology to the reference topology and selects one of the 4
-automorphisms based on measured placement of the Infiniband card such that GPU 0 is close to the NIC. A tool called
+consistently to the virtual paths in the VM. As MSCCL is generating topology-aware algorithms, this device ordering must
+be fixed. The [msccl_ndv2_launcher.sh](msccl/autosynth/msccl_ndv2_launcher.sh) script can be used to fix this problem.
+The script solves the automorphisms from the local VM's NVLink topology to the reference topology and selects one of the
+4 automorphisms based on measured placement of the Infiniband card such that GPU 0 is close to the NIC. A tool called
 [inspector-topo](https://github.com/microsoft/inspector-topo) needs to be available for the latter step.
 
 ## Contributing
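The NDv2 note above describes solving automorphisms from the measured NVLink topology to a reference topology. As a toy illustration of the concept (this is not MSCCL's implementation, and the 4-GPU ring below is a made-up topology), an automorphism is a relabeling of GPUs that maps the link set onto itself:

```python
from itertools import permutations

def automorphisms(edges, n):
    """Yield node relabelings of an n-node undirected graph that map its
    edge set onto itself (a brute-force sketch, fine for tiny graphs)."""
    edge_set = {frozenset(e) for e in edges}
    for perm in permutations(range(n)):
        if {frozenset((perm[a], perm[b])) for a, b in edge_set} == edge_set:
            yield perm

# A 4-GPU ring: its automorphism group is the dihedral group of the
# square, which has 8 elements (4 rotations and 4 reflections).
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(len(list(automorphisms(ring, 4))))  # 8
```

The launcher then picks, among the valid relabelings, one that places GPU 0 next to the NIC as measured by inspector-topo.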

SYNTHESIS.md (+23, -23)

@@ -1,27 +1,27 @@
 ## Synthesizing Algorithms
 
-SCCL can synthesize algorithms for a given *topology* that implements a given *collective* in a given number of steps, bandwidth usage, memory limits, etc. These additional parameters are called the *instance*.
+MSCCL can synthesize algorithms for a given *topology* that implement a given *collective* in a given number of steps, bandwidth usage, memory limits, etc. These additional parameters are called the *instance*.
 
-SCCL groups its solver strategies under the `sccl solve` subcommand. For example, to synthesize a specific `instance` of an Allgather algorithm for the [NVIDIA DGX-1](https://www.nvidia.com/en-us/data-center/dgx-1/) that completes in 4 steps:
+MSCCL groups its solver strategies under the `msccl solve` subcommand. For example, to synthesize a specific `instance` of an Allgather algorithm for the [NVIDIA DGX-1](https://www.nvidia.com/en-us/data-center/dgx-1/) that completes in 4 steps:
 ```
-$ sccl solve instance DGX1 Allgather --steps 4
+$ msccl solve instance DGX1 Allgather --steps 4
 Solving instance steps=4... synthesized! (0.7s)
-Wrote to Allgather.n8-DGX1-steps4.sccl.json
+Wrote to Allgather.n8-DGX1-steps4.msccl.json
 ```
-The instance is satisfiable and `sccl` saves it to a file.
+The instance is satisfiable and `msccl` saves it to a file.
 
 Four steps is not necessarily the least number of steps required. To find the least steps:
 ```
-$ sccl solve least-steps DGX1 Allgather
+$ msccl solve least-steps DGX1 Allgather
 Algorithms need at least 2 steps.
 Solving instance steps=2... synthesized! (0.2s)
-Wrote to Allgather.n8-DGX1-steps2.sccl.json
+Wrote to Allgather.n8-DGX1-steps2.msccl.json
 ```
 The `least-steps` strategy statically determines that any Allgather in a DGX-1 requires at least 2 steps and, starting from that, finds the smallest satisfiable number of steps.
 
 While this two-step algorithm is a latency-optimal one, there may be other algorithms that achieve higher bandwidth. The `pareto-optimal` strategy searches through different latency-bandwidth tradeoffs:
 ```
-$ sccl solve pareto-optimal DGX1 Allgather
+$ msccl solve pareto-optimal DGX1 Allgather
 Algorithms need at least 2 steps.
 Algorithms need at least 7/6 rounds per chunk.
 Solving instance steps=2... synthesized! (0.5s)
@@ -34,13 +34,13 @@ Solving instance steps=3,rounds=6,chunks=5... synthesized! (44.0s)
 Solving instance steps=3,rounds=7,chunks=6... synthesized! (56.1s)
 Bandwidth optimal algorithm found!
 Found 2 Pareto optimal algorithms. Pruned 4 non-optimal algorithms.
-Wrote to Allgather.n8-DGX1-steps2.rounds3.chunks2.sccl.json
-Wrote to Allgather.n8-DGX1-steps3.rounds7.chunks6.sccl.json
+Wrote to Allgather.n8-DGX1-steps2.rounds3.chunks2.msccl.json
+Wrote to Allgather.n8-DGX1-steps3.rounds7.chunks6.msccl.json
 ```
 
 ## Collectives
 
-SCCL includes a number of built in common collectives.
+MSCCL includes a number of built-in common collectives.
 
 | Collective | Arguments | Description | Kind |
 | - | - | - | - |
@@ -60,16 +60,16 @@ SCCL includes a number of built in common collectives.
 
 Custom collectives may be defined by instantiating the `Collective` class, which is easiest through the `build_collective` function. For example, a send from rank 2 to rank 7 in an 8-node topology can be defined and saved with:
 ```
-from sccl.collectives import build_collective
-from sccl.serialization import save_sccl_object
+from msccl.collectives import build_collective
+from msccl.serialization import save_msccl_object
 
 precondition = lambda r, c: r == 2
 postcondition = lambda r, c: r == 7
 coll = build_collective('Send', 8, 1, precondition, postcondition)
-save_sccl_object(coll, 'send.json')
+save_msccl_object(coll, 'send.json')
 ```
 
-The *kind* of the collective determines support for some features of SCCL:
+The *kind* of the collective determines support for some features of MSCCL:
 - **NC** are non-combining collectives, and are always supported.
 - **CR** are combining collectives that have a non-combining dual collective, and are supported through a reduction.
 - **CNR** are combining collectives with no dual, which may not always be supported.
@@ -78,25 +78,25 @@ Currently the rounds per chunk analysis described below can not support CNR coll
 
 ## Steps and Rounds
 
-SCCL uses two related concepts, *steps and rounds*, to talk about the running time of algorithms. *Steps* is how many sequential sets of sends the algorithm consists of, where all sends inside a step execute in parallel. The number of sends between two nodes in a single step is limited by the bandwidth available in the topology. However, a step may consist of multiple *rounds*, which acts as a multiplier for all links in the topology during that step.
+MSCCL uses two related concepts, *steps* and *rounds*, to talk about the running time of algorithms. *Steps* is how many sequential sets of sends the algorithm consists of, where all sends inside a step execute in parallel. The number of sends between two nodes in a single step is limited by the bandwidth available in the topology. However, a step may consist of multiple *rounds*, which act as a multiplier for all links in the topology during that step.
 
 How much data a single round corresponds to depends on the actual size of a chunk at runtime, and how many chunks a collective uses can change (e.g. you can control this directly in the `instance` strategy by setting `--chunks N`). Thus for each collective the total data usage of different algorithms implementing it can be measured with their *rounds per chunk*.
 
-SCCL provides a standalone analysis to find a lower bound for the *rounds per chunk* required by any instance. For example, to find the least rouds per chunk for an Alltoall in a DGX-1:
+MSCCL provides a standalone analysis to find a lower bound for the *rounds per chunk* required by any instance. For example, to find the least rounds per chunk for a Gather in a DGX-1:
 ```
-$ sccl analyze rounds DGX1 Gather
+$ msccl analyze rounds DGX1 Gather
 Gather(n=8,root=0) algorithms need at least 7/6 rounds in DGX1 topology.
 ```
 In this case the bound happens to be tight and the `pareto-optimal` strategy would use it to detect that it has found a bandwidth optimal algorithm.
 
 ## Distributed Algorithms
 
-SCCL provides routines to synthesize algorithms for distributed topologies under the `sccl distribute` subcommand. These work by using algorithms for a local collective and stitcing instances of it together to create a distributed one.
+MSCCL provides routines to synthesize algorithms for distributed topologies under the `msccl distribute` subcommand. These work by using algorithms for a local collective and stitching instances of it together to create a distributed one.
 
 **Alltoall from Gather and Scatter:** `alltoall-gather-scatter` combines a Gather and a Scatter algorithm with a transpose step in the middle to form a distributed Alltoall algorithm. For example, an Alltoall algorithm for a cluster of 4 DGX-1 machines can be created with:
 ```
-sccl solve least-steps DGX1 Gather -o gather.json
-sccl solve least-steps DGX1 Scatter -o scatter.json --root 1
-sccl distribute alltoall-gather-scatter gather.json scatter.json --copies 4 -o alltoall.json
+msccl solve least-steps DGX1 Gather -o gather.json
+msccl solve least-steps DGX1 Scatter -o scatter.json --root 1
+msccl distribute alltoall-gather-scatter gather.json scatter.json --copies 4 -o alltoall.json
 ```
-This distributor works with any Gather and Scatter algorithm, as long as their roots have a direct connection in the topology. SCCL also provides multi-root versions of Gather and Scatter that can be substituted here.
+This distributor works with any Gather and Scatter algorithm, as long as their roots have a direct connection in the topology. MSCCL also provides multi-root versions of Gather and Scatter that can be substituted here.
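The rounds-per-chunk accounting in SYNTHESIS.md can be made concrete with a tiny sketch using the numbers from the `pareto-optimal` run above (`rounds_per_chunk` is a hypothetical helper for illustration, not an MSCCL API):

```python
from fractions import Fraction

# "Rounds per chunk" as described in the Steps and Rounds section: an
# algorithm that uses R rounds in total to move C chunks costs R/C
# rounds per chunk.
def rounds_per_chunk(rounds: int, chunks: int) -> Fraction:
    return Fraction(rounds, chunks)

# The two Pareto optimal Allgather algorithms the pareto-optimal run wrote out:
latency_optimal = rounds_per_chunk(3, 2)    # steps=2, rounds=3, chunks=2
bandwidth_optimal = rounds_per_chunk(7, 6)  # steps=3, rounds=7, chunks=6

# The analysis reported a lower bound of 7/6 rounds per chunk, which the
# second algorithm meets, so it is bandwidth optimal.
print(latency_optimal, bandwidth_optimal)  # 3/2 7/6
```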

dockerfiles/Dockerfile (+8, -8)

@@ -50,10 +50,10 @@ RUN cd ${STAGE_DIR} && mkdir openmpi/ && cd openmpi && wget https://www.open-mpi
     rm -rf ${STAGE_DIR}/openmpi/
 
 ##############################################################################
-# SCCL
+# MSCCL
 ##############################################################################
 
-# update NCCL in pytorch, install SCCL interpreter
+# update NCCL in pytorch, install MSCCL interpreter
 RUN pip uninstall torch -y
 
 RUN pip install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
@@ -62,11 +62,11 @@ RUN conda install -c pytorch magma-cuda111 -y
 
 ENV CMAKE_PREFIX_PATH=/opt/conda
 
-# Change NCCL to SCCL Runtime
+# Change NCCL to MSCCL Runtime
 RUN cd ${STAGE_DIR} && \
     git clone https://github.com/pytorch/pytorch.git && \
     cd pytorch && \
-    git checkout tags/v1.9.0 -b v1.9.0_sccl && \
+    git checkout tags/v1.9.0 -b v1.9.0_msccl && \
     perl -p -i -e 's/url = https:\/\/github\.com\/NVIDIA\/nccl/url = https:\/\/github\.com\/microsoft\/msccl/g' .gitmodules && \
     git submodule sync third_party/nccl && \
     git submodule update --init --recursive && \
@@ -79,12 +79,12 @@ RUN cd ${STAGE_DIR} && \
     cd ${STAGE_DIR} && \
     rm -rf ${STAGE_DIR}/pytorch
 
-# Install SCCL
+# Install MSCCL
 RUN cd ${STAGE_DIR}/ && \
-    git clone https://github.com/microsoft/sccl.git && \
-    cd sccl/ && python setup.py install && \
+    git clone https://github.com/microsoft/msccl.git && \
+    cd msccl/ && python setup.py install && \
     cd ${STAGE_DIR} && \
-    rm -rf ${STAGE_DIR}/sccl/
+    rm -rf ${STAGE_DIR}/msccl/
 
 ##############################################################################
 # inspector-topo
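Both the README and the Dockerfile retarget PyTorch's NCCL submodule with a perl one-liner over `.gitmodules`. As an illustrative equivalent (a sketch, not part of the commit; the `.gitmodules` snippet below is an assumed example of PyTorch's layout), the same substitution in Python:

```python
import re

# The substitution the perl one-liner performs on .gitmodules: point the
# nccl submodule at microsoft/msccl instead of NVIDIA/nccl.
def retarget_submodule(text: str) -> str:
    return re.sub(
        r"url = https://github\.com/NVIDIA/nccl",
        "url = https://github.com/microsoft/msccl",
        text,
    )

# An assumed .gitmodules fragment for illustration.
gitmodules = (
    '[submodule "third_party/nccl/nccl"]\n'
    "\tpath = third_party/nccl/nccl\n"
    "\turl = https://github.com/NVIDIA/nccl\n"
)
print(retarget_submodule(gitmodules))
```

The subsequent `git submodule sync` and `git submodule update` commands then make git pick up the rewritten URL.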
