# MSCCL-tools
This repo contains the developer tool stack of the [Microsoft Collective Communication Library (MSCCL)](https://github.com/microsoft/msccl), a platform for programmable communication on GPUs. Algorithms created with MSCCL can:
- Implement either MPI-style collectives like Allreduce, or any application specific communication pattern.
- Target specific hardware and interconnect topologies, unlocking their full potential.
- Optimize for the data sizes in your application, making the best tradeoff between latency and bandwidth utilization.

MSCCL-tools also contains pre-made algorithms targeting various Azure multi-GPU VM types. See the [Available Algorithms section](#available-algorithms) to find out what is currently available.

MSCCL has two ways of creating new algorithms:
1. MSCCLang, a high-level DSL that talks about communication in an intuitive chunk-oriented form. See the [MSCCLang section](#mscclang) for how to get started.
2. Synthesis, which automatically solves optimal algorithms for a given hardware topology. Making synthesis general is an on-going research project; see the [Synthesis section](#synthesis) for an introduction.
## Usage
The MSCCL Python package ships with a registry of synthesis strategies and hand-optimized algorithms. These can be loaded into [the runtime](https://github.com/parasailteam/msccl) through the `msccl.init` function, which must be called before the application creates its NCCL communicator. For PyTorch this means before `torch.distributed` is initialized.

The following snippet requests `msccl.init` to provide an Alltoall algorithm in a configuration of 2 Azure NDv2 machines:
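(the snippet itself is elided in this diff; the sketch below shows roughly what such a call could look like, with the machine-type string, machine count, and collective/size selector all being assumptions based on the description above rather than a confirmed signature:)
```
import msccl
import torch.distributed as dist

# Hypothetical arguments: ask MSCCL for an Alltoall algorithm registered for
# 2 Azure NDv2 machines at a 1 MB data size. The real msccl.init signature
# may differ; consult the msccl package documentation.
msccl.init('ndv2', 2, (msccl.Collective.alltoall, '1MB'))

# msccl.init must run before the NCCL communicator is created, which for
# PyTorch means before torch.distributed is initialized.
dist.init_process_group(backend='nccl')
```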
## Available Algorithms

Each line lists an algorithm registration and the conditions under which it is triggered, e.g. when:

- there are 8, 16, 32 or 64 Azure NDv4 machines, and
- the data size is from 1 MB to 32 MB.

The [parasailteam/msccl-presynth](https://github.com/parasailteam/msccl-presynth) repository offers additional algorithms that have been pre-synthesized for fixed configurations. To enable them, install the package and import it before the call to `msccl.init`.
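As a sketch, assuming the package's import name is `msccl_presynth` (an assumption, not confirmed by this text):
```
import msccl
import msccl_presynth  # hypothetical import name; importing registers the pre-synthesized algorithms

msccl.init('ndv2', 2, (msccl.Collective.alltoall, '1MB'))  # same hypothetical call as above
```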
## MSCCLang
MSCCLang is a high-level language for specifying collective communication algorithms in an intuitive chunk-oriented form. The language is available as a Python-integrated DSL.

The language is still under development and lacks comprehensive documentation. For now, please refer to [the pre-print of our upcoming paper](https://arxiv.org/pdf/2201.11840.pdf) and the examples in [examples/mscclang](examples/mscclang/).
## Synthesis
MSCCL started out as a synthesizer for collective algorithms, and general synthesis of collective algorithms is an on-going research project. See [this readme](SYNTHESIS.md) for using MSCCL as a synthesizer.
## Installation
### Python Package Installation
To install, either clone this repo and run `pip install .`, or run:
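(the exact command is elided in this diff; one plausible form, with the repository URL as an assumption, is:)
```
$ pip install git+https://github.com/microsoft/msccl-tools.git
```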
MSCCL can synthesize algorithms for a given *topology* that implement a given *collective* under additional constraints such as the number of steps, bandwidth usage, and memory limits. These additional parameters are called the *instance*.

MSCCL groups its solver strategies under the `msccl solve` subcommand. For example, to synthesize a specific `instance` of an Allgather algorithm for the [NVIDIA DGX-1](https://www.nvidia.com/en-us/data-center/dgx-1/) that completes in 4 steps:
```
$ msccl solve instance DGX1 Allgather --steps 4
Solving instance steps=4... synthesized! (0.7s)
Wrote to Allgather.n8-DGX1-steps4.msccl.json
```
The instance is satisfiable and `msccl` saves it to a file.

Four steps is not necessarily the least number of steps required. To find the least steps:
```
$ msccl solve least-steps DGX1 Allgather
Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.2s)
Wrote to Allgather.n8-DGX1-steps2.msccl.json
```
The `least-steps` strategy statically determines that any Allgather in a DGX-1 requires at least 2 steps and, starting from that bound, finds the smallest satisfiable number of steps.

While this two-step algorithm is latency-optimal, there may be other algorithms that achieve higher bandwidth. The `pareto-optimal` strategy searches through different latency-bandwidth tradeoffs:
```
Found 2 Pareto optimal algorithms. Pruned 4 non-optimal algorithms.
Wrote to Allgather.n8-DGX1-steps2.rounds3.chunks2.msccl.json
Wrote to Allgather.n8-DGX1-steps3.rounds7.chunks6.msccl.json
```
## Collectives
MSCCL includes a number of built-in common collectives.

| Collective | Arguments | Description | Kind |
| - | - | - | - |

Custom collectives may be defined by instantiating the `Collective` class, which is easiest through the `build_collective` function. For example, a send from rank 2 to rank 7 in an 8-node topology can be defined and saved with:
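(the code itself is elided in this diff; the sketch below illustrates the idea, with the module paths, argument order, and save helper all being assumptions rather than a confirmed API:)
```
from msccl.collectives import build_collective      # assumed module path
from msccl.serialization import save_msccl_object   # assumed helper name

# A 'Send' collective on 8 ranks with a single chunk: the chunk starts on
# rank 2 (precondition) and must end up on rank 7 (postcondition).
precondition = lambda rank, chunk: rank == 2
postcondition = lambda rank, chunk: rank == 7
coll = build_collective('Send', 8, 1, precondition, postcondition)
save_msccl_object(coll, 'send.json')
```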
The *kind* of the collective determines support for some features of MSCCL:
- **NC** are non-combining collectives, and are always supported.
- **CR** are combining collectives that have a non-combining dual collective, and are supported through a reduction.
- **CNR** are combining collectives with no dual, which may not always be supported.

Currently the rounds per chunk analysis described below cannot support CNR collectives.
## Steps and Rounds
MSCCL uses two related concepts, *steps and rounds*, to talk about the running time of algorithms. *Steps* is how many sequential sets of sends the algorithm consists of, where all sends inside a step execute in parallel. The number of sends between two nodes in a single step is limited by the bandwidth available in the topology. However, a step may consist of multiple *rounds*, and the number of rounds acts as a multiplier for all links in the topology during that step.

How much data a single round corresponds to depends on the actual size of a chunk at runtime, and how many chunks a collective uses can change (e.g. you can control this directly in the `instance` strategy by setting `--chunks N`). Thus, for each collective, the total data usage of different algorithms implementing it can be measured with their *rounds per chunk*.

MSCCL provides a standalone analysis to find a lower bound for the *rounds per chunk* required by any instance. For example, to find the least rounds per chunk for a Gather in a DGX-1:
```
$ msccl analyze rounds DGX1 Gather
Gather(n=8,root=0) algorithms need at least 7/6 rounds in DGX1 topology.
```
In this case the bound happens to be tight and the `pareto-optimal` strategy would use it to detect that it has found a bandwidth-optimal algorithm.
## Distributed Algorithms
MSCCL provides routines to synthesize algorithms for distributed topologies under the `msccl distribute` subcommand. These work by using algorithms for a local collective and stitching instances of it together to create a distributed one.

**Alltoall from Gather and Scatter:** `alltoall-gather-scatter` combines a Gather and a Scatter algorithm with a transpose step in the middle to form a distributed Alltoall algorithm. For example, an Alltoall algorithm for a cluster of 4 DGX-1 machines can be created with:
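(the command itself is elided in this diff; the hypothetical invocation below follows the `msccl solve`/`msccl distribute` patterns shown above, with the `--copies` flag and the output file names being assumptions:)
```
$ msccl solve instance DGX1 Gather --steps 2
$ msccl solve instance DGX1 Scatter --steps 2
# Hypothetical flags and file names; check the msccl CLI help for the real interface.
$ msccl distribute alltoall-gather-scatter Gather.n8-DGX1-steps2.msccl.json Scatter.n8-DGX1-steps2.msccl.json --copies 4
```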
This distributor works with any Gather and Scatter algorithm, as long as their roots have a direct connection in the topology. MSCCL also provides multi-root versions of Gather and Scatter that can be substituted here.