@@ -58,16 +58,16 @@ distributed (NCCL only when building with CUDA). MPI is an optional backend that
included if you build PyTorch from source. (e.g. building PyTorch on a host that has MPI
installed.)

- .. warning::
- As of PyTorch v1.7, Windows support for the distributed package only covers collective
- communications with Gloo backend, `FileStore`, and `DistributedDataParallel`. Therefore,
- the `init_method` argument in :func:`init_process_group` must point to a file. This works
- for both local and shared file systems:
+ .. note::
+ As of PyTorch v1.8, Windows supports all collective communications backends except NCCL.
+ If the `init_method` argument of :func:`init_process_group` points to a file, it must adhere
+ to the following schema:

- Local file system, ``init_method="file:///d:/tmp/some_file"``
- Shared file system, ``init_method="file://////{machine_name}/{share_folder_name}/some_file"``

- Similarly, if you directly pass in a `store` argument, it must be a ``FileStore`` instance.
+ As on the Linux platform, you can enable ``TCPStore`` by setting the environment variables
+ MASTER_ADDR and MASTER_PORT.

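To make the added note concrete, here is a minimal sketch of process-group initialization on Windows; it is not part of the patch, and the ``gloo`` backend choice, the file path, and the address/port defaults are illustrative assumptions::

    import os
    import torch.distributed as dist

    def init_windows_process_group(rank: int, world_size: int, use_file_store: bool = True):
        """Illustrative helper, not a PyTorch API."""
        if use_file_store:
            # File-based rendezvous; a shared folder works the same way, e.g.
            # init_method="file://////{machine_name}/{share_folder_name}/some_file"
            dist.init_process_group(
                backend="gloo",  # NCCL is not available on Windows
                init_method="file:///d:/tmp/some_file",
                rank=rank,
                world_size=world_size,
            )
        else:
            # TCP-based rendezvous driven by environment variables, as on Linux.
            os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
            os.environ.setdefault("MASTER_PORT", "29500")
            dist.init_process_group(
                backend="gloo",
                init_method="env://",
                rank=rank,
                world_size=world_size,
            )
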
Which backend to use?
^^^^^^^^^^^^^^^^^^^^^
@@ -330,13 +330,13 @@ as they should never be created manually, but they are guaranteed to support two

Synchronous and asynchronous collective operations
--------------------------------------------------
- Every collective operation function supports the following two kinds of operations,
+ Every collective operation function supports the following two kinds of operations,
depending on the setting of the ``async_op`` flag passed into the collective:

**Synchronous operation** - the default mode, when ``async_op`` is set to ``False``.
When the function returns, it is guaranteed that
the collective operation is performed. In the case of CUDA operations, it is not guaranteed
- that the CUDA operation is completed, since CUDA operations are asynchronous. For CPU collectives, any
+ that the CUDA operation is completed, since CUDA operations are asynchronous. For CPU collectives, any
further function calls utilizing the output of the collective call will behave as expected. For CUDA collectives,
function calls utilizing the output on the same CUDA stream will behave as expected. Users must take care of
synchronization under the scenario of running under different streams. For details on CUDA semantics such as stream
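As a quick illustration of the two modes (a sketch, not taken from the patch; it assumes the default process group is already initialized and uses a small CPU tensor)::

    import torch
    import torch.distributed as dist

    t = torch.ones(4)

    # Synchronous (default): for CPU collectives, ``t`` holds the reduced
    # value as soon as the call returns (see the CUDA caveat above).
    dist.all_reduce(t)

    # Asynchronous: returns a work handle immediately; synchronize before using ``t``.
    work = dist.all_reduce(t, async_op=True)
    work.wait()
    print(t)
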
@@ -347,12 +347,12 @@ See the below script to see examples of differences in these semantics for CPU a
returns a distributed request object. In general, you don't need to create it manually and it
is guaranteed to support two methods:

- * ``is_completed()`` - in the case of CPU collectives, returns ``True`` if completed. In the case of CUDA operations,
- returns ``True`` if the operation has been successfully enqueued onto a CUDA stream and the output can be utilized on the
- default stream without further synchronization.
+ * ``is_completed()`` - in the case of CPU collectives, returns ``True`` if completed. In the case of CUDA operations,
+ returns ``True`` if the operation has been successfully enqueued onto a CUDA stream and the output can be utilized on the
+ default stream without further synchronization.
* ``wait()`` - in the case of CPU collectives, will block the process until the operation is completed. In the case
- of CUDA collectives, will block until the operation has been successfully enqueued onto a CUDA stream and the
- output can be utilized on the default stream without further synchronization.
+ of CUDA collectives, will block until the operation has been successfully enqueued onto a CUDA stream and the
+ output can be utilized on the default stream without further synchronization.
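For example, the handle's two methods can be used to overlap other work with an in-flight collective (illustrative only; it assumes an initialized default process group)::

    import torch
    import torch.distributed as dist

    grad = torch.randn(1024)
    work = dist.all_reduce(grad, async_op=True)

    # Overlap unrelated work while the collective is in flight.
    while not work.is_completed():
        pass  # e.g. run another computation step here

    # ``wait()`` returns immediately once the handle reports completion;
    # call it before touching ``grad`` if you skipped the polling loop.
    work.wait()
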

**Example**

@@ -368,7 +368,7 @@ It shows the explicit need to synchronize when using collective outputs on diffe
handle = dist.all_reduce(output, async_op=True)
# Wait ensures the operation is enqueued, but not necessarily complete.
handle.wait()
- # Using result on non-default stream.
+ # Using result on non-default stream.
with torch.cuda.stream(s):
    s.wait_stream(torch.cuda.default_stream())
    output.add_(100)
@@ -382,7 +382,7 @@ It shows the explicit need to synchronize when using collective outputs on diffe
Collective functions
--------------------

- .. autofunction:: broadcast
+ .. autofunction:: broadcast

.. autofunction:: broadcast_object_list

@@ -426,7 +426,7 @@ you can find an implementation of those in the `torch.distributed.nn.*` module.
Functions here are synchronous and will be inserted in the autograd graph, so
you need to ensure that all the processes that participated in the collective operation
will do the backward pass for the backward communication to effectively happen and
- don't cause a deadlock.
+ don't cause a deadlock.

Please notice that currently the only backend where all the functions are guaranteed to work is ``gloo``.
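A minimal sketch of the symmetry requirement described above, assuming a ``gloo`` process group is already initialized and that the differentiable ``torch.distributed.nn.broadcast`` takes a tensor and a source rank::

    import torch
    import torch.distributed.nn as dist_nn

    x = torch.ones(3, requires_grad=True)

    # The autograd-aware collective is recorded in the graph on every rank.
    y = dist_nn.broadcast(x, src=0)
    loss = y.sum()

    # Every rank that ran the collective must also run backward,
    # otherwise the backward-pass communication deadlocks.
    loss.backward()
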

.. autofunction:: torch.distributed.nn.broadcast