@@ -58,16 +58,16 @@ distributed (NCCL only when building with CUDA). MPI is an optional backend that
included if you build PyTorch from source. (e.g. building PyTorch on a host that has MPI
installed.)

- .. warning::
- As of PyTorch v1.7, Windows support for the distributed package only covers collective
- communications with Gloo backend, `FileStore`, and `DistributedDataParallel`. Therefore,
- the `init_method` argument in :func:`init_process_group` must point to a file. This works
- for both local and shared file systems:
+ .. note::
+ As of PyTorch v1.8, Windows supports all collective communications backends except NCCL.
+ If the `init_method` argument of :func:`init_process_group` points to a file, it must adhere
+ to the following schema:

- Local file system, ``init_method="file:///d:/tmp/some_file"``
- Shared file system, ``init_method="file://////{machine_name}/{share_folder_name}/some_file"``

- Similarly, if you directly pass in a `store` argument, it must be a ``FileStore`` instance.
+ As on the Linux platform, you can enable ``TCPStore`` by setting the environment variables
+ MASTER_ADDR and MASTER_PORT.

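To make the added note concrete, here is a minimal sketch of process-group initialization on Windows; it is not part of the patch, and the ``gloo`` backend choice, the file path, and the address/port defaults are illustrative assumptions::

    import os
    import torch.distributed as dist

    def init_windows_process_group(rank: int, world_size: int, use_file_store: bool = True):
        """Illustrative helper, not a PyTorch API."""
        if use_file_store:
            # File-based rendezvous; a shared folder works the same way, e.g.
            # init_method="file://////{machine_name}/{share_folder_name}/some_file"
            dist.init_process_group(
                backend="gloo",  # NCCL is not available on Windows
                init_method="file:///d:/tmp/some_file",
                rank=rank,
                world_size=world_size,
            )
        else:
            # TCP-based rendezvous driven by environment variables, as on Linux.
            os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
            os.environ.setdefault("MASTER_PORT", "29500")
            dist.init_process_group(
                backend="gloo",
                init_method="env://",
                rank=rank,
                world_size=world_size,
            )
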
Which backend to use?
^^^^^^^^^^^^^^^^^^^^^
@@ -330,13 +330,13 @@ as they should never be created manually, but they are guaranteed to support two

Synchronous and asynchronous collective operations
--------------------------------------------------
- Every collective operation function supports the following two kinds of operations,
+ Every collective operation function supports the following two kinds of operations,
depending on the setting of the ``async_op`` flag passed into the collective:

**Synchronous operation** - the default mode, when ``async_op`` is set to ``False``.
When the function returns, it is guaranteed that
the collective operation is performed. In the case of CUDA operations, it is not guaranteed
- that the CUDA operation is completed, since CUDA operations are asynchronous. For CPU collectives, any
+ that the CUDA operation is completed, since CUDA operations are asynchronous. For CPU collectives, any
further function calls utilizing the output of the collective call will behave as expected. For CUDA collectives,
function calls utilizing the output on the same CUDA stream will behave as expected. Users must take care of
synchronization under the scenario of running under different streams. For details on CUDA semantics such as stream
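As a quick illustration of the two modes (a sketch, not taken from the patch; it assumes the default process group is already initialized and uses a small CPU tensor)::

    import torch
    import torch.distributed as dist

    t = torch.ones(4)

    # Synchronous (default): for CPU collectives, ``t`` holds the reduced
    # value as soon as the call returns (see the CUDA caveat above).
    dist.all_reduce(t)

    # Asynchronous: returns a work handle immediately; synchronize before using ``t``.
    work = dist.all_reduce(t, async_op=True)
    work.wait()
    print(t)
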
@@ -347,12 +347,12 @@ See the below script to see examples of differences in these semantics for CPU a
returns a distributed request object. In general, you don't need to create it manually and it
is guaranteed to support two methods:

- * ``is_completed()`` - in the case of CPU collectives, returns ``True`` if completed. In the case of CUDA operations,
- returns ``True`` if the operation has been successfully enqueued onto a CUDA stream and the output can be utilized on the
- default stream without further synchronization.
+ * ``is_completed()`` - in the case of CPU collectives, returns ``True`` if completed. In the case of CUDA operations,
+ returns ``True`` if the operation has been successfully enqueued onto a CUDA stream and the output can be utilized on the
+ default stream without further synchronization.
* ``wait()`` - in the case of CPU collectives, will block the process until the operation is completed. In the case
- of CUDA collectives, will block until the operation has been successfully enqueued onto a CUDA stream and the
- output can be utilized on the default stream without further synchronization.
+ of CUDA collectives, will block until the operation has been successfully enqueued onto a CUDA stream and the
+ output can be utilized on the default stream without further synchronization.
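For example, the handle's two methods can be used to overlap other work with an in-flight collective (illustrative only; it assumes an initialized default process group)::

    import torch
    import torch.distributed as dist

    grad = torch.randn(1024)
    work = dist.all_reduce(grad, async_op=True)

    # Overlap unrelated work while the collective is in flight.
    while not work.is_completed():
        pass  # e.g. run another computation step here

    # ``wait()`` returns immediately once the handle reports completion;
    # call it before touching ``grad`` if you skipped the polling loop.
    work.wait()
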

**Example**

@@ -368,7 +368,7 @@ It shows the explicit need to synchronize when using collective outputs on diffe
handle = dist.all_reduce(output, async_op=True)
# Wait ensures the operation is enqueued, but not necessarily complete.
handle.wait()
- # Using result on non-default stream.
+ # Using result on non-default stream.
with torch.cuda.stream(s):
    s.wait_stream(torch.cuda.default_stream())
    output.add_(100)
@@ -382,7 +382,7 @@ It shows the explicit need to synchronize when using collective outputs on diffe
Collective functions
--------------------

- .. autofunction:: broadcast
+ .. autofunction:: broadcast

.. autofunction:: broadcast_object_list

@@ -426,7 +426,7 @@ you can find an implementation of those in the `torch.distributed.nn.*` module.
Functions here are synchronous and will be inserted in the autograd graph, so
you need to ensure that all the processes that participated in the collective operation
will do the backward pass for the backward communication to effectively happen and
- don't cause a deadlock.
+ don't cause a deadlock.

Please notice that currently the only backend where all the functions are guaranteed to work is ``gloo``.
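A minimal sketch of the symmetry requirement described above, assuming a ``gloo`` process group is already initialized and that the differentiable ``torch.distributed.nn.broadcast`` takes a tensor and a source rank::

    import torch
    import torch.distributed.nn as dist_nn

    x = torch.ones(3, requires_grad=True)

    # The autograd-aware collective is recorded in the graph on every rank.
    y = dist_nn.broadcast(x, src=0)
    loss = y.sum()

    # Every rank that ran the collective must also run backward,
    # otherwise the backward-pass communication deadlocks.
    loss.backward()
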

.. autofunction:: torch.distributed.nn.broadcast