
Commit d5cd489

gmagogsfm, svekars, and yiming0416 authored
Remove references to TorchScript in docs (#3453)
* Remove references to TorchScript in docs
* Add missing todo and redirects

---------

Co-authored-by: Svetlana Karslioglu <[email protected]>
Co-authored-by: Yiming Zhou <[email protected]>
1 parent 9a44439 commit d5cd489

23 files changed, +31 -3410 lines changed

advanced_source/cpp_export.rst

Lines changed: 3 additions & 389 deletions
Large diffs are not rendered by default.
Lines changed: 3 additions & 281 deletions
@@ -1,281 +1,3 @@

Dynamic Parallelism in TorchScript
==================================

.. warning:: TorchScript is no longer in active development.

In this tutorial, we introduce the syntax for doing *dynamic inter-op parallelism*
in TorchScript. This parallelism has the following properties:

* dynamic - The number of parallel tasks created and their workload can depend on the control flow of the program.
* inter-op - The parallelism is concerned with running TorchScript program fragments in parallel. This is distinct from *intra-op parallelism*, which is concerned with splitting up individual operators and running subsets of the operator's work in parallel.

Basic Syntax
------------

The two important APIs for dynamic parallelism are:

* ``torch.jit.fork(fn : Callable[..., T], *args, **kwargs) -> torch.jit.Future[T]``
* ``torch.jit.wait(fut : torch.jit.Future[T]) -> T``

A good way to demonstrate how these work is by way of an example:

.. code-block:: python

    import torch

    def foo(x):
        return torch.neg(x)

    @torch.jit.script
    def example(x):
        # Call `foo` using parallelism:
        # First, we "fork" off a task. This task will run `foo` with argument `x`
        future = torch.jit.fork(foo, x)

        # Call `foo` normally
        x_normal = foo(x)

        # Second, we "wait" on the task. Since the task may be running in
        # parallel, we have to "wait" for its result to become available.
        # Notice that by having lines of code between the "fork()" and "wait()"
        # call for a given Future, we can overlap computations so that they
        # run in parallel.
        x_parallel = torch.jit.wait(future)

        return x_normal, x_parallel

    print(example(torch.ones(1))) # (-1., -1.)

``fork()`` takes the callable ``fn`` and arguments to that callable ``args``
and ``kwargs`` and creates an asynchronous task for the execution of ``fn``.
``fn`` can be a function, method, or Module instance. ``fork()`` returns a
``Future``, a reference to the eventual result of this execution.
Because ``fork`` returns immediately after creating the async task, ``fn`` may
not have been executed by the time the line of code after the ``fork()`` call
is executed. Thus, ``wait()`` is used to wait for the async task to complete
and return the value.
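
For instance, ``fork()`` can be given a ``Module`` instance directly as the
callable. Here is a minimal sketch of that pattern (the ``AddOne`` module is a
hypothetical toy used only for illustration):

.. code-block:: python

    import torch

    class AddOne(torch.nn.Module):
        def forward(self, x : torch.Tensor) -> torch.Tensor:
            return x + 1

    mod = torch.jit.script(AddOne())

    # Fork the module itself as the callable; wait() blocks until the
    # asynchronous call finishes and returns (or re-raises) its result.
    fut = torch.jit.fork(mod, torch.ones(2))
    print(torch.jit.wait(fut))  # tensor([2., 2.])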

These constructs can be used to overlap the execution of statements within a
function (shown in the worked example section) or be composed with other language
constructs like loops:

.. code-block:: python

    import torch
    from typing import List

    def foo(x):
        return torch.neg(x)

    @torch.jit.script
    def example(x):
        futures : List[torch.jit.Future[torch.Tensor]] = []
        for _ in range(100):
            futures.append(torch.jit.fork(foo, x))

        results = []
        for future in futures:
            results.append(torch.jit.wait(future))

        return torch.sum(torch.stack(results))

    print(example(torch.ones([])))

.. note::

    When we initialized an empty list of Futures, we needed to add an explicit
    type annotation to ``futures``. In TorchScript, empty containers default
    to assuming they contain Tensor values, so we annotate the list constructor
    as being of type ``List[torch.jit.Future[torch.Tensor]]``.

This example uses ``fork()`` to launch 100 instances of the function ``foo``,
waits on the 100 tasks to complete, then sums the results, returning ``-100.0``.
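
As a further illustration of that rule, here is a small sketch (the
``build_containers`` function is hypothetical) showing the same annotation
requirement for other empty containers; ``torch.jit.annotate`` is an
alternative way to spell the element type:

.. code-block:: python

    import torch
    from typing import Dict, List

    @torch.jit.script
    def build_containers():
        # Without the annotation, TorchScript would assume List[Tensor] here.
        futures : List[torch.jit.Future[torch.Tensor]] = []
        # An equivalent spelling for an empty dict with non-Tensor values.
        scores = torch.jit.annotate(Dict[str, float], {})
        return len(futures), len(scores)

    print(build_containers())  # (0, 0)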

Applied Example: Ensemble of Bidirectional LSTMs
------------------------------------------------

Let's try to apply parallelism to a more realistic example and see what sort
of performance we can get out of it. First, let's define the baseline model: an
ensemble of bidirectional LSTM layers.

.. code-block:: python

    import torch, time

    # In RNN parlance, the dimensions we care about are:
    # # of time-steps (T)
    # Batch size (B)
    # Hidden size/number of "channels" (C)
    T, B, C = 50, 50, 1024

    # A module that defines a single "bidirectional LSTM". This is simply two
    # LSTMs applied to the same sequence, but one in reverse
    class BidirectionalRecurrentLSTM(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.cell_f = torch.nn.LSTM(input_size=C, hidden_size=C)
            self.cell_b = torch.nn.LSTM(input_size=C, hidden_size=C)

        def forward(self, x : torch.Tensor) -> torch.Tensor:
            # Forward layer
            output_f, _ = self.cell_f(x)

            # Backward layer. Flip input in the time dimension (dim 0), apply the
            # layer, then flip the outputs in the time dimension
            x_rev = torch.flip(x, dims=[0])
            output_b, _ = self.cell_b(x_rev)
            output_b_rev = torch.flip(output_b, dims=[0])

            return torch.cat((output_f, output_b_rev), dim=2)


    # An "ensemble" of `BidirectionalRecurrentLSTM` modules. The modules in the
    # ensemble are run one-by-one on the same input then their results are
    # stacked and summed together, returning the combined result.
    class LSTMEnsemble(torch.nn.Module):
        def __init__(self, n_models):
            super().__init__()
            self.n_models = n_models
            self.models = torch.nn.ModuleList([
                BidirectionalRecurrentLSTM() for _ in range(self.n_models)])

        def forward(self, x : torch.Tensor) -> torch.Tensor:
            results = []
            for model in self.models:
                results.append(model(x))
            return torch.stack(results).sum(dim=0)

    # For a head-to-head comparison to what we're going to do with fork/wait, let's
    # instantiate the model and compile it with TorchScript
    ens = torch.jit.script(LSTMEnsemble(n_models=4))

    # Normally you would pull this input out of an embedding table, but for the
    # purpose of this demo let's just use random data.
    x = torch.rand(T, B, C)

    # Let's run the model once to warm up things like the memory allocator
    ens(x)

    x = torch.rand(T, B, C)

    # Let's see how fast it runs!
    s = time.time()
    ens(x)
    print('Inference took', time.time() - s, ' seconds')

On my machine, this network runs in ``2.05`` seconds. We can do a lot better!

Parallelizing Forward and Backward Layers
-----------------------------------------

A very simple thing we can do is parallelize the forward and backward layers
within ``BidirectionalRecurrentLSTM``. For this, the structure of the computation
is static, so we don't actually even need any loops. Let's rewrite the ``forward``
method of ``BidirectionalRecurrentLSTM`` like so:

.. code-block:: python

    def forward(self, x : torch.Tensor) -> torch.Tensor:
        # Forward layer - fork() so this can run in parallel to the backward
        # layer
        future_f = torch.jit.fork(self.cell_f, x)

        # Backward layer. Flip input in the time dimension (dim 0), apply the
        # layer, then flip the outputs in the time dimension
        x_rev = torch.flip(x, dims=[0])
        output_b, _ = self.cell_b(x_rev)
        output_b_rev = torch.flip(output_b, dims=[0])

        # Retrieve the output from the forward layer. Note this needs to happen
        # *after* the stuff we want to parallelize with
        output_f, _ = torch.jit.wait(future_f)

        return torch.cat((output_f, output_b_rev), dim=2)

In this example, ``forward()`` delegates execution of ``cell_f`` to another thread,
while it continues to execute ``cell_b``. This causes the execution of both the
cells to be overlapped with each other.

Running the script again with this simple modification yields a runtime of
``1.71`` seconds for an improvement of ``17%``!

Aside: Visualizing Parallelism
------------------------------

We're not done optimizing our model, but it's worth introducing the tooling we
have for visualizing performance. One important tool is the `PyTorch profiler <https://pytorch.org/docs/stable/autograd.html#profiler>`_.

Let's use the profiler along with the Chrome trace export functionality to
visualize the performance of our parallelized model:

.. code-block:: python

    with torch.autograd.profiler.profile() as prof:
        ens(x)
    prof.export_chrome_trace('parallel.json')

This snippet of code will write out a file named ``parallel.json``. If you
navigate Google Chrome to ``chrome://tracing``, click the ``Load`` button, and
load in that JSON file, you should see a timeline like the following:

.. image:: https://i.imgur.com/rm5hdG9.png

The horizontal axis of the timeline represents time and the vertical axis
represents threads of execution. As we can see, we are running two ``lstm``
instances at a time. This is the result of our hard work parallelizing the
bidirectional layers!
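
If you prefer a textual summary to the Chrome timeline, the same profiler
object can also print aggregated per-operator timings. A small sketch (the
``sort_by`` and ``row_limit`` arguments are optional and shown only for
illustration):

.. code-block:: python

    # Aggregate the events recorded above and print the top operators by CPU time.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))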

Parallelizing Models in the Ensemble
------------------------------------

You may have noticed that there is a further parallelization opportunity in our
code: we can also run the models contained in ``LSTMEnsemble`` in parallel with
each other. The way to do that is simple enough; here is how we should change
the ``forward`` method of ``LSTMEnsemble``:

.. code-block:: python

    def forward(self, x : torch.Tensor) -> torch.Tensor:
        # Launch tasks for each model
        futures : List[torch.jit.Future[torch.Tensor]] = []
        for model in self.models:
            futures.append(torch.jit.fork(model, x))

        # Collect the results from the launched tasks
        results : List[torch.Tensor] = []
        for future in futures:
            results.append(torch.jit.wait(future))

        return torch.stack(results).sum(dim=0)

Or, if you value brevity, we can use list comprehensions:

.. code-block:: python

    def forward(self, x : torch.Tensor) -> torch.Tensor:
        futures = [torch.jit.fork(model, x) for model in self.models]
        results = [torch.jit.wait(fut) for fut in futures]
        return torch.stack(results).sum(dim=0)

As described in the intro, we've used loops to fork off tasks for each of the
models in our ensemble. We've then used another loop to wait for all of the
tasks to be completed. This provides even more overlap of computation.

With this small update, the script runs in ``1.4`` seconds, for a total speedup
of ``32%``! Pretty good for two lines of code.

We can also use the Chrome tracer again to see what's going on:

.. image:: https://i.imgur.com/kA0gyQm.png

We can now see that all ``LSTM`` instances are being run fully in parallel.

Conclusion
----------

In this tutorial, we learned about ``fork()`` and ``wait()``, the basic APIs
for doing dynamic, inter-op parallelism in TorchScript. We saw a few typical
usage patterns for using these functions to parallelize the execution of
functions, methods, or ``Modules`` in TorchScript code. Finally, we worked through
an example of optimizing a model using this technique and explored the performance
measurement and visualization tooling available in PyTorch.

.. warning::
   TorchScript is deprecated; please use
   `torch.export <https://docs.pytorch.org/tutorials/intermediate/torch_export_tutorial.html>`__ instead.
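
For a sense of the suggested replacement, here is a minimal sketch of the
``torch.export`` entry point linked above (the ``AddOne`` module is a
hypothetical toy example):

.. code-block:: python

    import torch

    class AddOne(torch.nn.Module):
        def forward(self, x : torch.Tensor) -> torch.Tensor:
            return x + 1

    # torch.export traces the module into an ExportedProgram rather than
    # TorchScript IR; ep.module() returns a runnable artifact.
    ep = torch.export.export(AddOne(), (torch.ones(2),))
    print(ep.module()(torch.ones(2)))  # tensor([2., 2.])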
