Remove references to TorchScript in docs #3453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Commits
83789c5 Remove references to TorchScript in docs (gmagogsfm)
79d4a82 Merge branch 'main' into ts_remove_docs (svekars)
0a7cdab Add missing todo and redirects (gmagogsfm)
a7f0363 Merge remote-tracking branch 'upstream/main' into ts_remove_docs (yiming0416)
fb8aa15 Apply suggestions from code review (svekars)
@@ -1,281 +1,3 @@
Dynamic Parallelism in TorchScript
==================================

.. warning:: TorchScript is no longer in active development.
In this tutorial, we introduce the syntax for doing *dynamic inter-op parallelism*
in TorchScript. This parallelism has the following properties:

* dynamic - The number of parallel tasks created and their workload can depend on the control flow of the program.
* inter-op - The parallelism is concerned with running TorchScript program fragments in parallel. This is distinct from *intra-op parallelism*, which is concerned with splitting up individual operators and running subsets of the operator's work in parallel. (The aside right after this list sketches how the two corresponding thread pools can be inspected.)
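As an aside that is not part of the original tutorial: the two kinds of parallelism
are served by separate thread pools, and PyTorch exposes accessors for both. The
sketch below only queries and (optionally) sets the pool sizes; the value ``8`` is
an arbitrary illustration, not a recommendation.

.. code-block:: python

    import torch

    # Intra-op pool: threads used *within* a single operator
    # (for example, a large matrix multiplication).
    print("intra-op threads:", torch.get_num_threads())

    # Inter-op pool: threads used to run independent program fragments,
    # such as the tasks created by ``torch.jit.fork``, in parallel.
    print("inter-op threads:", torch.get_num_interop_threads())

    # Pool sizes should be adjusted early, before parallel work is scheduled.
    # torch.set_num_interop_threads(8)  # hypothetical setting, for illustration only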
Basic Syntax
------------

The two important APIs for dynamic parallelism are:

* ``torch.jit.fork(fn : Callable[..., T], *args, **kwargs) -> torch.jit.Future[T]``
* ``torch.jit.wait(fut : torch.jit.Future[T]) -> T``

A good way to demonstrate how these work is by way of an example:
.. code-block:: python

    import torch

    def foo(x):
        return torch.neg(x)

    @torch.jit.script
    def example(x):
        # Call `foo` using parallelism:
        # First, we "fork" off a task. This task will run `foo` with argument `x`
        future = torch.jit.fork(foo, x)

        # Call `foo` normally
        x_normal = foo(x)

        # Second, we "wait" on the task. Since the task may be running in
        # parallel, we have to "wait" for its result to become available.
        # Notice that by having lines of code between the "fork()" and "wait()"
        # call for a given Future, we can overlap computations so that they
        # run in parallel.
        x_parallel = torch.jit.wait(future)

        return x_normal, x_parallel

    print(example(torch.ones(1))) # (-1., -1.)
``fork()`` takes the callable ``fn`` and arguments to that callable ``args``
and ``kwargs`` and creates an asynchronous task for the execution of ``fn``.
``fn`` can be a function, method, or Module instance. ``fork()`` returns a
reference to the value of the result of this execution, called a ``Future``.
Because ``fork`` returns immediately after creating the async task, ``fn`` may
not have been executed by the time the line of code after the ``fork()`` call
is executed. Thus, ``wait()`` is used to wait for the async task to complete
and return the value.
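Because ``fn`` may also be a ``Module`` instance, the same pattern works for
submodules. The following is a minimal sketch (not from the original tutorial)
that forks one ``torch.nn.Linear`` submodule while another runs on the calling
thread; the layer sizes are arbitrary, and the applied example later in this
tutorial uses the same idea with LSTMs:

.. code-block:: python

    import torch

    class TwoBranch(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.branch_a = torch.nn.Linear(128, 128)
            self.branch_b = torch.nn.Linear(128, 128)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Forking a Module instance schedules its forward() as an async task.
            fut = torch.jit.fork(self.branch_a, x)
            # Meanwhile, run the other branch on the current thread.
            out_b = self.branch_b(x)
            # Wait for the forked branch and combine the results.
            out_a = torch.jit.wait(fut)
            return out_a + out_b

    scripted = torch.jit.script(TwoBranch())
    print(scripted(torch.randn(4, 128)).shape)  # torch.Size([4, 128])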
These constructs can be used to overlap the execution of statements within a
function (shown in the worked example section) or be composed with other language
constructs like loops:
.. code-block:: python

    import torch
    from typing import List

    def foo(x):
        return torch.neg(x)

    @torch.jit.script
    def example(x):
        futures : List[torch.jit.Future[torch.Tensor]] = []
        for _ in range(100):
            futures.append(torch.jit.fork(foo, x))

        results = []
        for future in futures:
            results.append(torch.jit.wait(future))

        return torch.sum(torch.stack(results))

    print(example(torch.ones([])))
.. note::

    When we initialized an empty list of Futures, we needed to add an explicit
    type annotation to ``futures``. In TorchScript, empty containers default
    to assuming they contain Tensor values, so we annotate the list constructor
    as being of type ``List[torch.jit.Future[torch.Tensor]]``.

This example uses ``fork()`` to launch 100 instances of the function ``foo``,
waits on the 100 tasks to complete, then sums the results, returning ``-100.0``.
Applied Example: Ensemble of Bidirectional LSTMs
------------------------------------------------

Let's try to apply parallelism to a more realistic example and see what sort
of performance we can get out of it. First, let's define the baseline model: an
ensemble of bidirectional LSTM layers.
.. code-block:: python

    import torch, time

    # In RNN parlance, the dimensions we care about are:
    # # of time-steps (T)
    # Batch size (B)
    # Hidden size/number of "channels" (C)
    T, B, C = 50, 50, 1024

    # A module that defines a single "bidirectional LSTM". This is simply two
    # LSTMs applied to the same sequence, but one in reverse
    class BidirectionalRecurrentLSTM(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.cell_f = torch.nn.LSTM(input_size=C, hidden_size=C)
            self.cell_b = torch.nn.LSTM(input_size=C, hidden_size=C)

        def forward(self, x : torch.Tensor) -> torch.Tensor:
            # Forward layer
            output_f, _ = self.cell_f(x)

            # Backward layer. Flip input in the time dimension (dim 0), apply the
            # layer, then flip the outputs in the time dimension
            x_rev = torch.flip(x, dims=[0])
            output_b, _ = self.cell_b(x_rev)
            output_b_rev = torch.flip(output_b, dims=[0])

            return torch.cat((output_f, output_b_rev), dim=2)


    # An "ensemble" of `BidirectionalRecurrentLSTM` modules. The modules in the
    # ensemble are run one-by-one on the same input then their results are
    # stacked and summed together, returning the combined result.
    class LSTMEnsemble(torch.nn.Module):
        def __init__(self, n_models):
            super().__init__()
            self.n_models = n_models
            self.models = torch.nn.ModuleList([
                BidirectionalRecurrentLSTM() for _ in range(self.n_models)])

        def forward(self, x : torch.Tensor) -> torch.Tensor:
            results = []
            for model in self.models:
                results.append(model(x))
            return torch.stack(results).sum(dim=0)

    # For a head-to-head comparison to what we're going to do with fork/wait, let's
    # instantiate the model and compile it with TorchScript
    ens = torch.jit.script(LSTMEnsemble(n_models=4))

    # Normally you would pull this input out of an embedding table, but for the
    # purpose of this demo let's just use random data.
    x = torch.rand(T, B, C)

    # Let's run the model once to warm up things like the memory allocator
    ens(x)

    x = torch.rand(T, B, C)

    # Let's see how fast it runs!
    s = time.time()
    ens(x)
    print('Inference took', time.time() - s, 'seconds')
On my machine, this network runs in ``2.05`` seconds. We can do a lot better!

Parallelizing Forward and Backward Layers
-----------------------------------------
A very simple thing we can do is parallelize the forward and backward layers
within ``BidirectionalRecurrentLSTM``. For this, the structure of the computation
is static, so we don't actually even need any loops. Let's rewrite the ``forward``
method of ``BidirectionalRecurrentLSTM`` like so:
.. code-block:: python

    def forward(self, x : torch.Tensor) -> torch.Tensor:
        # Forward layer - fork() so this can run in parallel to the backward
        # layer
        future_f = torch.jit.fork(self.cell_f, x)

        # Backward layer. Flip input in the time dimension (dim 0), apply the
        # layer, then flip the outputs in the time dimension
        x_rev = torch.flip(x, dims=[0])
        output_b, _ = self.cell_b(x_rev)
        output_b_rev = torch.flip(output_b, dims=[0])

        # Retrieve the output from the forward layer. Note this needs to happen
        # *after* the stuff we want to parallelize with
        output_f, _ = torch.jit.wait(future_f)

        return torch.cat((output_f, output_b_rev), dim=2)
In this example, ``forward()`` delegates execution of ``cell_f`` to another thread,
while it continues to execute ``cell_b``. This allows the execution of the two
cells to overlap.

Running the script again with this simple modification yields a runtime of
``1.71`` seconds for an improvement of ``17%``!
Aside: Visualizing Parallelism
------------------------------

We're not done optimizing our model but it's worth introducing the tooling we
have for visualizing performance. One important tool is the `PyTorch profiler <https://pytorch.org/docs/stable/autograd.html#profiler>`_.

Let's use the profiler along with the Chrome trace export functionality to
visualize the performance of our parallelized model:
.. code-block:: python

    with torch.autograd.profiler.profile() as prof:
        ens(x)
    prof.export_chrome_trace('parallel.json')
This snippet of code will write out a file named ``parallel.json``. If you
navigate Google Chrome to ``chrome://tracing``, click the ``Load`` button, and
load in that JSON file, you should see a timeline like the following:

.. image:: https://i.imgur.com/rm5hdG9.png
The horizontal axis of the timeline represents time and the vertical axis
represents threads of execution. As we can see, we are running two ``lstm``
instances at a time. This is the result of our hard work parallelizing the
bidirectional layers!
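As a side note not present in the original text, the same Chrome trace can also
be produced with the newer ``torch.profiler`` API, which current PyTorch releases
recommend over ``torch.autograd.profiler``. A minimal sketch, assuming ``ens`` and
``x`` are defined as above:

.. code-block:: python

    from torch.profiler import profile, ProfilerActivity

    # Profile CPU activity and export the same kind of Chrome trace file.
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        ens(x)
    prof.export_chrome_trace('parallel.json')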
Parallelizing Models in the Ensemble
------------------------------------

You may have noticed that there is a further parallelization opportunity in our
code: we can also run the models contained in ``LSTMEnsemble`` in parallel with
each other. The way to do that is simple enough; here is how we should change
the ``forward`` method of ``LSTMEnsemble``:
.. code-block:: python

    def forward(self, x : torch.Tensor) -> torch.Tensor:
        # Launch tasks for each model
        futures : List[torch.jit.Future[torch.Tensor]] = []
        for model in self.models:
            futures.append(torch.jit.fork(model, x))

        # Collect the results from the launched tasks
        results : List[torch.Tensor] = []
        for future in futures:
            results.append(torch.jit.wait(future))

        return torch.stack(results).sum(dim=0)
Or, if you value brevity, we can use list comprehensions:
.. code-block:: python

    def forward(self, x : torch.Tensor) -> torch.Tensor:
        futures = [torch.jit.fork(model, x) for model in self.models]
        results = [torch.jit.wait(fut) for fut in futures]
        return torch.stack(results).sum(dim=0)
As described in the intro, we've used loops to fork off tasks for each of the
models in our ensemble. We've then used another loop to wait for all of the
tasks to be completed. This provides even more overlap of computation.

With this small update, the script runs in ``1.4`` seconds, for a total speedup
of ``32%``! Pretty good for two lines of code.
We can also use the Chrome tracer again to see what's going on:

.. image:: https://i.imgur.com/kA0gyQm.png

We can now see that all ``LSTM`` instances are being run fully in parallel.
Conclusion
----------

In this tutorial, we learned about ``fork()`` and ``wait()``, the basic APIs
for doing dynamic, inter-op parallelism in TorchScript. We saw a few typical
usage patterns for using these functions to parallelize the execution of
functions, methods, or ``Modules`` in TorchScript code. Finally, we worked through
an example of optimizing a model using this technique and explored the performance
measurement and visualization tooling available in PyTorch.
.. warning::
    TorchScript is deprecated, please use
    `torch.export <https://docs.pytorch.org/tutorials/intermediate/torch_export_tutorial.html>`_ instead.
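For readers migrating existing code, the following is a minimal, hedged sketch of
the ``torch.export`` path rather than an authoritative recipe; it assumes PyTorch
2.x with ``torch.export`` available and reuses the ``LSTMEnsemble``, ``T``, ``B``,
and ``C`` from this tutorial purely for illustration:

.. code-block:: python

    import torch
    from torch.export import export

    # Capture the model as an ExportedProgram instead of scripting it.
    ens = LSTMEnsemble(n_models=4)
    example_inputs = (torch.rand(T, B, C),)
    exported = export(ens, example_inputs)

    # The exported program can be run directly or lowered further by
    # downstream tooling.
    out = exported.module()(*example_inputs)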