`Introduction <ddp_series_intro.html>`__ \|\| **What is DDP** \|\|
`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
`minGPT Training <../intermediate/ddp_series_minGPT.html>`__

What is Distributed Data Parallel (DDP)
=======================================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__
Translator: `박지은 <https://github.com/rumjie>`__

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

      * How DDP works under the hood
      * What ``DistributedSampler`` is
      * How gradients are synchronized across GPUs

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

      * Familiarity with `basic non-distributed training <https://tutorials.pytorch.kr/beginner/basics/quickstart_tutorial.html>`__ in PyTorch

Follow along with the video below or on `YouTube <https://www.youtube.com/watch/Cvdhwx-OBBo>`__.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/Cvdhwx-OBBo" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

This tutorial is a gentle introduction to PyTorch `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ (DDP),
which enables data parallel training in PyTorch. Data parallelism is a way to
process multiple data batches across multiple devices simultaneously
to achieve better performance. In PyTorch, the `DistributedSampler <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`__
ensures each device gets a non-overlapping input batch. The model is replicated on all the devices;
each replica calculates gradients and simultaneously synchronizes with the others using the `ring all-reduce
algorithm <https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/>`__.
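
To make these pieces concrete, below is a minimal sketch of how they fit together. It assumes a single node with one process per GPU launched via ``torchrun``; the tiny ``nn.Linear`` model and the random tensors are stand-ins for a real model and dataset.

.. code-block:: python

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP
   from torch.utils.data import DataLoader, TensorDataset
   from torch.utils.data.distributed import DistributedSampler

   def main():
       # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR/PORT for each process.
       dist.init_process_group(backend="nccl")
       rank = dist.get_rank()          # one process per GPU on a single node
       torch.cuda.set_device(rank)

       # Stand-in dataset and model; any Dataset / nn.Module works the same way.
       dataset = TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1))
       model = DDP(torch.nn.Linear(20, 1).to(rank), device_ids=[rank])

       # DistributedSampler hands each process a non-overlapping shard of the data.
       sampler = DistributedSampler(dataset)
       loader = DataLoader(dataset, batch_size=32, sampler=sampler)

       optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
       loss_fn = torch.nn.MSELoss()

       for epoch in range(2):
           sampler.set_epoch(epoch)    # reshuffle the shards each epoch
           for inputs, targets in loader:
               inputs, targets = inputs.to(rank), targets.to(rank)
               optimizer.zero_grad()
               loss = loss_fn(model(inputs), targets)
               loss.backward()         # gradients are all-reduced across replicas here
               optimizer.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()

Launched with, for example, ``torchrun --nproc_per_node=4 train.py`` (``train.py`` being whatever you name this script), each of the four processes trains on a different quarter of the data, while the all-reduce during ``backward()`` keeps the replicas in sync.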

This `illustrative tutorial <https://tutorials.pytorch.kr/intermediate/dist_tuto.html#>`__ provides a more in-depth, Python-level view of the mechanics of DDP.

Why you should prefer DDP over ``DataParallel`` (DP)
----------------------------------------------------

`DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__ (DP)
is an older approach to data parallelism. DP is trivially simple (just one extra line of code), but it is much less performant.
DDP improves upon this architecture in a few ways:

.. list-table::
   :header-rows: 1

   * - ``DataParallel``
     - ``DistributedDataParallel``
   * - More overhead; the model is replicated and destroyed at each forward pass
     - The model is replicated only once
   * - Only supports single-node parallelism
     - Scales to multiple machines
   * - Slower; uses multithreading in a single process and runs into Global Interpreter Lock (GIL) contention
     - Faster (no GIL contention) because it uses multiprocessing

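The difference is easiest to see in how a model is wrapped. The sketch below contrasts the two wrappers; to stay runnable anywhere, it builds a throwaway single-process CPU group with the ``gloo`` backend and a placeholder port, rather than a real multi-GPU launch.

.. code-block:: python

   import os

   import torch.distributed as dist
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP

   model = nn.Linear(20, 1)

   # DataParallel: one extra line in a single process. The model is scattered
   # to every visible GPU on each forward pass and gathered back afterwards.
   dp_model = nn.DataParallel(model)

   # DistributedDataParallel: one process per device. A trivial single-process
   # "world" is created here only to demonstrate the wrapping; real training
   # launches one process per GPU with torchrun, as in the earlier sketch.
   os.environ.setdefault("MASTER_ADDR", "localhost")
   os.environ.setdefault("MASTER_PORT", "29500")   # placeholder port
   dist.init_process_group(backend="gloo", rank=0, world_size=1)
   ddp_model = DDP(model)
   dist.destroy_process_group()

In exchange for the extra launch step, DDP avoids the per-iteration scatter/gather and the GIL contention listed in the table above.
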
Further Reading
---------------

- `Multi-GPU training with DDP <ddp_series_multigpu.html>`__ (next tutorial in this series)
- `DDP
  API <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
- `DDP Internal