**Introduction** \|\| `What is DDP <ddp_series_theory.html>`__ \|\|
`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
`minGPT Training <../intermediate/ddp_series_minGPT.html>`__

Distributed Data Parallel in PyTorch - Video Tutorials
======================================================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__

Follow along with the video below or on `YouTube <https://www.youtube.com/watch/-K3bZYHYHEA>`__.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/-K3bZYHYHEA" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

This series of video tutorials walks you through distributed training in
PyTorch via DDP.

The series starts with a simple non-distributed training job, and ends
with deploying a training job across several machines in a cluster.
Along the way, you will also learn about
`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ for
fault-tolerant distributed training.
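
To make the end goal concrete, here is a minimal sketch of the pattern the series builds up to:
a toy single-node DDP training step launched with ``torchrun``. The file name ``toy_ddp.py`` and
the one-layer model are illustrative placeholders, not code from the video series.

.. code-block:: python

   import os

   import torch
   import torch.nn as nn
   from torch.distributed import init_process_group, destroy_process_group
   from torch.nn.parallel import DistributedDataParallel as DDP


   def main():
       # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it launches.
       init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       device = torch.device(f"cuda:{local_rank}")
       torch.cuda.set_device(device)

       # Each process holds an identical replica of the model on its own GPU.
       model = DDP(nn.Linear(20, 1).to(device), device_ids=[local_rank])
       optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

       # One step on random data; DDP averages gradients across processes
       # during backward(), keeping every replica in sync.
       inputs = torch.randn(32, 20, device=device)
       targets = torch.randn(32, 1, device=device)
       optimizer.zero_grad()
       loss = nn.functional.mse_loss(model(inputs), targets)
       loss.backward()
       optimizer.step()

       destroy_process_group()


   if __name__ == "__main__":
       main()

Running ``torchrun --standalone --nproc_per_node=4 toy_ddp.py`` would start four such processes,
one per GPU; the later videos walk through each of these pieces in detail.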

The tutorial assumes a basic familiarity with model training in PyTorch.

Running the code
----------------

You will need multiple CUDA GPUs to run the tutorial code. Typically,
this can be done on a cloud instance with multiple GPUs (the tutorials
use an Amazon EC2 P3 instance with 4 GPUs).

The tutorial code is hosted in this
`GitHub repo <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__.
Clone the repository and follow along!
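
Before launching any of the scripts, it can help to confirm that the machine actually exposes
more than one CUDA device. A quick check, assuming only that PyTorch is installed:

.. code-block:: python

   import torch

   # The multi-GPU examples in this series need at least two visible CUDA devices.
   num_gpus = torch.cuda.device_count()
   print(f"CUDA available: {torch.cuda.is_available()}, visible GPUs: {num_gpus}")
   if num_gpus < 2:
       print("Fewer than two GPUs detected; the multi-GPU scripts will not run as-is.")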

Tutorial sections
-----------------

0. Introduction (this page)
1. `What is DDP? <ddp_series_theory.html>`__ Gently introduces what DDP is doing
   under the hood
2. `Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ Training models
   using multiple GPUs on a single machine
3. `Fault-tolerant distributed training <ddp_series_fault_tolerance.html>`__
   Making your distributed training job robust with torchrun
4. `Multi-Node training <../intermediate/ddp_series_multinode.html>`__ Training models using
   multiple GPUs on multiple machines
5. `Training a GPT model with DDP <../intermediate/ddp_series_minGPT.html>`__ "Real-world"
   example of training a `minGPT <https://github.com/karpathy/minGPT>`__
   model with DDP