Commit d933bf7

beginner_source/ddp_series_intro.rst translation (#892)

1 parent 9b1ac85 commit d933bf7
1 file changed: beginner_source/ddp_series_intro.rst (+32 -46)
@@ -1,56 +1,42 @@
-**Introduction** \|\| `What is DDP <ddp_series_theory.html>`__ \|\|
-`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
-`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
-`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
-`minGPT Training <../intermediate/ddp_series_minGPT.html>`__
+**Introduction** \|\| `What is DDP <ddp_series_theory.html>`__ \|\|
+`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
+`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
+`Multi-Node Training <../intermediate/ddp_series_multinode.html>`__ \|\|
+`minGPT Training <../intermediate/ddp_series_minGPT.html>`__
 
-Distributed Data Parallel in PyTorch - Video Tutorials
-======================================================
+Distributed Data Parallel in PyTorch - Video Tutorials
+======================================================
 
-Authors: `Suraj Subramanian <https://github.com/suraj813>`__
+Author: `Suraj Subramanian <https://github.com/suraj813>`__
+Translation: `송호준 <https://github.com/hojunking>`_
 
-Follow along with the video below or on `youtube <https://www.youtube.com/watch/-K3bZYHYHEA>`__.
+Follow along with the video below, or watch it on `YouTube <https://www.youtube.com/watch/-K3bZYHYHEA>`__.
 
 .. raw:: html
 
    <div style="margin-top:10px; margin-bottom:10px;">
      <iframe width="560" height="315" src="https://www.youtube.com/embed/-K3bZYHYHEA" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    </div>
 
-This series of video tutorials walks you through distributed training in
-PyTorch via DDP.
-
-The series starts with a simple non-distributed training job, and ends
-with deploying a training job across several machines in a cluster.
-Along the way, you will also learn about
-`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ for
-fault-tolerant distributed training.
-
-The tutorial assumes a basic familiarity with model training in PyTorch.
-
-Running the code
-----------------
-
-You will need multiple CUDA GPUs to run the tutorial code. Typically,
-this can be done on a cloud instance with multiple GPUs (the tutorials
-use an Amazon EC2 P3 instance with 4 GPUs).
-
-The tutorial code is hosted in this
-`github repo <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__.
-Clone the repository and follow along!
-
-Tutorial sections
------------------
-
-0. Introduction (this page)
-1. `What is DDP? <ddp_series_theory.html>`__ Gently introduces what DDP is doing
-   under the hood
-2. `Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ Training models
-   using multiple GPUs on a single machine
-3. `Fault-tolerant distributed training <ddp_series_fault_tolerance.html>`__
-   Making your distributed training job robust with torchrun
-4. `Multi-Node training <../intermediate/ddp_series_multinode.html>`__ Training models using
-   multiple GPUs on multiple machines
-5. `Training a GPT model with DDP <../intermediate/ddp_series_minGPT.html>`__ “Real-world”
-   example of training a `minGPT <https://github.com/karpathy/minGPT>`__
-   model with DDP
+This video tutorial series walks you through distributed training in PyTorch with DDP (Distributed Data Parallel).
+
+The series starts with a simple non-distributed training job and ends by deploying a training job across multiple machines in a cluster. Along the way, you will also learn about `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ for fault-tolerant distributed training.
+
+This tutorial assumes a basic familiarity with model training in PyTorch.
+
+Running the code
+----------------
+
+You will need multiple CUDA GPUs to run the tutorial code. Typically this can be done on a cloud instance with multiple GPUs; the tutorials use an Amazon EC2 P3 instance with 4 GPUs.
+
+The tutorial code is hosted in this `GitHub repo <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__. Clone the repository and follow along!
+
+Tutorial sections
+-----------------
+
+0. Introduction (this page)
+1. `What is DDP? <ddp_series_theory.html>`__ A gentle introduction to what DDP does under the hood
+2. `Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ Training models using multiple GPUs on a single machine
+3. `Fault-tolerant distributed training <ddp_series_fault_tolerance.html>`__ Making your distributed training job robust with torchrun
+4. `Multi-Node Training <../intermediate/ddp_series_multinode.html>`__ Training models using multiple GPUs on multiple machines
+5. `Training a GPT model with DDP <../intermediate/ddp_series_minGPT.html>`__ A “real-world” example of training a `minGPT <https://github.com/karpathy/minGPT>`__ model with DDP
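The translated "Running the code" section keeps the requirement of multiple CUDA GPUs. As a quick sanity check before following along, a minimal sketch like the one below (an illustration, not part of the tutorial repo) confirms how many devices PyTorch can see:

.. code:: python

    import torch

    # The DDP series assumes a host with several CUDA GPUs,
    # e.g. an Amazon EC2 P3 instance with 4 GPUs.
    if not torch.cuda.is_available():
        raise SystemExit("No CUDA device visible; the tutorial code needs GPUs.")

    n_gpus = torch.cuda.device_count()
    print(f"{n_gpus} CUDA GPU(s) available")
    if n_gpus < 2:
        print("Fewer than 2 GPUs: the multi-GPU sections will not run as written.")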

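For orientation on where the series ends up: its core pattern is a training script that wraps the model in DistributedDataParallel and is launched by torchrun. The sketch below illustrates that pattern under stated assumptions; the toy model, tensor shapes, and the file name ``sketch.py`` are hypothetical, not code from the series.

.. code:: python

    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy stand-in model; DDP synchronizes gradients across workers
        # during backward(), so each rank runs an ordinary training step.
        model = torch.nn.Linear(10, 10).to(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        inputs = torch.randn(32, 10, device=local_rank)
        targets = torch.randn(32, 10, device=local_rank)

        optimizer.zero_grad()
        loss = F.mse_loss(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

Launched with, e.g., ``torchrun --standalone --nproc_per_node=4 sketch.py`` on a 4-GPU machine, torchrun spawns one worker per GPU and can restart failed workers, which is the fault-tolerance behavior the series covers later.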