
Commit b8d92ec

beginner_source/ddp_series_theory.rst translation (#896)
1 parent d933bf7 commit b8d92ec

1 file changed: +42 -43 lines changed

beginner_source/ddp_series_theory.rst

@@ -1,70 +1,69 @@
-`Introduction <ddp_series_intro.html>`__ \|\| **What is DDP** \|\|
-`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
-`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
-`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
-`minGPT Training <../intermediate/ddp_series_minGPT.html>`__
+`Introduction <ddp_series_intro.html>`__ \|\| **What is Distributed Data Parallel (DDP)?** \|\|
+`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
+`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
+`Multi-Node Training <../intermediate/ddp_series_multinode.html>`__ \|\|
+`minGPT Training <../intermediate/ddp_series_minGPT.html>`__
 
-What is Distributed Data Parallel (DDP)
+What is Distributed Data Parallel (DDP)?
 =======================================
 
-Authors: `Suraj Subramanian <https://github.com/suraj813>`__
+Author: `Suraj Subramanian <https://github.com/suraj813>`__
+Translator: `박지은 <https://github.com/rumjie>`__
 
 .. grid:: 2
 
-   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn in this chapter
 
-      * How DDP works under the hood
-      * What is ``DistributedSampler``
-      * How gradients are synchronized across GPUs
+      * How DDP works under the hood
+      * What is ``DistributedSampler``?
+      * How gradients are synchronized across GPUs
 
 
-   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
 
-      * Familiarity with `basic non-distributed training <https://tutorials.pytorch.kr/beginner/basics/quickstart_tutorial.html>`__ in PyTorch
+      * Familiarity with `non-distributed training <https://tutorials.pytorch.kr/beginner/basics/quickstart_tutorial.html>`__ in PyTorch
 
-Follow along with the video below or on `youtube <https://www.youtube.com/watch/Cvdhwx-OBBo>`__.
+Follow along with the video below or on `youtube <https://www.youtube.com/watch/Cvdhwx-OBBo>`__.
 
 .. raw:: html
 
    <div style="margin-top:10px; margin-bottom:10px;">
      <iframe width="560" height="315" src="https://www.youtube.com/embed/Cvdhwx-OBBo" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    </div>
 
-This tutorial is a gentle introduction to PyTorch `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ (DDP)
-which enables data parallel training in PyTorch. Data parallelism is a way to
-process multiple data batches across multiple devices simultaneously
-to achieve better performance. In PyTorch, the `DistributedSampler <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`__
-ensures each device gets a non-overlapping input batch. The model is replicated on all the devices;
-each replica calculates gradients and simultaneously synchronizes with the others using the `ring all-reduce
-algorithm <https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/>`__.
+This tutorial introduces `Distributed Data Parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ (DDP),
+which enables data parallel training in PyTorch. Data parallelism is a way to
+process multiple data batches across multiple devices simultaneously
+in order to achieve better performance.
+In PyTorch, the `DistributedSampler <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`__
+ensures that each device receives a different input batch.
+The model is replicated on all devices; each replica calculates gradients and, at the same time, synchronizes with the other replicas using the `ring all-reduce algorithm <https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/>`__.
 
-This `illustrative tutorial <https://tutorials.pytorch.kr/intermediate/dist_tuto.html#>`__ provides a more in-depth python view of the mechanics of DDP.
+This `illustrative tutorial <https://tutorials.pytorch.kr/intermediate/dist_tuto.html#>`__ provides a more in-depth, Python-level view of the mechanics of DDP.
 
-Why you should prefer DDP over ``DataParallel`` (DP)
+Why DDP is better than ``DataParallel`` (DP)
 ----------------------------------------------------
 
-`DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
-is an older approach to data parallelism. DP is trivially simple (with just one extra line of code) but it is much less performant.
-DDP improves upon the architecture in a few ways:
-
-+---------------------------------------+------------------------------+
-| ``DataParallel``                      | ``DistributedDataParallel``  |
-+=======================================+==============================+
-| More overhead; model is replicated    | Model is replicated only     |
-| and destroyed at each forward pass    | once                         |
-+---------------------------------------+------------------------------+
-| Only supports single-node parallelism | Supports scaling to multiple |
-|                                       | machines                     |
-+---------------------------------------+------------------------------+
-| Slower; uses multithreading on a      | Faster (no GIL contention)   |
-| single process and runs into Global   | because it uses              |
-| Interpreter Lock (GIL) contention     | multiprocessing              |
-+---------------------------------------+------------------------------+
-
-Further Reading
+`DP <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__ is an older approach to data parallelism.
+DP is simple (only one extra line of code is needed), but its performance is much worse. DDP improves on this architecture in the following ways:
+
+.. list-table::
+   :header-rows: 1
+
+   * - ``DataParallel``
+     - ``DistributedDataParallel``
+   * - More overhead; the model is replicated and destroyed at every forward pass
+     - The model is replicated only once
+   * - Supports only single-node parallelism
+     - Can scale to multiple machines
+   * - Slower; uses multithreading in a single process, which runs into Global Interpreter Lock (GIL) contention
+     - Faster; no GIL contention because it uses multiprocessing
+
+
+Further Reading
 ---------------
 
-- `Multi-GPU training with DDP <ddp_series_multigpu.html>`__ (next tutorial in this series)
+- `Multi-GPU training with DDP <ddp_series_multigpu.html>`__ (next tutorial in this series)
 - `DDP
   API <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
 - `DDP Internal
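
The paragraph in the diff above describes the machinery DDP relies on: ``DistributedSampler`` hands every device a different shard of the input, the model is replicated on each device, and gradients are synchronized across replicas during the backward pass. A minimal single-node sketch of that flow, assuming a machine with CUDA GPUs and using a toy model, a random dataset, and a placeholder ``localhost:29500`` rendezvous address, could look like this::

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def run(rank, world_size):
        # One process per GPU; all processes join the same process group.
        os.environ["MASTER_ADDR"] = "localhost"   # placeholder rendezvous address
        os.environ["MASTER_PORT"] = "29500"       # placeholder port
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        # Toy dataset and model stand in for the real ones.
        dataset = TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1))
        # DistributedSampler gives each rank a different shard of the dataset.
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        # DDP replicates the model once per process and registers hooks that
        # all-reduce gradients during backward().
        model = DDP(nn.Linear(20, 1).to(rank), device_ids=[rank])
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)              # reshuffle shards every epoch
            for inputs, targets in loader:
                inputs, targets = inputs.to(rank), targets.to(rank)
                optimizer.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()                   # gradients synchronized here
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(world_size,), nprocs=world_size)

Launching one process per GPU (here via ``torch.multiprocessing.spawn``) is what lets DDP avoid the GIL contention mentioned in the comparison table in the diff.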

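The comparison table contrasts ``DataParallel``'s single-line wrap-up with DDP's per-process setup. A rough side-by-side, again assuming a CUDA machine and treating ``model`` as a placeholder module, might look like this::

    import torch.nn as nn

    model = nn.Linear(20, 1)          # placeholder module

    # DataParallel: a single extra line; one process, multithreaded
    # scatter/gather across the visible GPUs (subject to GIL contention).
    dp_model = nn.DataParallel(model.cuda())

    # DistributedDataParallel: one process per GPU, so a process group must be
    # initialized first (see the fuller sketch above); inside each process:
    #
    #   dist.init_process_group("nccl", rank=rank, world_size=world_size)
    #   ddp_model = DistributedDataParallel(model.to(rank), device_ids=[rank])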