beginner_source/ddp_series_intro.rst translation #892

Merged 5 commits on Oct 15, 2024
78 changes: 32 additions & 46 deletions beginner_source/ddp_series_intro.rst
@@ -1,56 +1,42 @@
**Introduction** \|\| `What is DDP <ddp_series_theory.html>`__ \|\|
`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
`minGPT Training <../intermediate/ddp_series_minGPT.html>`__
**소개** \|\| `DDP란 무엇인가 <ddp_series_theory.html>`__ \|\|
`단일 노드 다중-GPU 학습 <ddp_series_multigpu.html>`__ \|\|
`결함 내성 <ddp_series_fault_tolerance.html>`__ \|\|
`다중 노드 학습 <../intermediate/ddp_series_multinode.html>`__ \|\|
`minGPT 학습 <../intermediate/ddp_series_minGPT.html>`__

Distributed Data Parallel in PyTorch - Video Tutorials
======================================================
PyTorch의 분산 데이터 병렬 처리 - 비디오 튜토리얼
=====================================================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__
저자: `Suraj Subramanian <https://github.com/suraj813>`__
Member:

Please add the translator's information on the line below.

Contributor (Author):

Added it.

번역: `송호준 <https://github.com/hojunking>`_

Follow along with the video below or on `youtube <https://www.youtube.com/watch/-K3bZYHYHEA>`__.
아래 비디오를 보거나 `YouTube <https://www.youtube.com/watch/-K3bZYHYHEA>`__에서도 보실 수 있습니다.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/-K3bZYHYHEA" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

This series of video tutorials walks you through distributed training in
PyTorch via DDP.

The series starts with a simple non-distributed training job, and ends
with deploying a training job across several machines in a cluster.
Along the way, you will also learn about
`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ for
fault-tolerant distributed training.
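
If you want a quick preview of the pattern these tutorials build toward, the
following minimal sketch (not taken from the tutorial repository; the model and
script name are placeholders) shows what a DDP entry point launched with
``torchrun`` typically looks like.

.. code-block:: python

   # Minimal DDP sketch. A script like this is typically launched with, e.g.:
   #   torchrun --standalone --nproc_per_node=4 train.py
   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP


   def main():
       # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK in the environment,
       # so the default "env://" rendezvous works out of the box.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       model = torch.nn.Linear(10, 1).to(local_rank)     # placeholder model
       ddp_model = DDP(model, device_ids=[local_rank])   # wrap with DDP
       # ... regular training loop using ddp_model goes here ...

       dist.destroy_process_group()


   if __name__ == "__main__":
       main()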

The tutorial assumes a basic familiarity with model training in PyTorch.

Running the code
----------------

You will need multiple CUDA GPUs to run the tutorial code. Typically,
this can be done on a cloud instance with multiple GPUs (the tutorials
use an Amazon EC2 P3 instance with 4 GPUs).
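
Before starting, it can help to confirm that enough devices are visible to
PyTorch; a quick check along the following lines (not part of the tutorial
code) is sufficient.

.. code-block:: python

   import torch

   # Sanity check before running the multi-GPU examples; the later tutorials
   # in this series assume several CUDA devices (the examples use 4 GPUs).
   assert torch.cuda.is_available(), "CUDA is required for this series"
   print(f"Visible CUDA GPUs: {torch.cuda.device_count()}")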

The tutorial code is hosted in this
`github repo <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__.
Clone the repository and follow along!

Tutorial sections
-----------------

0. Introduction (this page)
1. `What is DDP? <ddp_series_theory.html>`__ Gently introduces what DDP is doing
under the hood
2. `Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ Training models
using multiple GPUs on a single machine
3. `Fault-tolerant distributed training <ddp_series_fault_tolerance.html>`__
Making your distributed training job robust with torchrun
4. `Multi-Node training <../intermediate/ddp_series_multinode.html>`__ Training models using
multiple GPUs on multiple machines
5. `Training a GPT model with DDP <../intermediate/ddp_series_minGPT.html>`__ “Real-world”
example of training a `minGPT <https://github.com/karpathy/minGPT>`__
model with DDP
이 비디오 튜토리얼 시리즈는 PyTorch에서 DDP(Distributed Data Parallel)를 사용한 분산 학습에 대해 안내합니다.

이 시리즈는 단순한 비분산 학습 작업에서 시작하여, 클러스터 내 여러 기기들(multiple machines)에서 학습 작업을 배포하는 것으로 마무리됩니다. 이 과정에서 `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__을 사용한 결함 내성(fault-tolerant) 분산 학습에 대해서도 배우게 될 예정입니다.

이 튜토리얼은 PyTorch에서 모델 학습에 대한 기본적인 이해를 전제로 하고 있습니다.

코드 실행
--------

튜토리얼 코드를 실행하려면 여러 개의 CUDA GPU가 필요합니다. 일반적으로 여러 GPU가 있는 클라우드 인스턴스에서 이를 수행할 수 있으며, 튜토리얼에서는 4개의 GPU가 탑재된 Amazon EC2 P3 인스턴스를 사용합니다.

튜토리얼 코드는 이 `GitHub 저장소 <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__에 올라와 있습니다. 저장소를 복제하고 함께 진행하세요!

튜토리얼 섹션
--------------

0. 소개 (이 페이지)
1. `DDP란 무엇인가? <ddp_series_theory.html>`__ DDP가 내부적으로 수행하는 작업에 대해 간단히 소개
2. `단일 노드 멀티-GPU 학습 <ddp_series_multigpu.html>`__ 한 기기에서 여러 GPU를 사용하여 모델을 학습하는 방법
3. `결함 내성 분산 학습 <ddp_series_fault_tolerance.html>`__ torchrun을 사용하여 분산 학습 작업을 견고하게 만드는 방법
4. `다중 노드 학습 <../intermediate/ddp_series_multinode.html>`__ 여러 기기에서 여러 GPU를 사용하여 모델을 학습하는 방법
5. `DDP를 사용한 GPT 모델 학습 <../intermediate/ddp_series_minGPT.html>`__ DDP를 사용한 `minGPT <https://github.com/karpathy/minGPT>`__ 모델 학습의 “실제 예시”