Introduction || What is DDP? || Single-Node Multi-GPU Training || Fault Tolerance || Multinode Training || minGPT Training

Multinode Training

Author: Suraj Subramanian, Translator: 박지은

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

      - Launching multinode training jobs with ``torchrun``
      - Code changes (and things to keep in mind) when moving from single-node to multinode training

      .. grid:: 1

         .. grid-item::

            :octicon:`code-square;1.0em;` View the code used in this tutorial on `GitHub <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py>`__

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

      - Familiarity with `multi-GPU training <../beginner/ddp_series_multigpu.html>`__ and `torchrun <../beginner/ddp_series_fault_tolerance.html>`__
      - 2 or more TCP-reachable GPU machines (this tutorial uses AWS p3.2xlarge instances)
      - `PyTorch <https://pytorch.org/get-started/locally/>`__ installed with CUDA on all machines

Follow along with the video below or on YouTube.

Multinode training means running a single training job across several machines. There are two ways to do this:

- Run a ``torchrun`` command with identical rendezvous arguments on each machine
- Deploy the job on a compute cluster using a workload manager such as SLURM
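
As a rough illustration of the first option, the sketch below builds the ``torchrun`` command line for each node of a two-node job. The rendezvous flags are the same on every node and only ``--node_rank`` differs. The helper name ``build_torchrun_command``, the endpoint ``10.0.0.1:29603``, the rendezvous id and the script arguments are placeholders chosen for this example, not values from the tutorial.

```python
import shlex


def build_torchrun_command(node_rank, nnodes, nproc_per_node, rdzv_endpoint,
                           rdzv_id="456", script="multinode.py",
                           script_args=("50", "10")):
    """Build the torchrun argv for one node (hypothetical helper).

    Every node receives the same rendezvous arguments; only node_rank changes.
    """
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        *script_args,  # placeholder arguments for the training script
    ]


# Assumed setup: 2 nodes, 1 GPU each, and a host:port that every node can reach.
commands = [
    build_torchrun_command(node_rank=rank, nnodes=2, nproc_per_node=1,
                           rdzv_endpoint="10.0.0.1:29603")
    for rank in range(2)
]

for command in commands:
    print(shlex.join(command))
```

Each printed line would be run on its own machine. Generating the commands from one function is just a convenient way to keep the rendezvous arguments from drifting apart between nodes.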

์ด ์˜์ƒ์—์„œ๋Š” ์‹ฑ๊ธ€๋…ธ๋“œ ๋‹ค์ค‘ GPU ๋กœ๋ถ€ํ„ฐ ๋ฉ€ํ‹ฐ๋…ธ๋“œ ํ•™์Šต์œผ๋กœ ์˜ฎ๊ธฐ๊ธฐ ์œ„ํ•œ (์ตœ์†Œํ•œ์˜) ์ฝ”๋“œ ๋ณ€๊ฒฝ์„ ๋‹ค๋ฃจ๊ณ , ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์˜ ํ•™์Šต ์Šคํฌ๋ฆฝํŠธ๋ฅผ ์‹คํ–‰ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Note that multinode training is bottlenecked by inter-node communication latency. A training job that uses 4 GPUs on a single node will generally be faster than the same job spread over 4 nodes with 1 GPU each.

Local and Global ranks

In the single-node setting, we tracked the ``gpu_id`` of each device running our training process. ``torchrun`` tracks this value in the environment variable ``LOCAL_RANK``, which uniquely identifies each GPU process on a node. For a unique identifier across all the nodes, ``torchrun`` provides another variable, ``RANK``, which refers to the global rank of a process.
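
A minimal sketch of reading these values inside a training script follows. ``LOCAL_RANK``, ``RANK`` and ``WORLD_SIZE`` are environment variables set by ``torchrun``. The ``RankInfo`` container, the ``read_rank_info`` helper and the fallback defaults are assumptions added here so that the snippet also runs outside of ``torchrun``.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    local_rank: int   # index of this process on its own node
    global_rank: int  # index of this process across all nodes
    world_size: int   # total number of processes in the job


def read_rank_info(env=os.environ):
    """Read the rank variables exported by torchrun (hypothetical helper).

    The defaults are an assumption for running the sketch without torchrun.
    """
    return RankInfo(
        local_rank=int(env.get("LOCAL_RANK", 0)),
        global_rank=int(env.get("RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


info = read_rank_info()
print(f"local rank {info.local_rank}, "
      f"global rank {info.global_rank}, "
      f"world size {info.world_size}")
```

In a training script, the local rank is typically the value used to select the CUDA device on the node, while the global rank identifies the process within the whole job.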

.. ์ฃผ์˜์‚ฌํ•ญ::
   ํ•™์Šต ์‹œ ์ค‘์š”ํ•œ ๋กœ์ง์— ``์ˆœ์œ„`` ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ๋งˆ์‹ญ์‹œ์˜ค. ``torchrun``์˜ ์‹คํŒจ ํ˜น์€ ๋ฉค๋ฒ„์‹ญ์˜ ๋ณ€๊ฒฝ์œผ๋กœ ์ธํ•ด ์žฌ์‹œ์ž‘๋˜๋ฉด ํ•ด๋‹น ํ”„๋กœ์„ธ์Šค์—์„œ
   ๊ฐ™์€ ``๋กœ์ปฌ ์ˆœ์œ„`` ์™€ ``์ˆœ์œ„`` ๊ฐ€ ์œ ์ง€๋œ๋‹ค๋Š” ๋ณด์žฅ์ด ์—†์Šต๋‹ˆ๋‹ค.

Heterogeneous Scaling

Torchrun supports heterogeneous scaling, meaning that each machine in a multinode job can contribute a different number of GPUs to training. In the video, the code is deployed on 2 machines, one using 4 GPUs and the other using 2 GPUs.
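
To make the 4 + 2 GPU setup concrete, the sketch below enumerates one possible rank layout. It assumes global ranks are numbered contiguously in node order, which is an assumption made for illustration only. The actual assignment is decided by ``torchrun`` at rendezvous time, and ``enumerate_ranks`` is a hypothetical helper.

```python
def enumerate_ranks(gpus_per_node):
    """Return (node, local_rank, global_rank) triples.

    Assumes contiguous global numbering in node order (illustrative only).
    """
    triples = []
    global_rank = 0
    for node, n_gpus in enumerate(gpus_per_node):
        for local_rank in range(n_gpus):
            triples.append((node, local_rank, global_rank))
            global_rank += 1
    return triples


# One machine with 4 GPUs and one with 2 GPUs, as in the video.
layout = enumerate_ranks([4, 2])
world_size = len(layout)

print(f"world size: {world_size}")
for node, local_rank, global_rank in layout:
    print(f"node {node}: LOCAL_RANK={local_rank} RANK={global_rank}")
```

Local ranks restart at 0 on every node, while the world size is the sum of the GPUs across nodes (6 in this case).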

Troubleshooting

- Make sure that your nodes can communicate with each other over TCP.
- Set the environment variable ``NCCL_DEBUG`` to ``INFO`` (using ``export NCCL_DEBUG=INFO``) to print verbose logs that can help diagnose the issue.
- You may need to set the network interface for the distributed backend explicitly (``export NCCL_SOCKET_IFNAME=eth0``). See the PyTorch distributed documentation for details.
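
The first two checks can be sketched in Python as follows. ``can_reach`` is a hypothetical helper that only tests whether a TCP connection can be opened. For the sake of a self-contained example it is exercised against a temporary listener on the local machine. In practice you would point it at the rendezvous host and port of another node.

```python
import os
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ask NCCL for verbose logs; set this before NCCL is initialised.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Self-contained demo: listen on a free local port and connect to it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    reachable = can_reach("127.0.0.1", port)

print(f"reachable: {reachable}")
```

A ``False`` result from one node to another usually points to a firewall or security-group rule rather than to the training code.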

์ฝ์„๊ฑฐ๋ฆฌ