Commit 7033d24

yaml

zzhhjjj committed Apr 30, 2024
1 parent 8f01f82 commit 7033d24

Showing 5 changed files with 7 additions and 6 deletions.
1 change: 1 addition & 0 deletions .github/workflows/3d_parallelism_unit_tests.yaml
@@ -59,6 +59,7 @@ jobs:
 --durations=0 \
 --ignore tests/kernels \
 --ignore tests/fp8 \
+--ignore tests/test_train_llama.py \
 --verbose \
 tests/
 # NOTE: T4 can't run FA2, DoReMi's LLaMa needs FA2
2 changes: 1 addition & 1 deletion .github/workflows/llama_tests.yaml
@@ -19,7 +19,7 @@ on:
 jobs:
   tests:
     # NOTE: 8-a10 to run LLama
-    runs-on: [multi-gpu, nvidia-gpu, 8-a10, ci]
+    runs-on: [multi-gpu, nvidia-gpu, 4-a10, ci]
     container:
       image: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
       ports:
4 changes: 2 additions & 2 deletions examples/config_train_llama.py
@@ -73,7 +73,7 @@
 )

 parallelism = ParallelismArgs(
-    dp=4,
+    dp=2,
     pp=1,
     tp=2,
     pp_engine="1f1b",
@@ -82,7 +82,7 @@
 )

 # a global batch-size of 1M tokens. micro_batch_size * dp * sequence_length * batch_accumulation_per_replica
-tokens = TokensArgs(sequence_length=512, train_steps=200, micro_batch_size=128, batch_accumulation_per_replica=4)
+tokens = TokensArgs(sequence_length=512, train_steps=200, micro_batch_size=128, batch_accumulation_per_replica=8)

 checkpoints_path = os.path.dirname(os.path.dirname(__file__)) + "/checkpoints"
 os.makedirs(checkpoints_path, exist_ok=True)
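
For reference, the global batch-size comment above can be checked against the new values. A minimal sketch (the variable names simply mirror the config fields; this snippet is illustrative and not part of the commit):

# Global batch size per the comment:
# micro_batch_size * dp * sequence_length * batch_accumulation_per_replica
micro_batch_size = 128
dp = 2
sequence_length = 512
batch_accumulation_per_replica = 8
global_batch_tokens = micro_batch_size * dp * sequence_length * batch_accumulation_per_replica
print(global_batch_tokens)  # 1048576 (~1M tokens), same as the previous dp=4 / accumulation=4 setting
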
4 changes: 2 additions & 2 deletions examples/config_train_llama.yaml
@@ -75,7 +75,7 @@ optimizer:
   weight_decay: 0.01
   zero_stage: 0
 parallelism:
-  dp: 4
+  dp: 2
   expert_parallel_size: 1
   pp: 1
   pp_engine: 1f1b
@@ -88,7 +88,7 @@ tokenizer:
   tokenizer_name_or_path: gpt2
   tokenizer_revision: null
 tokens:
-  batch_accumulation_per_replica: 4
+  batch_accumulation_per_replica: 8
   limit_test_batches: 0
   limit_val_batches: 0
   micro_batch_size: 128
2 changes: 1 addition & 1 deletion tests/test_train_llama.py
@@ -14,7 +14,7 @@
 CONFIG_FILE = "examples/config_train_llama.yaml"
 CREATE_CONFIG_FILE = "examples/config_train_llama.py"
 TRAIN_SCRIPT = "run_train.py"
-NUM_GPUS = 8
+NUM_GPUS = 4

 ## 100+ steps: lm_loss < 3.5
 ## 200 steps: lm_loss < 3
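
As a sanity check on the GPU-count changes (the 4-a10 runner and NUM_GPUS = 4), the world size implied by the updated parallelism settings can be worked out. A minimal sketch, assuming the usual world_size = dp * tp * pp decomposition (not part of the commit):

# GPUs required by the updated config: dp=2, tp=2, pp=1
dp, tp, pp = 2, 2, 1
world_size = dp * tp * pp
print(world_size)  # 4 -> matches NUM_GPUS = 4; the old dp=4, tp=2, pp=1 needed 8 GPUs (the 8-a10 runner)
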
