Can't run Llama 8B with higher sequence length and batch size for Multi GPU finetuning on Gaudi 3 #1687

Open
ajscalers opened this issue Jan 9, 2025 · 1 comment
Labels
bug Something isn't working

Comments

ajscalers commented Jan 9, 2025

System Info

Optimum Habana version: v1.15.0
Synapse AI version: 1.19.0-2427ed8
Gaudi pytorch container version: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run the Gaudi pytorch container: docker run -it --runtime habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --entrypoint /bin/bash vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
  2. Install the required dependencies for the question-answering examples directory (https://github.com/huggingface/optimum-habana/tree/main/examples/question-answering); a sketch of this step is given at the end of this section.
  3. Run the example, but change the model to Llama 3.1 8B, increase the batch size and max sequence length:
python ../gaudi_spawn.py \
  --world_size 8 --use_deepspeed run_qa.py \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --output_dir /tmp/squad_output/ \
  --use_habana \
  --use_lazy_mode \
  --use_hpu_graphs_for_inference \
  --throughput_warmup_steps 3 \
  --max_train_samples 45080 \
  --deepspeed ../../tests/configs/deepspeed_zero_2.json \
  --sdp_on_bf16

This throws an out of memory error on Gaudi 3 with 8 GPUs.
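
For step 2, a minimal sketch of the dependency setup inside the container (the clone location and the HabanaAI DeepSpeed branch are assumptions; adjust them to your environment):

git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana && pip install .
# example-specific requirements for the question-answering scripts
pip install -r examples/question-answering/requirements.txt
# DeepSpeed on HPU comes from the Habana fork; the branch is assumed to match the Synapse release (1.19.0 here)
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
cd examples/question-answering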

Expected behavior

This should run to completion, as an 8B model with a 512 sequence length and a batch size of >= 64 can run on other GPUs with similar per-GPU memory.

ajscalers added the bug label on Jan 9, 2025

yafshar (Contributor) commented Jan 13, 2025

@ajscalers, have you tried running this example on other GPUs? Many libraries turn on options like gradient checkpointing by default, which can help reduce memory usage. This is not a bug; for this example, please enable that option explicitly.

Below are the results on Gaudi2 (Gaudi3 has 128 GB of HBM2e memory, which is 33% more than the 96 GB available in Gaudi2):

>>> python ../gaudi_spawn.py \
  --world_size 8 --use_deepspeed run_qa.py \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --output_dir /tmp/squad_output/ \
  --use_habana \
  --use_lazy_mode \
  --use_hpu_graphs_for_inference \
  --throughput_warmup_steps 3 \
  --max_train_samples 45080 \
  --deepspeed ../../tests/configs/deepspeed_zero_2.json \
  --sdp_on_bf16 \
  --gradient_checkpointing

Training completed. Do not forget to share your model on huggingface.co/models =)

...

***** train metrics *****
  epoch                       =          2.0
  max_memory_allocated (GB)   =         94.2
  memory_allocated (GB)       =        29.96
  total_flos                  = 1804539232GF
  total_memory_available (GB) =        94.62

...

***** eval metrics *****
  epoch                           =        2.0
  eval_exact_match                =     0.0757
  eval_f1                         =     0.6021
  eval_graph_compliation_duration =     4.9983

...
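
To watch per-card HBM usage yourself during a run, hl-smi (shipped in the Gaudi container, analogous to nvidia-smi) can be polled from a second shell, assuming watch is available; the exact columns depend on the driver version:

watch -n 5 hl-smi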

If this resolves your question, please close the issue and remove the bug label. Thank you!
