Can't run Llama 8B with higher sequence length and batch size for Multi GPU finetuning on Gaudi 3 #1687

Open
ajscalers opened this issue Jan 9, 2025 · 1 comment
Labels
bug Something isn't working

Comments

ajscalers commented Jan 9, 2025

System Info

Optimum Habana version: v1.15.0
Synapse AI version: 1.19.0-2427ed8
Gaudi pytorch container version: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run the Gaudi pytorch container: docker run -it --runtime habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --entrypoint /bin/bash vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
  2. Install the required dependencies for the question-answering examples directory (https://github.com/huggingface/optimum-habana/tree/main/examples/question-answering); a sketch of this step is given at the end of this section.
  3. Run the example, but change the model to Llama 3.1 8B, increase the batch size and max sequence length:
python ../gaudi_spawn.py \
  --world_size 8 --use_deepspeed run_qa.py \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --output_dir /tmp/squad_output/ \
  --use_habana \
  --use_lazy_mode \
  --use_hpu_graphs_for_inference \
  --throughput_warmup_steps 3 \
  --max_train_samples 45080 \
  --deepspeed ../../tests/configs/deepspeed_zero_2.json \
  --sdp_on_bf16

This throws an out of memory error on Gaudi 3 with 8 GPUs.
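
For step 2, a minimal sketch of the dependency setup inside the container (the clone location and the HabanaAI DeepSpeed branch are assumptions; adjust them to your environment):

git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana && pip install .
# example-specific requirements for the question-answering scripts
pip install -r examples/question-answering/requirements.txt
# DeepSpeed on HPU comes from the Habana fork; the branch is assumed to match the Synapse release (1.19.0 here)
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
cd examples/question-answering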

Expected behavior

This should run to completion, as an 8B model with a 512 sequence length and a batch size of >= 64 can run on other GPUs with similar per-GPU memory.

ajscalers added the bug label on Jan 9, 2025

yafshar (Contributor) commented Jan 13, 2025

@ajscalers, have you tried running this example on other GPUs? Many libraries turn on options like gradient checkpointing by default, which can help reduce memory usage. This is not a bug; for this example, please enable that option explicitly.

Below are the results on Gaudi2 (Gaudi3 has 128 GB of HBM2e memory, which is 33% more than the 96 GB available in Gaudi2):

>>> python ../gaudi_spawn.py \
  --world_size 8 --use_deepspeed run_qa.py \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --output_dir /tmp/squad_output/ \
  --use_habana \
  --use_lazy_mode \
  --use_hpu_graphs_for_inference \
  --throughput_warmup_steps 3 \
  --max_train_samples 45080 \
  --deepspeed ../../tests/configs/deepspeed_zero_2.json \
  --sdp_on_bf16 \
  --gradient_checkpointing

Training completed. Do not forget to share your model on huggingface.co/models =)

...

***** train metrics *****
  epoch                       =          2.0
  max_memory_allocated (GB)   =         94.2
  memory_allocated (GB)       =        29.96
  total_flos                  = 1804539232GF
  total_memory_available (GB) =        94.62

...

***** eval metrics *****
  epoch                           =        2.0
  eval_exact_match                =     0.0757
  eval_f1                         =     0.6021
  eval_graph_compliation_duration =     4.9983

...
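
To watch per-card HBM usage yourself during a run, hl-smi (shipped in the Gaudi container, analogous to nvidia-smi) can be polled from a second shell, assuming watch is available; the exact columns depend on the driver version:

watch -n 5 hl-smi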

If this resolves your question, please close the issue and remove the bug label. Thank you!
