Questions on large model inference / finetuning #3353
Here is my accelerate environment --
#1890 -- I think this feature request is highly relevant to my question.
I guess what you are trying to do is somewhat similar to training the model and, at some point during training, evaluating it.
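If that is the goal, a loop along these lines might work (just a sketch; `train_dataloader`, `eval_dataloader`, and `eval_every` are placeholders, and the model, optimizer, and dataloaders are assumed to have already gone through `accelerator.prepare`):

```python
# Rough sketch of evaluating periodically during training; all names here
# are placeholders, and batches are assumed to contain labels so that the
# model returns a loss.
model.train()
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()

    if step % eval_every == 0:
        model.eval()
        with torch.no_grad():
            for eval_batch in eval_dataloader:
                eval_loss = model(**eval_batch).loss
        model.train()
```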
Hi @muellerzr!
I am trying to run the Llama 8B model on A40 GPUs using accelerate. I want to first evaluate the model, then add a few trainable parameters and train only those. Since the Llama 8B checkpoint cannot fit on a single A40, I am using an FSDP configuration. (Is that the correct choice?)
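Roughly, the setup I am aiming for looks like this (a sketch only; the checkpoint name and the extra trainable layer are placeholders, not my actual code):

```python
# Sketch of the intended setup: freeze the base model, add a few trainable
# parameters, and let accelerate shard everything with FSDP.
# The checkpoint name and `extra_head` are placeholders for illustration.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # launched with an FSDP config via `accelerate launch`

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)

# Freeze the base model and add a small trainable module.
for p in model.parameters():
    p.requires_grad = False
model.extra_head = torch.nn.Linear(model.config.hidden_size, model.config.hidden_size)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

# prepare() wraps the model in FSDP and shards it across the available GPUs.
# Note: mixing frozen and trainable parameters under FSDP may require
# `use_orig_params: true` in the FSDP config.
model, optimizer = accelerator.prepare(model, optimizer)
```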
When I run `accelerate launch`, the code enters the following method from utils/fsdp_utils.py:
```python
def load_fsdp_model(fsdp_plugin, accelerator, model, input_dir, model_index=0, adapter_only=False):
```
which then raises the following error:
I went through the documentation -- https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference
as well as https://huggingface.co/docs/accelerate/en/usage_guides/fsdp -- am I missing something here? Any help, documentation, or tutorial on how to run/finetune/train large models where a single GPU's memory is not sufficient, using some sort of model sharding with accelerate, would be really helpful!
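For the evaluation-only part, I also considered big-model loading with `device_map="auto"` instead of FSDP -- a sketch of what I mean (the checkpoint name is a placeholder, and I am not sure this is the recommended approach):

```python
# Sketch of evaluation with automatic layer placement; the checkpoint name
# is a placeholder for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # accelerate splits layers across the available GPUs
)

inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```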
Thanks,
Kalyani