Initializing model with empty weights causes OOM with offloading to disk #3374

Open · 2 of 4 tasks

Aiden-Frost opened this issue Feb 1, 2025 · 0 comments

System Info (note: the default config shown below is overridden at runtime by the flags passed to `accelerate launch` under Reproduction; an earlier version of this report copied the wrong config)

- `Accelerate` version: 1.3.0
- Platform: Linux-5.14.0-284.86.1.el9_2.x86_64-x86_64-with-glibc2.35
- `accelerate` bash location: /ext3/miniforge3/bin/accelerate
- Python version: 3.12.7
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.5.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 377.07 GB
- GPU type: Tesla V100-PCIE-32GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am trying to run DeepSeek R1 in bf16 on 4 A100 GPUs, with offloading to CPU (400 GB of RAM) and disk (5 TB). I am running this job through the Slurm workload manager.
This is how I am loading the model:

    import json
    import os

    import torch
    from accelerate import Accelerator, init_empty_weights, load_checkpoint_and_dispatch
    from model import ModelArgs, Transformer  # model code from the DeepSeek-V3 inference repo

    accelerator = Accelerator()
    torch.set_default_dtype(torch.bfloat16)
    torch.manual_seed(42)

    # config and ckpt_path come from the command-line arguments shown below
    with open(config) as f:
        args = ModelArgs(**json.load(f))

    # Build the model on the meta device so no real memory is allocated for the weights
    with init_empty_weights():
        model = Transformer(args)

    # Load the checkpoint and dispatch weights across the GPUs, CPU RAM, and disk
    model = load_checkpoint_and_dispatch(
        model,
        os.path.join(ckpt_path, "model0-mp1.safetensors"),  # merged with convert.py from DeepSeek-V3 (all 163 safetensors combined into one)
        device_map="auto",
        offload_folder="/scratch/rr4549/offload",
        offload_buffers=True,
        offload_state_dict=True,
        max_memory={0: "70GB", 1: "70GB", 2: "70GB", 3: "70GB", "cpu": "300GB"},
        dtype=torch.bfloat16,
    )

Running the script with:
accelerate launch --multi_gpu --num_processes 4 --num_machines 1 generate2.py --ckpt-path /scratch/rr4549/DeepSeek-R1-Demo/ --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 1
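
For reference, the failing step described below reduces to the empty-weight initialization alone. The following is my own minimal sketch (not part of generate2.py), assuming ModelArgs and Transformer come from the DeepSeek-V3 inference model.py; building the model under init_empty_weights places every parameter on the meta device, so this step by itself should not need real memory:

    # Minimal sketch (my addition, not from the original script): only the
    # empty-weight initialization step, run in a single process.
    # Assumes ModelArgs/Transformer come from the DeepSeek-V3 inference model.py.
    import json

    import torch
    from accelerate import init_empty_weights
    from model import ModelArgs, Transformer

    torch.set_default_dtype(torch.bfloat16)

    with open("configs/config_671B.json") as f:
        args = ModelArgs(**json.load(f))

    with init_empty_weights():
        model = Transformer(args)  # parameters are created on the meta device

    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params / 1e9:.1f}B parameters, "
          f"all on meta: {all(p.device.type == 'meta' for p in model.parameters())}")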

The script crashes when trying to initialize the model with empty weights.
Error:

W0131 17:51:41.347000 3004393 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3004406 closing signal SIGTERM
W0131 17:51:41.352000 3004393 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3004407 closing signal SIGTERM
W0131 17:51:41.354000 3004393 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3004408 closing signal SIGTERM
E0131 17:51:42.033000 3004393 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 0 (pid: 3004405) of binary: /ext3/miniforge3/bin/python3.12
Traceback (most recent call last):
  File "/ext3/miniforge3/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/ext3/miniforge3/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/ext3/miniforge3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
    multi_gpu_launcher(args)
  File "/ext3/miniforge3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
    distrib_run.run(args)
  File "/ext3/miniforge3/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/ext3/miniforge3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ext3/miniforge3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
generate2.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-31_17:51:41
  host      : ***
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 3004405)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3004405
========================================================
slurmstepd: error: Detected 1 oom_kill event in StepId=56705791.batch. Some of the step tasks have been OOM Killed.

Expected behavior

The model weights should be offloaded to the CPU and disk, and model inference should run.
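
Concretely, after a successful dispatch I would expect a forward pass along these lines to work (illustrative sketch only; the exact call signature is the one used by the DeepSeek-V3 generation code, which I am only approximating here):

    # Illustrative only: a single forward pass on the dispatched model.
    # The real generate2.py uses the DeepSeek-V3 generation loop; the forward
    # call below is an approximation of that interface.
    import torch

    model.eval()
    tokens = torch.tensor([[0]], dtype=torch.long, device="cuda:0")  # dummy prompt ids
    with torch.no_grad():
        logits = model(tokens)
    print(logits.shape, logits.dtype)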
