
SmoothQuant failing on multi-gpus #1081

Closed
anmarques opened this issue Jan 18, 2025 · 3 comments · Fixed by #1114
Labels: bug

@anmarques (Collaborator)

Describe the bug
I get a CUDA error when running SmoothQuant on multiple GPUs. I tried different CUDA versions without success. The error seems to come from a synchronization failure during the torch.cuda.empty_cache() call that is made after every forward pass in the calibration step. Moving the empty_cache() call to after all batches are processed seems to fix the issue (see branch fix/smoothquant_multigpu); a rough sketch of that workaround is shown below.
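For illustration only, a simplified sketch of the workaround, assuming a stripped-down calibration loop (the real `run_calibration_forward` in `llmcompressor/modifiers/utils/pytorch_helpers.py` takes more arguments); the only point here is where `torch.cuda.empty_cache()` is called:

```python
# Simplified sketch of the workaround (not the actual run_calibration_forward):
# call torch.cuda.empty_cache() once after all batches instead of once per batch.
import torch

def run_calibration_forward(model, dataloader):
    with torch.no_grad():
        for batch in dataloader:
            model(**batch)
            # torch.cuda.empty_cache()  # per-batch call where the CUDA error surfaced
    torch.cuda.empty_cache()  # workaround: empty the cache once, after all batches
```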

Expected behavior
Run SmoothQuant on multiple GPUs without errors.

Environment
Include all relevant environment information:

  1. OS: Ubuntu 20.04:
  2. Python version: 3.10.12
  3. LLM Compressor version or commit hash: b175943
  4. ML framework version(s): torch 2.5.1
  5. Other Python package versions:
  6. Other relevant environment information:
  • CUDA driver: 12.5
  • GPU driver: 555.42.02
  • Python CUDA libraries:
    • nvidia-cublas-cu12==12.4.5.8
    • nvidia-cuda-cupti-cu12==12.4.127
    • nvidia-cuda-nvrtc-cu12==12.4.127
    • nvidia-cuda-runtime-cu12==12.4.127
    • nvidia-cudnn-cu12==9.1.0.70
    • nvidia-cufft-cu12==11.2.1.3
    • nvidia-curand-cu12==10.3.5.147
    • nvidia-cusolver-cu12==11.6.1.9
    • nvidia-cusparse-cu12==12.3.1.170
    • nvidia-nccl-cu12==2.21.5
    • nvidia-nvjitlink-cu12==12.4.127
    • nvidia-nvtx-cu12==12.4.127

To Reproduce
Exact steps to reproduce the behavior:
Run oneshot on Llama-3.3-70B-Instruct with 1024 samples of the LLM_compression_calibration dataset, with sequence length limited to 8192, on 4 A100 GPUs, using the following recipe (a minimal call sketch follows the recipe):

```yaml
quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.0
      mappings:
        - [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"]
        - [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
        - [["re:.*down_proj"], "re:.*up_proj"]
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.0
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "channel"
            observer: "mse"
          input_activations:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "token"
            dynamic: true
            observer: "memoryless"

Errors
```
Traceback (most recent call last):
File "/root/.clearml/venvs-builds/3.10/code/queue_llmcompressor_oneshot.py", line 267, in <module>
oneshot(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 84, in oneshot
main(model_args, data_args, training_args)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 413, in main
stage_runner.one_shot()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/runner.py", line 163, in one_shot
self.trainer.one_shot(calibration_data=calib_data, stage=stage)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 440, in one_shot
apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
return active_session().apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 212, in apply
self.initialize(**kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 158, in initialize
mod_data = self._lifecycle.initialize(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
data = mod.initialize(state=self.state, **extras)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
modifier.initialize(state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
initialized = self.on_initialize(state=state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 136, in on_initialize
self._calibrate(state.model, calibration_dataloader)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 253, in _calibrate
run_calibration_forward(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py", line 107, in run_calibration_forward
torch.cuda.empty_cache()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 192, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Additional context
Attached full log file

task_1e407209f55046d399d387a0fd1858bb.log


anmarques added the bug label on Jan 18, 2025
@dsikka (Collaborator) commented on Jan 19, 2025

@rahul-tuli Please take a look

@kylesayrs (Collaborator)

I consistently encounter this issue on a single GPU, specifically with SmoothQuant. From my research, there's no clear explanation as to why this happens. I've opened a PR with more details and a fix.

@kylesayrs kylesayrs self-assigned this Jan 29, 2025
@dsikka dsikka closed this as completed Jan 31, 2025
@kylesayrs (Collaborator)

Fixed by #1114

dsikka added a commit that referenced this issue Feb 8, 2025
…context` (#1114)

## Purpose ##
* Fixes #1081
* Fixes #963
* There's really no explanation online as to why the `torch.cuda.empty_cache()` kernel sometimes fails to launch. Given that `empty_cache` does not actually free memory that wouldn't have already been freed by the python garbage collector + [pytorch caching allocator](https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html), it should be safe to remove this call (a small check illustrating this point is sketched after this list).
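A small, illustrative check of the caching-allocator point above (not part of the PR): a freed tensor's memory returns to PyTorch's cache rather than to the driver, and a later allocation of the same size reuses it, so `empty_cache()` is not needed for correctness:

```python
# Illustrative: deleting a tensor lowers memory_allocated but leaves
# memory_reserved (the allocator's cache) intact; a new allocation of the same
# size reuses the cached block without any call to empty_cache().
import torch

x = torch.empty(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

del x  # freed by Python's GC; the block goes back to the caching allocator
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

y = torch.empty(1024, 1024, device="cuda")  # reuses the cached block
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```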

## Changes ##
* Remove `torch.cuda.empty_cache()` in `run_calibration_forward`, which only affects smoothquant and quantization modifier (sparsegpt and wanda will soon use sequential pipelines instead)
* Use `calibration_forward_context` in smoothquant and quantization modifier (roughly sketched after this list)
* Remove use of `torch.cuda.empty_cache()` by the smoothquant modifier
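For context, a hedged sketch of what a calibration-forward context of this kind typically wraps; the names below are illustrative, not the actual `calibration_forward_context` implementation:

```python
# Illustrative context manager (not llmcompressor's implementation): run
# calibration forwards in eval mode, without gradients, and with the HF KV
# cache disabled so no extra state accumulates across batches.
import contextlib
import torch

@contextlib.contextmanager
def calibration_context(model):
    was_training = model.training
    prior_use_cache = getattr(model.config, "use_cache", None)
    model.eval()
    if prior_use_cache is not None:
        model.config.use_cache = False  # avoid growing the KV cache during calibration
    try:
        with torch.no_grad():
            yield model
    finally:
        if was_training:
            model.train()
        if prior_use_cache is not None:
            model.config.use_cache = prior_use_cache
```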

## Testing ##
* Performed memory analysis with and without `torch.cuda.empty_cache` and `calibration_forward_context` independently (one way to track this per batch is sketched below)
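A rough sketch of one way to track per-batch memory for this kind of comparison (illustrative; not the script used to produce the plots below):

```python
# Illustrative helper: sample allocated/reserved CUDA memory after each
# calibration batch to compare runs with and without empty_cache() or a
# calibration context.
import torch

def track_memory(model, dataloader):
    samples = []
    with torch.no_grad():
        for batch in dataloader:
            model(**batch)
            samples.append(
                (torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
            )
    return samples
```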

### Smooth Quant ###

![20c0e104-2353-4a09-9556-f953075205d2](https://github.com/user-attachments/assets/a6727da5-8350-449b-82b6-eff8f6d3d592)

### Quantization Modifier ###

![0a0451e2-108e-40fb-be5c-e9619928ab67](https://github.com/user-attachments/assets/325c2124-734f-40eb-ac3b-77debf45389e)

It was also found that removing the `empty_cache` calls in between each operation reduced the runtime of Quantization Modifier on llama3-8B by 78%.

Before
```
512/512 [03:18<00:00,  2.58it/s]
Duration: 199.38174653053284
```

After
```
512/512 [00:42<00:00, 11.91it/s]
Duration: 44.374401807785034
```

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>