
SmoothQuant failing on multi-gpus #1081

Closed
anmarques opened this issue Jan 18, 2025 · 3 comments · Fixed by #1114
Labels: bug

@anmarques (Collaborator)

Describe the bug
I get a CUDA error when running SmoothQuant on multiple GPUs. I tried different CUDA versions without success. The error seems to come from a synchronization failure during the torch.cuda.empty_cache() call that is made after every forward pass in the calibration step. Moving the empty_cache() call to after all batches are processed seems to fix the issue (see branch fix/smoothquant_multigpu); a rough sketch of that workaround is shown below.
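For illustration only, a simplified sketch of the workaround, assuming a stripped-down calibration loop (the real `run_calibration_forward` in `llmcompressor/modifiers/utils/pytorch_helpers.py` takes more arguments); the only point here is where `torch.cuda.empty_cache()` is called:

```python
# Simplified sketch of the workaround (not the actual run_calibration_forward):
# call torch.cuda.empty_cache() once after all batches instead of once per batch.
import torch

def run_calibration_forward(model, dataloader):
    with torch.no_grad():
        for batch in dataloader:
            model(**batch)
            # torch.cuda.empty_cache()  # per-batch call where the CUDA error surfaced
    torch.cuda.empty_cache()  # workaround: empty the cache once, after all batches
```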

Expected behavior
Run SmoothQuant on multiple GPUs without errors.

Environment
Include all relevant environment information:

  1. OS: Ubuntu 20.04:
  2. Python version: 3.10.12
  3. LLM Compressor version or commit hash: b175943
  4. ML framework version(s): torch 2.5.1
  5. Other Python package versions:
  6. Other relevant environment information:
  • CUDA driver: 12.5
  • GPU driver: 555.42.02
  • Python CUDA libraries:
    • nvidia-cublas-cu12==12.4.5.8
    • nvidia-cuda-cupti-cu12==12.4.127
    • nvidia-cuda-nvrtc-cu12==12.4.127
    • nvidia-cuda-runtime-cu12==12.4.127
    • nvidia-cudnn-cu12==9.1.0.70
    • nvidia-cufft-cu12==11.2.1.3
    • nvidia-curand-cu12==10.3.5.147
    • nvidia-cusolver-cu12==11.6.1.9
    • nvidia-cusparse-cu12==12.3.1.170
    • nvidia-nccl-cu12==2.21.5
    • nvidia-nvjitlink-cu12==12.4.127
    • nvidia-nvtx-cu12==12.4.127

To Reproduce
Exact steps to reproduce the behavior:
Run oneshot on Llama-3.3-70B-Instruct with 1024 samples of the LLM_compression_calibration dataset, with sequence length limited to 8192, on 4 A100 GPUs, using the following recipe (a minimal call sketch follows the recipe):

```yaml
quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.0
      mappings:
        - [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"]
        - [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
        - [["re:.*down_proj"], "re:.*up_proj"]
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.0
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "channel"
            observer: "mse"
          input_activations:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "token"
            dynamic: true
            observer: "memoryless"

Errors
```
Traceback (most recent call last):
File "/root/.clearml/venvs-builds/3.10/code/queue_llmcompressor_oneshot.py", line 267, in <module>
oneshot(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 84, in oneshot
main(model_args, data_args, training_args)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 413, in main
stage_runner.one_shot()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/runner.py", line 163, in one_shot
self.trainer.one_shot(calibration_data=calib_data, stage=stage)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 440, in one_shot
apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
return active_session().apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 212, in apply
self.initialize(**kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 158, in initialize
mod_data = self._lifecycle.initialize(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
data = mod.initialize(state=self.state, **extras)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
modifier.initialize(state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
initialized = self.on_initialize(state=state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 136, in on_initialize
self._calibrate(state.model, calibration_dataloader)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 253, in _calibrate
run_calibration_forward(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py", line 107, in run_calibration_forward
torch.cuda.empty_cache()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 192, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Additional context
Attached full log file

task_1e407209f55046d399d387a0fd1858bb.log


anmarques added the bug label on Jan 18, 2025
@dsikka (Collaborator) commented on Jan 19, 2025

@rahul-tuli Please take a look

@kylesayrs (Collaborator)

I consistently encounter this issue on a single GPU, specifically with SmoothQuant. From my research, there's no clear explanation as to why this happens. I've opened a PR with more details and a fix.

@kylesayrs kylesayrs self-assigned this Jan 29, 2025
@dsikka dsikka closed this as completed Jan 31, 2025
@kylesayrs (Collaborator)

Fixed by #1114

dsikka added a commit that referenced this issue Feb 8, 2025
…context` (#1114)

## Purpose ##
* Fixes #1081
* Fixes #963
* There's really no explanation online as to why the `torch.cuda.empty_cache()` kernel sometimes fails to launch. Given that `empty_cache` does not actually free memory that wouldn't have already been freed by the python garbage collector + [pytorch caching allocator](https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html), it should be safe to remove this call (a small check illustrating this point is sketched after this list).
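A small, illustrative check of the caching-allocator point above (not part of the PR): a freed tensor's memory returns to PyTorch's cache rather than to the driver, and a later allocation of the same size reuses it, so `empty_cache()` is not needed for correctness:

```python
# Illustrative: deleting a tensor lowers memory_allocated but leaves
# memory_reserved (the allocator's cache) intact; a new allocation of the same
# size reuses the cached block without any call to empty_cache().
import torch

x = torch.empty(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

del x  # freed by Python's GC; the block goes back to the caching allocator
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

y = torch.empty(1024, 1024, device="cuda")  # reuses the cached block
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```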

## Changes ##
* Remove `torch.cuda.empty_cache()` in `run_calibration_forward`, which only affects smoothquant and quantization modifier (sparsegpt and wanda will soon use sequential pipelines instead)
* Use `calibration_forward_context` in smoothquant and quantization modifier (roughly sketched after this list)
* Remove use of `torch.cuda.empty_cache()` by the smoothquant modifier
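For context, a hedged sketch of what a calibration-forward context of this kind typically wraps; the names below are illustrative, not the actual `calibration_forward_context` implementation:

```python
# Illustrative context manager (not llmcompressor's implementation): run
# calibration forwards in eval mode, without gradients, and with the HF KV
# cache disabled so no extra state accumulates across batches.
import contextlib
import torch

@contextlib.contextmanager
def calibration_context(model):
    was_training = model.training
    prior_use_cache = getattr(model.config, "use_cache", None)
    model.eval()
    if prior_use_cache is not None:
        model.config.use_cache = False  # avoid growing the KV cache during calibration
    try:
        with torch.no_grad():
            yield model
    finally:
        if was_training:
            model.train()
        if prior_use_cache is not None:
            model.config.use_cache = prior_use_cache
```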

## Testing ##
* Performed memory analysis with and without `torch.cuda.empty_cache` and `calibration_forward_context` independently (one way to track this per batch is sketched below)
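A rough sketch of one way to track per-batch memory for this kind of comparison (illustrative; not the script used to produce the plots below):

```python
# Illustrative helper: sample allocated/reserved CUDA memory after each
# calibration batch to compare runs with and without empty_cache() or a
# calibration context.
import torch

def track_memory(model, dataloader):
    samples = []
    with torch.no_grad():
        for batch in dataloader:
            model(**batch)
            samples.append(
                (torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
            )
    return samples
```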

### Smooth Quant ###

![20c0e104-2353-4a09-9556-f953075205d2](https://github.com/user-attachments/assets/a6727da5-8350-449b-82b6-eff8f6d3d592)

### Quantization Modifier ###

![0a0451e2-108e-40fb-be5c-e9619928ab67](https://github.com/user-attachments/assets/325c2124-734f-40eb-ac3b-77debf45389e)

It was also found that removing the `empty_cache` calls in between each operation reduced the runtime of Quantization Modifier on llama3-8B by 78%.

Before
```
512/512 [03:18<00:00,  2.58it/s]
Duration: 199.38174653053284
```

After
```
512/512 [00:42<00:00, 11.91it/s]
Duration: 44.374401807785034
```

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>