
VLM: Model Tracing Guide #1030

Merged 369 commits from kylesayrs/traceability-readme into main on Jan 23, 2025
Conversation

@kylesayrs (Collaborator) commented Jan 2, 2025

Purpose

This guide explains the concepts of tracing as they relate to LLM Compressor and how to modify your model to support recipes which require using the Sequential Pipeline.

Through reading this guide, you will learn

  1. Why tracing is required when compressing with recipes involving the Sequential Pipeline and modifiers such as GPTQModifier
  2. How to determine if your model is traceable for your dataset
  3. How to modify your model definition to be traceable (a generic sketch of such a change follows this list)
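
As a generic illustration (not taken from the guide itself, and assuming a torch.fx-style tracer such as the one LLM Compressor builds on), the kind of modification discussed in point 3 often amounts to replacing data-dependent control flow with tensor operations:

```python
import torch

class Block(torch.nn.Module):
    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Untraceable pattern: branching on a tensor's value bakes a single
        # path into the traced graph (and typically raises a TraceError):
        #   if mask.sum() > 0:
        #       x = x * mask
        # Traceable rewrite: express the branch as tensor arithmetic so the
        # graph captures both outcomes.
        return torch.where(mask.bool(), x * mask, x)
```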

Prerequisites

Changes

  • Add a model tracing guide src/llmcompressor/transformers/tracing/README.md with pictures
  • Add a readme for the sequential pipeline which points to the Tracing Guide src/llmcompressor/pipelines/sequential/README.md
  • Add a debug script to help users debug their models for traceability src/llmcompressor/transformers/tracing/debug.py
    • Add the llm-compressor.attempt_trace entrypoint for ease of use
  • Swap the order of arguments in llava_example.py and pixtral_example.py to match the order of arguments on the modifier

Testing

Use the llmcompressor.trace debug script:

```bash
llmcompressor.trace \
    --model_id llava-hf/llava-1.5-7b-hf \
    --model_class TraceableLlavaForConditionalGeneration \
    --sequential-targets LlamaDecoderLayer \
    --ignore "re:.*lm_head" "re:vision_tower.*" "re:multi_modal_projector.*" \
    --modality vision
```

Stretch

It might be nice if this tracing debugger tool also printed the model graph to an SVG.
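
For reference, one possible way to do this (a sketch only, not part of this PR; it assumes the traced torch.fx GraphModule is available and that pydot and graphviz are installed) is torch.fx's built-in graph drawer:

```python
from torch.fx.passes.graph_drawer import FxGraphDrawer

def dump_graph_svg(graph_module, path="model_graph.svg"):
    # Render the traced GraphModule as a graphviz diagram and write it to SVG.
    drawer = FxGraphDrawer(graph_module, "model")
    drawer.get_dot_graph().write_svg(path)
```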

kylesayrs added a commit that referenced this pull request Jan 15, 2025
## Purpose ##
* Allow VLM processors to be used to tokenize datasets with prompt keys

## Postrequisites ##
* #1030

## Changes ##
* Use `text` argument name for tokenizing the prompt column

## Testing ##
* w.r.t. tokenizers, using the `text` kwarg follows the precedent set by [PretrainedTokenizerBase](https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L2790)
* w.r.t. processors, most processors use the `text` kwarg

Below are all the models I know to be compatible with this change; I'm assuming that most other processors follow the same standard:
1. [llama](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L233)
2. [pixtral](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pixtral/processing_pixtral.py#L160)
3. [phi3_vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/blob/main/processing_phi3_v.py#L321)
4. [mllama](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mllama/processing_mllama.py#L232)
5. [qwen2_vl](https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/processing_qwen2_vl.py#L71)

Example of using a VLM processor to tokenize a dataset with a prompt key:
```python3
from transformers import AutoProcessor
from llmcompressor.transformers import DataTrainingArguments, TextGenerationDataset

models_to_test = [
  "meta-llama/Meta-Llama-3-8B-Instruct",
  "mistralai/Mixtral-8x7B-Instruct-v0.1",
  "Qwen/Qwen2-VL-2B-Instruct",  # fails without changes
  "mgoin/pixtral-12b",  # fails without changes
]

for model_id in models_to_test:
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

  data_args = DataTrainingArguments(
      dataset="ultrachat-200k",
      splits={"calibration": "test_sft[:1]"}
  )

  dataset = TextGenerationDataset.load_from_registry(
      data_args.dataset,
      data_args=data_args,
      split=data_args.splits["calibration"],
      processor=processor,
  )(add_labels=False)
```

Signed-off-by: Kyle Sayers <[email protected]>
@mgoin (Member) previously approved these changes Jan 20, 2025 and left a comment:

Great work. We should consider adding a Read the Docs build, like vLLM's, to render these out.

Signed-off-by: Kyle Sayers <[email protected]>

Co-authored-by: Michael Goin <[email protected]>
@kylesayrs mentioned this pull request Jan 20, 2025
@dsikka (Collaborator) left a comment:

Great job.

A couple of nits:

  1. I wouldn't refer to the SparseGPTModifier until we've actually started using data pipelines outside of the GPTQModifier
  2. A helpful comment on what to focus on when looking at the images would be nice

@dsikka merged commit e48d9db into main on Jan 23, 2025
6 of 7 checks passed
@dsikka deleted the kylesayrs/traceability-readme branch on January 23, 2025 at 17:01
dsikka pushed a commit that referenced this pull request Jan 27, 2025
## Purpose ##
* Create a landing page for those looking to use VLMs
* Advertise VLM support on homepage

## Prerequisites ##
* #1030

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
rahul-tuli pushed a commit that referenced this pull request Jan 28, 2025
## Purpose ##
* Create a landing page for those looking to use VLMs
* Advertise VLM support on homepage

## Prerequisites ##
* #1030

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
dsikka added a commit that referenced this pull request Feb 5, 2025
## Purpose ##
* Remove layer compressor to decouple modifiers from data pipelines
* Reduce abstractions
* Support VLMs with SparseGPT and Wanda

## Prerequisites ##
* #1021
* #1023
* #1068
* #1030

## Changes ##
### Interface/ Features ###
* SparseGPT and Wanda now both support VLM architectures
* Added `sequential_targets` to match GPTQ and made `targets` an alias (see the sketch after this list)
* Support hessian offloading for `SparseGPT`
* Add customized `_LinAlgError` for `SparseGPT`
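
A rough sketch of the updated interface (the argument names come from this description; the import paths and `oneshot` signature are assumptions and may differ from the released API):

```python
from llmcompressor.modifiers.obcq import SparseGPTModifier  # import path assumed
from llmcompressor.transformers import oneshot  # entrypoint location assumed

# `sequential_targets` mirrors GPTQModifier; `targets` remains available as an alias.
recipe = SparseGPTModifier(
    sparsity=0.5,
    mask_structure="2:4",
    sequential_targets=["LlamaDecoderLayer"],
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",
    dataset="ultrachat-200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```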

### Implementations ###
* Changed implementation styles of `SparseGPTModifier` and
`WandaPruningModifier` to match `GPTQModifier`
* Removed `LayerCompressor`, `ModuleCompressionWrapper`,
`SparseGptWrapper`, and `WandaWrapper`
* Shared implementations between SparseGPT and Wanda are implemented by
the `SparsityModifierMixin`
* Removed lines blocking `allow_tf32`
  * Maybe @rahul-tuli knows why this was originally implemented, potentially to avoid hardware issues?
  * This change was only present for Wanda. Given that all other modifiers do not have this change, I see no reason why it should stay
* Updated sparsegpt tests to reflect new implementation

### Tests ###
* Updated obcq tests to reflect new implementations
* Removed `test_sgpt_defaults.py` since this test doesn't test anything
new or novel about this modifier

## Testing ##
* `grep -r "LayerCompressor\|ModuleCompressionWrapper\|SparseGptWrapper\|WandaWrapper" src/ examples/ tests/`
* Modified `test_invalid_layerwise_recipes_raise_exceptions` and `test_successful_layerwise_recipe` pass
* `llama3_8b_2of4.py` passes and was evaluated with both SparseGPT and Wanda

## Potential Follow ups ##
* Add module `targets` and `ignore` to SparseGPT and Wanda

## Regression Testing ##
The hessian, row scalar, and compressed weight values were confirmed to be unchanged in the case of one calibration sample. The final evaluations differ, which is likely due to numerical imprecision (dividing by int vs torch.int) and different pipelines (different subgraph partitions => different imprecision from CPU offloading, potentially different module arguments).

### Evaluation
Models were compressed using
`examples/sparse_2of4_quantization_fp8/llama3_8b_2of4.py`
<details><summary>sparsegpt</summary>

Main
```
hf (pretrained=/home/ksayers/llm-compressor/old_Llama-3.2-1B-Instruct2of4-sparse,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1                                                           
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|                                                        
|----------|------:|------|-----:|------|---|-----:|---|-----:|                                                        
|winogrande|      1|none  |     5|acc   |↑  |0.5391|±  | 0.014|
```

Branch
```
hf (pretrained=/home/ksayers/llm-compressor/new_Llama-3.2-1B-Instruct2of4-sparse,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------|------:|------|-----:|------|---|----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.547|±  | 0.014|
```
</details>

To test Wanda, the `SparseGPTModifier` was replaced with the `WandaPruningModifier`.

<details><summary>wanda</summary>

Main
```
hf (pretrained=/home/kyle/old_llm-compressor/Llama-3.2-1B-Instruct2of4-sparse,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------|------:|------|-----:|------|---|----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.532|±  | 0.014|
```

Branch
```
hf (pretrained=/home/kyle/llm-compressor/Llama-3.2-1B-Instruct2of4-sparse,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.5414|±  | 0.014|
```
</details>

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
kylesayrs added a commit that referenced this pull request Mar 11, 2025
## Purpose ##
* Provide a predefined audio dataset for
  * Testing traceability of audio models
  * e2e tests with audio models
  * Simpler examples (blog)

## Prerequisites ##
* #1030
* #1085

## Changes ##
* Implement `PeoplesSpeech` dataset
  * Because of the more complex nature of audio processors, this dataset needs to hardcode some processing logic specific to models
  * Assumes that most processing is similar to whisper processing, which seems to be the standard
  * Because processing changes depending on the model, this means mapped outputs cannot be cached
* Add `load_from_cache_file` argument to preprocessing mapping (this was overlooked before; see the sketch after this list)
* Integrate dataset with tracing debugger tool
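
A minimal sketch of what the cache flag means in practice (the dataset id, config name, and `preprocess_fn` here are illustrative assumptions, not code from this PR): because the preprocessing is model-specific, mapped outputs should not be reused from the Hugging Face datasets cache.

```python
from datasets import load_dataset

def preprocess_fn(sample):
    # Model-specific feature extraction would happen here (hypothetical placeholder).
    return sample

ds = load_dataset("MLCommons/peoples_speech", "clean", split="train[:8]")
# load_from_cache_file=False forces the map to re-run, since cached outputs
# produced for one model's processor may be wrong for another.
ds = ds.map(preprocess_fn, load_from_cache_file=False)
```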

## Testing ##
```bash
llmcompressor.trace \
    --model_id openai/whisper-large-v2 \
    --model_class TraceableWhisperForConditionalGeneration \
    --modality audio
```

The traceable definition of qwen2_audio is not finished yet, but this model loads and is accepted as valid input:
```bash
llmcompressor.trace \
    --model_id Qwen/Qwen2-Audio-7B \
    --model_class Qwen2AudioForConditionalGeneration \
    --modality audio
```

---------

Signed-off-by: Kyle Sayers <[email protected]>
Labels: ready