Enabled VLMs via CLI on v1.19.3 #297
Conversation
Removing the onnx_defer_loading flag, which was originally removed in _[Removed onnx_defer_loading from Immutable Convertor Args. PR: 230]_ but got added back later in _[Mllama(single + dual) + InternVL(single) + Llava (single) PR: 267]_, possibly because of rebasing. Signed-off-by: Shubham Agrawal <[email protected]>
Force-pushed 9567658 to ad06845
QEfficient/base/common.py
Outdated
model_class = QEFFAutoModelForCausalLM
class_name = MODEL_CLASS_MAPPING.get(architecture)
if class_name:
    module = importlib.import_module("QEfficient.transformers.models.modeling_auto")
Do we need to use importlib here?
Is it causing a circular import without it?
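For context, deferring the import via importlib (rather than a top-level import) is a common way to break an import cycle at module load time. A minimal sketch, assuming the module path shown in the diff above; the helper name is illustrative:

```python
import importlib

def resolve_model_class(class_name: str):
    # Import modeling_auto lazily so QEfficient.base.common does not pull it in
    # at module load time, which is what would create a circular import.
    module = importlib.import_module("QEfficient.transformers.models.modeling_auto")
    return getattr(module, class_name)
```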
QEfficient/base/common.py
Outdated
MODEL_CLASS_MAPPING = {}
for architecture in mapping.MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForCausalLM"

for architecture in mapping.MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForImageTextToText"
Can we resort to a simpler technique: if the architecture name has a suffix like lm_head or CausalLM, it's text-only; if it has ConditionalGeneration, it's image-text?
We could take a look at all the suffixes in the above maps and decide this logic.
This will generate ambiguity: even MODEL_FOR_CAUSAL_LM_MAPPING_NAMES contains architectures with ConditionalGeneration in their names, e.g. https://github.com/huggingface/transformers/blob/6966fa190172b48b2fb46fe4552a13b943e692cf/src/transformers/models/auto/modeling_auto.py#L523
Also, there are many different architecture names in MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES, so a suffix-based rule would not cover all of them, and more architectures with different names might be added in the future.
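A minimal sketch of the mapping-based dispatch and of why a suffix check is ambiguous, assuming a transformers version that exposes both auto mappings used in the diff; the check at the end is illustrative:

```python
from transformers.models.auto import modeling_auto as mapping

# Explicit architecture -> QEff class-name mapping, instead of guessing from suffixes.
MODEL_CLASS_MAPPING = {}
for architecture in mapping.MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForCausalLM"
for architecture in mapping.MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForImageTextToText"

# A suffix heuristic would misroute any text-only architecture whose class name
# also ends in "ConditionalGeneration" (see the upstream line linked above).
ambiguous = [
    arch
    for arch in mapping.MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values()
    if arch.endswith("ConditionalGeneration")
]
print(ambiguous)
```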
QEfficient/cloud/infer.py
Outdated
allow_mxint8_mdp_io=allow_mxint8_mdp_io,
enable_qnn=enable_qnn,
qnn_config=qnn_config,
img_size=img_size,
Doesn't this fail when we pass image_size and the model is causalLM?
We can choose to drop such extra params, like image_url, img_size, etc., in the case of causalLM.
Yes, this condition was discussed on the original PR #287; we pop out the image_size before passing the arguments to compile for CausalLM models.
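A rough sketch of the idea (not the exact PR code); `qeff_model` is a placeholder for the loaded QEFFAutoModelForCausalLM instance and the kwarg names follow the CLI flags discussed here:

```python
def compile_causal_lm(qeff_model, **kwargs):
    # Image-specific options are meaningless for text-only models, so drop them
    # before forwarding the remaining kwargs to compile().
    for image_only_kwarg in ("img_size", "image_url", "image_path"):
        kwargs.pop(image_only_kwarg, None)
    return qeff_model.compile(**kwargs)
```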
QEfficient/cloud/infer.py
Outdated
else:
    raise FileNotFoundError(
        'Neither Image URL nor Image Path is found, either provide "image_url" or "image_path"'
    )
You can add a nor condition and fail there. Also, if both are passed, we should issue a warning saying one of them will be ignored.
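A sketch of the suggested validation; `fetch_image` is a hypothetical helper for downloading the URL, the rest is standard library plus Pillow:

```python
import logging

from PIL import Image

logger = logging.getLogger(__name__)

def load_input_image(image_url=None, image_path=None):
    # Warn when both sources are given: only one of them will be used.
    if image_url and image_path:
        logger.warning("Both image_url and image_path were provided; image_path will be ignored.")
        image_path = None
    if image_url:
        return fetch_image(image_url)  # hypothetical helper that downloads the image
    if image_path:
        return Image.open(image_path)
    raise FileNotFoundError(
        'Neither Image URL nor Image Path is found, either provide "image_url" or "image_path"'
    )
```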
QEfficient/cloud/infer.py
Outdated
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt[0]},  # Currently accepting only 1 prompt
        ],
    },
]
Is this common for any image-text model, at least for the ones that we support?
Also should we keep it here? can we keep it in the constants file?
Yes, this works for both Llava and MLlama.
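A sketch of moving the template into a shared helper and rendering it through the processor's chat template; the helper name and location are hypothetical, `apply_chat_template` is the standard transformers processor API:

```python
from transformers import AutoProcessor

# Could live in QEfficient/utils/constants.py (hypothetical location).
def get_conversation(user_prompt: str):
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": user_prompt},  # currently only one prompt is accepted
            ],
        },
    ]

# infer.py side: render the model-specific prompt string.
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
chat_prompt = processor.apply_chat_template(get_conversation("Describe the image?"), add_generation_prompt=True)
```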
vision_onnx_path = compiler_options.get("vision_onnx_path", None)
lang_onnx_path = compiler_options.get("lang_onnx_path", None)
why? aren't those already parameters?
These are params, but if the user passes them here, the existing values get overwritten with the user's input.
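A small sketch of the precedence being described; the function and default-path names are illustrative:

```python
def resolve_onnx_paths(compiler_options: dict, default_vision_onnx_path=None, default_lang_onnx_path=None):
    # Values passed by the user in compiler_options win over the existing defaults.
    vision_onnx_path = compiler_options.get("vision_onnx_path") or default_vision_onnx_path
    lang_onnx_path = compiler_options.get("lang_onnx_path") or default_lang_onnx_path
    return vision_onnx_path, lang_onnx_path
```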
QEfficient/utils/constants.py
Outdated
| "--float_bitwidth ", | ||
| "--preserve_io_datatype", | ||
| "--onnx_skip_simplification", | ||
| "--onnx_defer_loading", |
why is this being changed in this PR?
This will create a config JSON file containing all the details
about compilation and SDK versions.
Currently, this code is added inside the
QEFFAutoModelForCausalLM.compile code block.
The config would look like below (a rough sketch of dumping such a file follows the example):
```
{
"huggingface_config": {
"vocab_size": 50257,
"n_positions": 1024,
"n_embd": 768,
"n_layer": 12,
"n_head": 12,
"n_inner": null,
"activation_function": "gelu_new",
"resid_pdrop": 0.1,
"embd_pdrop": 0.1,
"attn_pdrop": 0.1,
"layer_norm_epsilon": 1e-05,
"initializer_range": 0.02,
"summary_type": "cls_index",
"summary_use_proj": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"scale_attn_weights": true,
"use_cache": true,
"scale_attn_by_inverse_layer_idx": false,
"reorder_and_upcast_attn": false,
"bos_token_id": 50256,
"eos_token_id": 50256,
"return_dict": true,
"output_hidden_states": false,
"output_attentions": false,
"torchscript": false,
"torch_dtype": null,
"use_bfloat16": false,
"tf_legacy_loss": false,
"pruned_heads": {},
"tie_word_embeddings": true,
"chunk_size_feed_forward": 0,
"is_encoder_decoder": false,
"is_decoder": false,
"cross_attention_hidden_size": null,
"add_cross_attention": false,
"tie_encoder_decoder": false,
"max_length": 20,
"min_length": 0,
"do_sample": false,
"early_stopping": false,
"num_beams": 1,
"num_beam_groups": 1,
"diversity_penalty": 0.0,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"typical_p": 1.0,
"repetition_penalty": 1.0,
"length_penalty": 1.0,
"no_repeat_ngram_size": 0,
"encoder_no_repeat_ngram_size": 0,
"bad_words_ids": null,
"num_return_sequences": 1,
"output_scores": false,
"return_dict_in_generate": false,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"remove_invalid_values": false,
"exponential_decay_length_penalty": null,
"suppress_tokens": null,
"begin_suppress_tokens": null,
"architectures": [
"GPT2LMHeadModel"
],
"finetuning_task": null,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"tokenizer_class": null,
"prefix": null,
"pad_token_id": null,
"sep_token_id": null,
"decoder_start_token_id": null,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"problem_type": null,
"_name_or_path": "gpt2",
"_commit_hash": "607a30d783dfa663caf39e06633721c8d4cfcd7e",
"_attn_implementation_internal": "eager",
"transformers_version": null,
"model_type": "gpt2",
"n_ctx": 1024
},
"qpc_config": {
"QEff_config": {
"pytorch_transforms": [
"AwqToMatmulNbitsTransform",
"GPTQToMatmulNbitsTransform",
"CustomOpsTransform",
"KVCacheTransform"
],
"onnx_transforms": [
"FP16ClipTransform",
"SplitTensorsTransform"
],
"onnx_path": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47/GPT2LMHeadModel.onnx"
},
"aic_compiler_config": {
"apps_sdk_version": "1.20.0",
"compile_dir": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47",
"specializtions_file_path": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47/specializations.json",
"prefill_seq_len": 32,
"ctx_len": 128,
"batch_size": 1,
"full_batch_size": null,
"num_devices": 1,
"num_cores": 16,
"mxfp6_matmul": false,
"mxint8_kv_cache": false,
"num_speculative_tokens": null
},
"qnn_config": {
"enable_qnn": true,
"qnn_config_path": "QEfficient/compile/qnn_config.json",
"product": "QAIRT",
"os": {
"Ubuntu": 22.04,
"Windows": 11
},
"sdk_flavor": [
"aic"
],
"version": "2.31.0",
"build_id": "250109072054_3882",
"qnn_backend_api_version": "2.18.0",
"tensorflow": "2.10.1",
"tflite": "2.3.0",
"torch": "1.13.1",
"onnx": "1.16.1",
"onnxruntime": "1.17.1",
"onnxsimplifier": "0.4.36",
"android-ndk": "r26c",
"platform": "AIC.1.20.0.14"
}
}
}
```
Note: The code structure may change.
---------
Signed-off-by: Abukhoyer Shaik <[email protected]>
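A minimal sketch of how such a config JSON could be assembled and written next to the compiled QPC; the helper name, file name, and field grouping are illustrative, not the exact PR code:

```python
import json
import os

def dump_qconfig(compile_dir: str, huggingface_config: dict, qpc_config: dict) -> str:
    # Persist compilation and SDK details alongside the compiled artifacts.
    config = {"huggingface_config": huggingface_config, "qpc_config": qpc_config}
    config_path = os.path.join(compile_dir, "qconfig.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config_path
```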
QEfficient/cloud/infer.py
Outdated
)
parser.add_argument(
    "qnn_config",
    "--qnn_config",
In the current workflow, qnn_config is taken as an optional argument. Changing it to a positional argument would disrupt the existing flow. Instead, we should pass a constant value of True for --enable_qnn and remove the qnn_config argument. @shubhagr-quic any thoughts on it?
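A sketch of the suggestion using plain argparse: keep --qnn_config optional and keep --enable_qnn as a simple flag (the help strings are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
# Optional, so the existing flow is not disrupted.
parser.add_argument("--qnn_config", type=str, default=None, required=False,
                    help="Path to the QNN compiler config JSON")
# On/off switch rather than a positional argument.
parser.add_argument("--enable_qnn", action="store_true",
                    help="Compile through the QNN SDK instead of the default flow")
```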
device_ids=device_group,
generation_len=generation_len,
)
print(output)
We should not be printing it this way in infer. Can we set a verbose level and print it accordingly from the Auto classes themselves? @ochougul @quic-amitraj
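A sketch of the suggestion: route the generated output through a logger with a configurable verbosity instead of an unconditional print; the function name is illustrative:

```python
import logging

logger = logging.getLogger("QEfficient")

def report_generation(output, verbose: bool = False) -> None:
    # INFO when the user asked for output, DEBUG otherwise, instead of print().
    if verbose:
        logger.info("Generated output: %s", output)
    else:
        logger.debug("Generated output: %s", output)
```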
QEfficient/cloud/infer.py
Outdated
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt[0]},  # Currently accepting only 1 prompt
        ],
    },
]
Also should we keep it here? can we keep it in the constants file?
QEfficient/cloud/infer.py
Outdated
nargs="?",
type=str,
)
parser.add_argument("--img-size", "--img_size", default=None, type=int, required=False, help="Size of Image")
If we are taking image path and image URL as kwargs, it's better to keep this consistent. There is no point in providing only img_size as an explicit arg.
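A sketch of one consistent option: expose all image inputs as explicit optional flags (names follow this thread, help strings are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--image_url", type=str, default=None, help="URL of the input image")
parser.add_argument("--image_path", type=str, default=None, help="Local path of the input image")
parser.add_argument("--img-size", "--img_size", dest="img_size", type=int, default=None,
                    help="Size of the input image")
```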
Force-pushed bc60d47 to 76e863a
… validation page (quic#303) Signed-off-by: Abukhoyer Shaik <[email protected]>
This is just a small fix for printing the `QEFFAutoModelForCausalLM` instance, done by changing the `__repr__(self)` method. Signed-off-by: Abukhoyer Shaik <[email protected]>
Force-pushed 76e863a to 8d99a93
Force-pushed 3165896 to 1608804
Closing this PR as this feature will be merged into mainline.
Added support for enabling VLMs via CLI.
Sample command:
python -m QEfficient.cloud.infer --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --batch_size 1 --prompt_len 32 --ctx_len 512 --num_cores 16 --device_group [0] --prompt "Describe the image?" --mos 1 --allocator_dealloc_delay 1 --image_url https://i.etsystatic.com/8155076/r/il/0825c2/1594869823/il_fullxfull.1594869823_5x0w.jpg