Enabled VLMs via CLI on v1.19.3 #297
Conversation
Removing onnx_defer_loading flag which was originally removed in _[Removed onnx_defer_loading from Immutable Convertor Args. PR: 230]_ but got added back later in _[Mllama(single + dual) + InternVL(single) + Llava (single) PR: 267]_, maybe because of rebasing. Signed-off-by: Shubham Agrawal <[email protected]>
Force-pushed from 9567658 to ad06845
QEfficient/base/common.py
Outdated
```python
model_class = QEFFAutoModelForCausalLM
class_name = MODEL_CLASS_MAPPING.get(architecture)
if class_name:
    module = importlib.import_module("QEfficient.transformers.models.modeling_auto")
```
Do we need to use `importlib` here? Is it causing a circular import without it?
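For context, a minimal sketch of the string-keyed, lazy-import lookup in the diff above (the mapping entries here are illustrative and the structure of `modeling_auto` is assumed):

```python
import importlib

# Architecture name -> QEff auto-class name, built from the transformers mappings
# (illustrative entries only).
MODEL_CLASS_MAPPING = {
    "GPT2LMHeadModel": "QEFFAutoModelForCausalLM",
    "MllamaForConditionalGeneration": "QEFFAutoModelForImageTextToText",
}


def get_qeff_auto_class(architecture: str):
    """Resolve the QEff auto class by name, importing modeling_auto only when
    needed so this module does not pull it (and its dependencies) in at load time."""
    class_name = MODEL_CLASS_MAPPING.get(architecture)
    if class_name is None:
        raise NotImplementedError(f"Unsupported architecture: {architecture}")
    module = importlib.import_module("QEfficient.transformers.models.modeling_auto")
    return getattr(module, class_name)
```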
QEfficient/base/common.py
Outdated
```python
MODEL_CLASS_MAPPING = {}
for architecture in mapping.MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForCausalLM"

for architecture in mapping.MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForImageTextToText"
```
Can we resort to a simple technique, like: if the architecture name has the suffix `lm_head` or `CausalLM`, it's text-only; if instead it has `ConditionalGeneration`, it's image-text. We could take a look at all the suffixes in the above maps and decide this logic?
This will generate ambiguity: even in `MODEL_FOR_CAUSAL_LM_MAPPING_NAMES` we have architectures with `ConditionalGeneration` in their names, e.g. https://github.com/huggingface/transformers/blob/6966fa190172b48b2fb46fe4552a13b943e692cf/src/transformers/models/auto/modeling_auto.py#L523. Also, there are many different architecture names in `MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES`, so mapping to all these names would not be feasible, as more architectures with different names may be added in the future.
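For what it's worth, the overlap can be checked directly against the two transformers mappings referenced above (assuming a transformers version that exposes `MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES`, as the diff does):

```python
from transformers.models.auto.modeling_auto import (
    MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
    MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES,
)

# Architectures ending in "ConditionalGeneration" show up in *both* mappings,
# so a suffix-based heuristic cannot reliably separate text-only from image-text.
suffix = "ConditionalGeneration"
causal = sorted(a for a in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values() if a.endswith(suffix))
image_text = sorted(a for a in MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values() if a.endswith(suffix))
print("CausalLM mapping:", causal)
print("ImageTextToText mapping:", image_text)
```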
QEfficient/cloud/infer.py
Outdated
@@ -110,20 +117,70 @@ def main( | |||
allow_mxint8_mdp_io=allow_mxint8_mdp_io, | |||
enable_qnn=enable_qnn, | |||
qnn_config=qnn_config, | |||
img_size=img_size, |
Doesn't this fail when we pass image_size and the model is causalLM?
We can choose to drop such extra params in the case of CausalLM, like `image_url`, `img_size`, etc.
Yes, this condition was discussed on the original PR #287; we have popped out `image_size` before passing it to `compile` for CausalLM models.
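A hedged sketch of that drop-extra-params behaviour (the helper name and exact parameter list are assumptions, not the merged code):

```python
import warnings


def drop_vlm_only_options(compile_kwargs: dict) -> dict:
    """Remove VLM-only options before calling compile() on a text-only CausalLM model."""
    dropped = [key for key in ("img_size", "image_url", "image_path") if compile_kwargs.pop(key, None) is not None]
    if dropped:
        warnings.warn(f"Ignoring VLM-only options for a text-only model: {dropped}")
    return compile_kwargs
```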
QEfficient/cloud/infer.py
Outdated
```python
else:
    raise FileNotFoundError(
        'Neither Image URL nor Image Path is found, either provide "image_url" or "image_path"'
    )
```
You can add a `nor` condition and fail there. Also, if both are passed, we should issue a warning saying one of them will be ignored.
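A sketch of the suggested validation (hypothetical helper, not the final code):

```python
import warnings
from typing import Optional


def resolve_image_source(image_url: Optional[str], image_path: Optional[str]) -> str:
    """Fail when neither image source is provided; warn and prefer image_url when both are."""
    if image_url is None and image_path is None:
        raise FileNotFoundError(
            'Neither Image URL nor Image Path is found, either provide "image_url" or "image_path"'
        )
    if image_url is not None and image_path is not None:
        warnings.warn("Both image_url and image_path were passed; image_path will be ignored.")
    return image_url if image_url is not None else image_path
```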
QEfficient/cloud/infer.py
Outdated
```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt[0]},  # Currently accepting only 1 prompt
        ],
    },
]
```
Is this common for any image-text model, at least for the ones that we support?
Also, should we keep it here? Can we keep it in the constants file?
Yes, this works for both Llava and MLlama.
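For reference, the same structured conversation is turned into a model-specific prompt through the processor's chat template in recent transformers versions (the model id below is only an example):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the image?"},
        ],
    },
]
# apply_chat_template inserts the image placeholder token and the chat markup
# expected by the specific model, so the same conversation works across VLMs.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```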
```python
vision_onnx_path = compiler_options.get("vision_onnx_path", None)
lang_onnx_path = compiler_options.get("lang_onnx_path", None)
```
Why? Aren't those already parameters?
These are params, but if the user passed them, they get overwritten with the user's input.
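In other words, something along these lines (a trivial sketch; the actual handling in infer.py may differ):

```python
def resolve_onnx_paths(compiler_options: dict, default_vision=None, default_lang=None):
    """Let user-supplied compiler_options override the ONNX paths produced by export."""
    vision_onnx_path = compiler_options.get("vision_onnx_path", default_vision)
    lang_onnx_path = compiler_options.get("lang_onnx_path", default_lang)
    return vision_onnx_path, lang_onnx_path
```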
```
@@ -136,7 +136,6 @@ class QnnConstants:
     "--float_bitwidth ",
     "--preserve_io_datatype",
     "--onnx_skip_simplification",
-    "--onnx_defer_loading",
```
Why is this being changed in this PR?
This will create a config JSON file, which contains all the details about compilation and SDK versions. Currently, this code is added in the code block of QEFFAutoModelForCausalLM.compile. The config would look like below: ``` { "huggingface_config": { "vocab_size": 50257, "n_positions": 1024, "n_embd": 768, "n_layer": 12, "n_head": 12, "n_inner": null, "activation_function": "gelu_new", "resid_pdrop": 0.1, "embd_pdrop": 0.1, "attn_pdrop": 0.1, "layer_norm_epsilon": 1e-05, "initializer_range": 0.02, "summary_type": "cls_index", "summary_use_proj": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "scale_attn_weights": true, "use_cache": true, "scale_attn_by_inverse_layer_idx": false, "reorder_and_upcast_attn": false, "bos_token_id": 50256, "eos_token_id": 50256, "return_dict": true, "output_hidden_states": false, "output_attentions": false, "torchscript": false, "torch_dtype": null, "use_bfloat16": false, "tf_legacy_loss": false, "pruned_heads": {}, "tie_word_embeddings": true, "chunk_size_feed_forward": 0, "is_encoder_decoder": false, "is_decoder": false, "cross_attention_hidden_size": null, "add_cross_attention": false, "tie_encoder_decoder": false, "max_length": 20, "min_length": 0, "do_sample": false, "early_stopping": false, "num_beams": 1, "num_beam_groups": 1, "diversity_penalty": 0.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "typical_p": 1.0, "repetition_penalty": 1.0, "length_penalty": 1.0, "no_repeat_ngram_size": 0, "encoder_no_repeat_ngram_size": 0, "bad_words_ids": null, "num_return_sequences": 1, "output_scores": false, "return_dict_in_generate": false, "forced_bos_token_id": null, "forced_eos_token_id": null, "remove_invalid_values": false, "exponential_decay_length_penalty": null, "suppress_tokens": null, "begin_suppress_tokens": null, "architectures": [ "GPT2LMHeadModel" ], "finetuning_task": null, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "tokenizer_class": null, "prefix": null, "pad_token_id": null, "sep_token_id": null, "decoder_start_token_id": null, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "problem_type": null, "_name_or_path": "gpt2", "_commit_hash": "607a30d783dfa663caf39e06633721c8d4cfcd7e", "_attn_implementation_internal": "eager", "transformers_version": null, "model_type": "gpt2", "n_ctx": 1024 }, "qpc_config": { "QEff_config": { "pytorch_transforms": [ "AwqToMatmulNbitsTransform", "GPTQToMatmulNbitsTransform", "CustomOpsTransform", "KVCacheTransform" ], "onnx_transforms": [ "FP16ClipTransform", "SplitTensorsTransform" ], "onnx_path": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47/GPT2LMHeadModel.onnx" }, "aic_compiler_config": { "apps_sdk_version": "1.20.0", "compile_dir": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47", "specializtions_file_path": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47/specializations.json", "prefill_seq_len": 32, "ctx_len": 128, "batch_size": 1, "full_batch_size": null, "num_devices": 1, "num_cores": 16, "mxfp6_matmul": false, "mxint8_kv_cache": false, "num_speculative_tokens": null }, "qnn_config": { "enable_qnn": true, "qnn_config_path": "QEfficient/compile/qnn_config.json", "product": "QAIRT", "os": { "Ubuntu": 22.04, "Windows": 11 }, "sdk_flavor": [ "aic" ], "version": "2.31.0", "build_id": "250109072054_3882", "qnn_backend_api_version": "2.18.0", "tensorflow": "2.10.1", "tflite": "2.3.0", "torch": "1.13.1", "onnx": "1.16.1", "onnxruntime": "1.17.1", 
"onnxsimplifier": "0.4.36", "android-ndk": "r26c", "platform": "AIC.1.20.0.14" } } } ``` Note: The code structure may change. --------- Signed-off-by: Abukhoyer Shaik <[email protected]>
QEfficient/cloud/infer.py
Outdated
```
@@ -226,10 +283,11 @@ def main(
     Sample Config: QEfficient/compile/qnn_config.json",
     )
     parser.add_argument(
-        "qnn_config",
+        "--qnn_config",
```
In the current workflow, qnn_config is taken as an optional positional argument. Changing it to a separate --qnn_config flag would disrupt the existing flow. Instead we should pass a constant value of True for --enable_qnn and remove the qnn_config argument. @shubhagr-quic any thoughts on it?
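One way to express that suggestion with argparse, so that a bare `--enable_qnn` means True and an optional value supplies the config path (a sketch of the idea, not the final interface):

```python
import argparse

parser = argparse.ArgumentParser()
# nargs="?" with const=True: bare --enable_qnn stores True, --enable_qnn <path>
# stores the config path, and omitting the flag leaves it False.
parser.add_argument(
    "--enable_qnn",
    nargs="?",
    const=True,
    default=False,
    help="Enable QNN compilation, optionally followed by a qnn_config.json path",
)

print(parser.parse_args([]).enable_qnn)                                   # False
print(parser.parse_args(["--enable_qnn"]).enable_qnn)                     # True
print(parser.parse_args(["--enable_qnn", "qnn_config.json"]).enable_qnn)  # "qnn_config.json"
```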
```python
    device_ids=device_group,
    generation_len=generation_len,
)
print(output)
```
We should not be printing it this way in infer. Can we set a verbose level and print it accordingly from the Auto classes themselves? @ochougul @quic-amitraj
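A sketch of gating the output behind a verbosity level with the standard logging module (QEfficient's own logger utilities are not assumed here):

```python
import logging

logger = logging.getLogger("QEfficient.infer")


def report_output(output, verbose: bool = False) -> None:
    """Log generated output instead of printing it unconditionally from infer."""
    logging.basicConfig(level=logging.INFO if verbose else logging.WARNING)
    logger.info("Generation output: %s", output)
```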
QEfficient/cloud/infer.py
Outdated
```python
    nargs="?",
    type=str,
)
parser.add_argument("--img-size", "--img_size", default=None, type=int, required=False, help="Size of Image")
```
If we are taking image path and image URL as kwargs, it's better to keep it consistent. There is no point in providing only img_size as an arg.
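If img_size moved to a kwarg as well, the CLI could funnel all such VLM-specific options through the unknown-args path; a sketch of that pattern (the exact infer.py flow is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)

# Anything not declared above (e.g. --img_size, --image_url) lands in `extra`
# and can be folded into a kwargs dict forwarded to the model APIs.
args, extra = parser.parse_known_args(
    ["--model_name", "llava", "--img_size", "336", "--image_url", "https://example.com/cat.png"]
)
kwargs = {key.lstrip("-"): value for key, value in zip(extra[::2], extra[1::2])}
print(kwargs)  # {'img_size': '336', 'image_url': 'https://example.com/cat.png'}
```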
Force-pushed from bc60d47 to 76e863a
… validation page (quic#303) Signed-off-by: Abukhoyer Shaik <[email protected]>
This is just a small fix for printing the `QEFFAutoModelForCausalLM` instance, done by changing the `__repr__(self)` method. Signed-off-by: Abukhoyer Shaik <[email protected]>
Force-pushed from 76e863a to 8d99a93
Force-pushed from 3165896 to 1608804
Closing this PR as this feature will be merged into mainline.
Added support for enabling VLMs via CLI.
Sample command:
```
python -m QEfficient.cloud.infer --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --batch_size 1 --prompt_len 32 --ctx_len 512 --num_cores 16 --device_group [0] --prompt "Describe the image?" --mos 1 --allocator_dealloc_delay 1 --image_url https://i.etsystatic.com/8155076/r/il/0825c2/1594869823/il_fullxfull.1594869823_5x0w.jpg
```