Enabled VLMs via CLI on v1.19.3 #297
Conversation
Removing onnx_defer_loading flag which was originally removed in _[Removed onnx_defer_loading from Immutable Convertor Args. PR: 230]_ but got added back later in _[Mllama(single + dual) + InternVL(single) + Llava (single) PR: 267]_, maybe because of rebasing. Signed-off-by: Shubham Agrawal <[email protected]>
Force-pushed from 9567658 to ad06845
QEfficient/base/common.py
Outdated
```python
model_class = QEFFAutoModelForCausalLM
class_name = MODEL_CLASS_MAPPING.get(architecture)
if class_name:
    module = importlib.import_module("QEfficient.transformers.models.modeling_auto")
```
Do we need to use `importlib` here? Is it causing a circular import without it?
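For context, a minimal sketch of the string-keyed, lazy-import lookup in the diff above (the mapping entries here are illustrative and the structure of `modeling_auto` is assumed):

```python
import importlib

# Architecture name -> QEff auto-class name, built from the transformers mappings
# (illustrative entries only).
MODEL_CLASS_MAPPING = {
    "GPT2LMHeadModel": "QEFFAutoModelForCausalLM",
    "MllamaForConditionalGeneration": "QEFFAutoModelForImageTextToText",
}


def get_qeff_auto_class(architecture: str):
    """Resolve the QEff auto class by name, importing modeling_auto only when
    needed so this module does not pull it (and its dependencies) in at load time."""
    class_name = MODEL_CLASS_MAPPING.get(architecture)
    if class_name is None:
        raise NotImplementedError(f"Unsupported architecture: {architecture}")
    module = importlib.import_module("QEfficient.transformers.models.modeling_auto")
    return getattr(module, class_name)
```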
QEfficient/base/common.py
Outdated
```python
MODEL_CLASS_MAPPING = {}
for architecture in mapping.MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForCausalLM"

for architecture in mapping.MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values():
    MODEL_CLASS_MAPPING[architecture] = "QEFFAutoModelForImageTextToText"
```
Can we resort to a simple technique, like: if the architecture name has the suffix `lm_head` or `CausalLM`, it's text-only; if instead it has `ConditionalGeneration`, it's image-text. We could take a look at all the suffixes in the above maps and decide this logic?
This will generate ambiguity: even in `MODEL_FOR_CAUSAL_LM_MAPPING_NAMES` we have architectures with `ConditionalGeneration` in their names, e.g. https://github.com/huggingface/transformers/blob/6966fa190172b48b2fb46fe4552a13b943e692cf/src/transformers/models/auto/modeling_auto.py#L523. Also, there are many different architecture names in `MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES`, so mapping to all these names would not be feasible, as more architectures with different names may be added in the future.
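For what it's worth, the overlap can be checked directly against the two transformers mappings referenced above (assuming a transformers version that exposes `MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES`, as the diff does):

```python
from transformers.models.auto.modeling_auto import (
    MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
    MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES,
)

# Architectures ending in "ConditionalGeneration" show up in *both* mappings,
# so a suffix-based heuristic cannot reliably separate text-only from image-text.
suffix = "ConditionalGeneration"
causal = sorted(a for a in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values() if a.endswith(suffix))
image_text = sorted(a for a in MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values() if a.endswith(suffix))
print("CausalLM mapping:", causal)
print("ImageTextToText mapping:", image_text)
```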
QEfficient/cloud/infer.py
Outdated
@@ -110,20 +117,70 @@ def main( | |||
allow_mxint8_mdp_io=allow_mxint8_mdp_io, | |||
enable_qnn=enable_qnn, | |||
qnn_config=qnn_config, | |||
img_size=img_size, |
Doesn't this fail when we pass image_size and the model is causalLM?
We can choose to drop such extra params in the case of CausalLM, like `image_url`, `img_size`, etc.
Yes, this condition was discussed on the original PR #287; we have popped out `image_size` before passing it to `compile` for CausalLM models.
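A hedged sketch of that drop-extra-params behaviour (the helper name and exact parameter list are assumptions, not the merged code):

```python
import warnings


def drop_vlm_only_options(compile_kwargs: dict) -> dict:
    """Remove VLM-only options before calling compile() on a text-only CausalLM model."""
    dropped = [key for key in ("img_size", "image_url", "image_path") if compile_kwargs.pop(key, None) is not None]
    if dropped:
        warnings.warn(f"Ignoring VLM-only options for a text-only model: {dropped}")
    return compile_kwargs
```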
QEfficient/cloud/infer.py
Outdated
```python
else:
    raise FileNotFoundError(
        'Neither Image URL nor Image Path is found, either provide "image_url" or "image_path"'
    )
```
You can add a `nor` condition and fail there. Also, if both are passed, we should issue a warning saying one of them will be ignored.
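A sketch of the suggested validation (hypothetical helper, not the final code):

```python
import warnings
from typing import Optional


def resolve_image_source(image_url: Optional[str], image_path: Optional[str]) -> str:
    """Fail when neither image source is provided; warn and prefer image_url when both are."""
    if image_url is None and image_path is None:
        raise FileNotFoundError(
            'Neither Image URL nor Image Path is found, either provide "image_url" or "image_path"'
        )
    if image_url is not None and image_path is not None:
        warnings.warn("Both image_url and image_path were passed; image_path will be ignored.")
    return image_url if image_url is not None else image_path
```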
QEfficient/cloud/infer.py
Outdated
```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt[0]},  # Currently accepting only 1 prompt
        ],
    },
]
```
Is this common for any image-text model, at least for the ones that we support?
Also, should we keep it here? Can we keep it in the constants file?
Yes, this works for both Llava and MLlama.
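For reference, the same structured conversation is turned into a model-specific prompt through the processor's chat template in recent transformers versions (the model id below is only an example):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the image?"},
        ],
    },
]
# apply_chat_template inserts the image placeholder token and the chat markup
# expected by the specific model, so the same conversation works across VLMs.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```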
```python
vision_onnx_path = compiler_options.get("vision_onnx_path", None)
lang_onnx_path = compiler_options.get("lang_onnx_path", None)
```
Why? Aren't those already parameters?
These are params, but if the user passed them, they get overwritten with the user's input.
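In other words, something along these lines (a trivial sketch; the actual handling in infer.py may differ):

```python
def resolve_onnx_paths(compiler_options: dict, default_vision=None, default_lang=None):
    """Let user-supplied compiler_options override the ONNX paths produced by export."""
    vision_onnx_path = compiler_options.get("vision_onnx_path", default_vision)
    lang_onnx_path = compiler_options.get("lang_onnx_path", default_lang)
    return vision_onnx_path, lang_onnx_path
```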
```
@@ -136,7 +136,6 @@ class QnnConstants:
     "--float_bitwidth ",
     "--preserve_io_datatype",
     "--onnx_skip_simplification",
-    "--onnx_defer_loading",
```
Why is this being changed in this PR?
This will create a config JSON file, which contains all the details about compilation and SDK versions. Currently, this code is added in the code block of QEFFAutoModelForCausalLM.compile. The config would look like below: ``` { "huggingface_config": { "vocab_size": 50257, "n_positions": 1024, "n_embd": 768, "n_layer": 12, "n_head": 12, "n_inner": null, "activation_function": "gelu_new", "resid_pdrop": 0.1, "embd_pdrop": 0.1, "attn_pdrop": 0.1, "layer_norm_epsilon": 1e-05, "initializer_range": 0.02, "summary_type": "cls_index", "summary_use_proj": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "scale_attn_weights": true, "use_cache": true, "scale_attn_by_inverse_layer_idx": false, "reorder_and_upcast_attn": false, "bos_token_id": 50256, "eos_token_id": 50256, "return_dict": true, "output_hidden_states": false, "output_attentions": false, "torchscript": false, "torch_dtype": null, "use_bfloat16": false, "tf_legacy_loss": false, "pruned_heads": {}, "tie_word_embeddings": true, "chunk_size_feed_forward": 0, "is_encoder_decoder": false, "is_decoder": false, "cross_attention_hidden_size": null, "add_cross_attention": false, "tie_encoder_decoder": false, "max_length": 20, "min_length": 0, "do_sample": false, "early_stopping": false, "num_beams": 1, "num_beam_groups": 1, "diversity_penalty": 0.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "typical_p": 1.0, "repetition_penalty": 1.0, "length_penalty": 1.0, "no_repeat_ngram_size": 0, "encoder_no_repeat_ngram_size": 0, "bad_words_ids": null, "num_return_sequences": 1, "output_scores": false, "return_dict_in_generate": false, "forced_bos_token_id": null, "forced_eos_token_id": null, "remove_invalid_values": false, "exponential_decay_length_penalty": null, "suppress_tokens": null, "begin_suppress_tokens": null, "architectures": [ "GPT2LMHeadModel" ], "finetuning_task": null, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "tokenizer_class": null, "prefix": null, "pad_token_id": null, "sep_token_id": null, "decoder_start_token_id": null, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "problem_type": null, "_name_or_path": "gpt2", "_commit_hash": "607a30d783dfa663caf39e06633721c8d4cfcd7e", "_attn_implementation_internal": "eager", "transformers_version": null, "model_type": "gpt2", "n_ctx": 1024 }, "qpc_config": { "QEff_config": { "pytorch_transforms": [ "AwqToMatmulNbitsTransform", "GPTQToMatmulNbitsTransform", "CustomOpsTransform", "KVCacheTransform" ], "onnx_transforms": [ "FP16ClipTransform", "SplitTensorsTransform" ], "onnx_path": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47/GPT2LMHeadModel.onnx" }, "aic_compiler_config": { "apps_sdk_version": "1.20.0", "compile_dir": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47", "specializtions_file_path": "/root/.cache/qeff_models/GPT2LMHeadModel-36f0eca92731bb47/specializations.json", "prefill_seq_len": 32, "ctx_len": 128, "batch_size": 1, "full_batch_size": null, "num_devices": 1, "num_cores": 16, "mxfp6_matmul": false, "mxint8_kv_cache": false, "num_speculative_tokens": null }, "qnn_config": { "enable_qnn": true, "qnn_config_path": "QEfficient/compile/qnn_config.json", "product": "QAIRT", "os": { "Ubuntu": 22.04, "Windows": 11 }, "sdk_flavor": [ "aic" ], "version": "2.31.0", "build_id": "250109072054_3882", "qnn_backend_api_version": "2.18.0", "tensorflow": "2.10.1", "tflite": "2.3.0", "torch": "1.13.1", "onnx": "1.16.1", "onnxruntime": "1.17.1", 
"onnxsimplifier": "0.4.36", "android-ndk": "r26c", "platform": "AIC.1.20.0.14" } } } ``` Note: The code structure may change. --------- Signed-off-by: Abukhoyer Shaik <[email protected]>
QEfficient/cloud/infer.py
Outdated
```
@@ -226,10 +283,11 @@ def main(
     Sample Config: QEfficient/compile/qnn_config.json",
     )
     parser.add_argument(
-        "qnn_config",
+        "--qnn_config",
```
In the current workflow, qnn_config is taken as an optional positional argument. Changing it to a separate --qnn_config flag would disrupt the existing flow. Instead we should pass a constant value of True for --enable_qnn and remove the qnn_config argument. @shubhagr-quic any thoughts on it?
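One way to express that suggestion with argparse, so that a bare `--enable_qnn` means True and an optional value supplies the config path (a sketch of the idea, not the final interface):

```python
import argparse

parser = argparse.ArgumentParser()
# nargs="?" with const=True: bare --enable_qnn stores True, --enable_qnn <path>
# stores the config path, and omitting the flag leaves it False.
parser.add_argument(
    "--enable_qnn",
    nargs="?",
    const=True,
    default=False,
    help="Enable QNN compilation, optionally followed by a qnn_config.json path",
)

print(parser.parse_args([]).enable_qnn)                                   # False
print(parser.parse_args(["--enable_qnn"]).enable_qnn)                     # True
print(parser.parse_args(["--enable_qnn", "qnn_config.json"]).enable_qnn)  # "qnn_config.json"
```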
```python
    device_ids=device_group,
    generation_len=generation_len,
)
print(output)
```
We should not be printing it this way in infer. Can we set a verbose level and print it accordingly from the Auto classes themselves? @ochougul @quic-amitraj
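A sketch of gating the output behind a verbosity level with the standard logging module (QEfficient's own logger utilities are not assumed here):

```python
import logging

logger = logging.getLogger("QEfficient.infer")


def report_output(output, verbose: bool = False) -> None:
    """Log generated output instead of printing it unconditionally from infer."""
    logging.basicConfig(level=logging.INFO if verbose else logging.WARNING)
    logger.info("Generation output: %s", output)
```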
QEfficient/cloud/infer.py
Outdated
```python
    nargs="?",
    type=str,
)
parser.add_argument("--img-size", "--img_size", default=None, type=int, required=False, help="Size of Image")
```
If we are taking image path and image URL as kwargs, it's better to keep it consistent. There is no point in providing only img_size as an arg.
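If img_size moved to a kwarg as well, the CLI could funnel all such VLM-specific options through the unknown-args path; a sketch of that pattern (the exact infer.py flow is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)

# Anything not declared above (e.g. --img_size, --image_url) lands in `extra`
# and can be folded into a kwargs dict forwarded to the model APIs.
args, extra = parser.parse_known_args(
    ["--model_name", "llava", "--img_size", "336", "--image_url", "https://example.com/cat.png"]
)
kwargs = {key.lstrip("-"): value for key, value in zip(extra[::2], extra[1::2])}
print(kwargs)  # {'img_size': '336', 'image_url': 'https://example.com/cat.png'}
```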
Force-pushed from bc60d47 to 76e863a
… validation page (quic#303) Signed-off-by: Abukhoyer Shaik <[email protected]>
This is just a small fix for printing the `QEFFAutoModelForCausalLM` instance, done by changing the `__repr__(self)` method. Signed-off-by: Abukhoyer Shaik <[email protected]>
Force-pushed from 76e863a to 8d99a93
Force-pushed from 3165896 to 1608804
Closing this PR as this feature will be merged into mainline.
Added support for enabling VLMs via CLI.
Sample command:
```
python -m QEfficient.cloud.infer --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --batch_size 1 --prompt_len 32 --ctx_len 512 --num_cores 16 --device_group [0] --prompt "Describe the image?" --mos 1 --allocator_dealloc_delay 1 --image_url https://i.etsystatic.com/8155076/r/il/0825c2/1594869823/il_fullxfull.1594869823_5x0w.jpg
```