QNN Compilation path Support in QEFFBaseModel class. #335

Merged
48 changes: 32 additions & 16 deletions QEfficient/base/modeling_qeff.py
@@ -98,7 +98,11 @@ def compile(self, *args, **kwargs) -> Path:
:num_cores (int): Number of cores to utilize in each device ``Defaults to 16``.
:mxfp6_matmul (bool): Use MXFP6 to compress weights for MatMul nodes to run faster on device. ``Defaults to False``.
:mxint8_kv_cache (bool): Use MXINT8 to compress KV-cache on device to access and update KV-cache faster. ``Defaults to False``.
:compiler_options: Pass any compiler option as input.
The following flags can be passed in ``compiler_options`` to enable the QNN compilation path:
:enable_qnn (bool): Enables QNN compilation. ``Defaults to False`` if not passed.
:qnn_config (str): Path of the QNN config parameters file. ``Defaults to None`` if not passed.
For the QAIC compilation path, any flag that is supported by ``qaic-exec`` can be passed. Params are converted to flags as below:
- aic_num_cores=16 -> -aic-num-cores=16
- convert_to_fp16=True -> -convert-to-fp16
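The parameter-to-flag conversion described above can be sketched in isolation. This is an illustrative stand-in, not QEfficient's actual implementation; the function name is hypothetical:

```python
# Hypothetical sketch of how compiler kwargs could map to qaic-exec style
# flags; the conversion rules mirror the docstring examples above.
def kwargs_to_flags(**compiler_options) -> list:
    flags = []
    for key, value in compiler_options.items():
        flag = "-" + key.replace("_", "-")
        if isinstance(value, bool):
            # convert_to_fp16=True -> -convert-to-fp16 (flag only when True)
            if value:
                flags.append(flag)
        else:
            # aic_num_cores=16 -> -aic-num-cores=16
            flags.append(f"{flag}={value}")
    return flags

print(kwargs_to_flags(aic_num_cores=16, convert_to_fp16=True))
# -> ['-aic-num-cores=16', '-convert-to-fp16']
```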

@@ -217,10 +221,13 @@ def _compile(
onnx_path: Optional[str] = None,
compile_dir: Optional[str] = None,
*,
mxint8_kv_cache: bool = False,
specializations: Optional[List[Dict[str, int]]] = None,
custom_io: Optional[Dict[str, str]] = None,
mdp_ts_num_devices: int = 1,
num_speculative_tokens: Optional[int] = None,
enable_qnn: Optional[bool] = False,
qnn_config: Optional[str] = None,
**compiler_options,
) -> str:
"""
@@ -229,14 +236,31 @@ def _compile(
Args:
:onnx_path (str): Onnx file to compile
:compile_dir (str): Directory path to compile the qpc. A suffix is added to the directory path to avoid reusing same qpc for different parameters.
:mxint8_kv_cache (bool, optional): Whether to use ``mxint8`` compression for KV cache. ``Defaults to False``.
:specializations (list): List of specializations to compile for
:custom_io (dict): Custom IO to specify the input and outputs in different formats than default
:mdp_ts_num_devices (int): Number of devices to partition to use Multi-Device Partitioning with tensor-slicing.
:num_speculative_tokens (int, optional): Number of speculative tokens to take as input for Speculative Decoding Target Language Model.
:enable_qnn (bool): Enables QNN Compilation. ``Defaults to False.``
:qnn_config (str): Path of QNN Config parameters file. ``Defaults to None.``
:compiler_options: Pass any compiler option as input. Any flag that is supported by `qaic-exec` can be passed. Params are converted to flags as below:
- aic_num_cores=16 -> -aic-num-cores=16
- convert_to_fp16=True -> -convert-to-fp16

"""
if enable_qnn:
Contributor

I suggest having two separate Python files, qnn_compiler.py and qaic_compiler.py, each holding the compilation code for its respective SDK. The _compile method would then route to the right compiler based on the passed parameters. @quic-rishinr

Contributor

Agree. Let's take this up as a separate change after merging this PR.

Contributor Author

@quic-amitraj
Moving the _qnn_compile functionality into qnn_compile will be taken up as a separate change, for the following reason:
The qnn_compile function in qnn_compiler.py is also used by the CLI compilation commands, which dump QPCs into a folder named qpc_base_path/qpcs, whereas QPCs compiled via the QEFFBaseModel class are dumped into a qpc folder named with a config hash. Moving the config-hash creation into qnn_compile would force the CLI compilation to follow the same directory structure and would break the CLI unit tests, where the QPC folder to check is hard-coded as "qpcs".

return self._qnn_compile(
onnx_path,
compile_dir,
specializations=specializations,
custom_io=custom_io,
mdp_ts_num_devices=mdp_ts_num_devices,
num_cores=compiler_options.get("aic_num_cores", 16),
mxfp6_matmul=compiler_options.get("mxfp6_matmul", False),
mxint8_kv_cache=mxint8_kv_cache,
qnn_config=qnn_config,
)

if onnx_path is None and self.onnx_path is None:
self.export()
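The enable_qnn branch above is a simple dispatch between two compilation backends. A minimal standalone sketch of the pattern, with placeholder backend functions that are not part of the QEfficient API:

```python
# Illustrative dispatch between two compilation paths, mirroring the
# enable_qnn routing in _compile. Backend functions are placeholders.
from typing import Optional

def compile_model(onnx_path: str, enable_qnn: bool = False,
                  qnn_config: Optional[str] = None) -> str:
    if enable_qnn:
        return _qnn_backend(onnx_path, qnn_config)
    return _qaic_backend(onnx_path)

def _qnn_backend(onnx_path: str, qnn_config: Optional[str]) -> str:
    # Stand-in for the QNN compilation path.
    return f"qnn-qpc({onnx_path}, config={qnn_config})"

def _qaic_backend(onnx_path: str) -> str:
    # Stand-in for the default qaic-exec compilation path.
    return f"qaic-qpc({onnx_path})"
```

Splitting the two backends into separate modules, as suggested in the review comments above, would keep this routing logic thin.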

@@ -346,35 +370,27 @@ def _qnn_compile(
onnx_path: Optional[str] = None,
compile_dir: Optional[str] = None,
*,
custom_io: Optional[Dict[str, str]] = None,
Contributor

@quic-amitraj Apr 21, 2025

After addressing the above comment, there will be no need for this method to be here. All pre-compilation processing should be done inside its specific file; this keeps the code clean and easier to understand.

specializations: Optional[List[Dict[str, int]]] = None,
prefill_seq_len: int = 32,
ctx_len: int = 128,
batch_size: int = 1,
full_batch_size: Optional[int] = None,
mdp_ts_num_devices: int = 1,
num_cores: int = 16,
mxfp6_matmul: bool = False,
mxint8_kv_cache: bool = False,
qnn_config: Optional[str] = None,
kv_cache_batch_size: Optional[int] = None,
) -> str:
"""
Interface for QNN compiler

Args:
:onnx_path (str): Onnx file to compile
:compile_dir (str): Directory path to compile the qpc. A suffix is added to the directory path to avoid reusing same qpc for different parameters.
:custom_io (dict): Custom IO to specify the input and outputs in different formats than default
:specializations (list): List of specializations to compile for
:prefill_seq_len (int, optional): The length of the prefill prompt should be less than ``prefill_seq_len``. ``Defaults to 32``.
:ctx_len (int, optional): Maximum ``ctx`` that the compiled model can remember. ``Defaults to 128``.
:batch_size (int, optional): Batch size. ``Defaults to 1``.
:full_batch_size (int, optional): Continuous batching batch size.
:mdp_ts_num_devices (int): Number of devices to partition to use Multi-Device Partitioning with tensor-slicing.
:num_cores (int): Number of cores used to compile the model.
:mxfp6_matmul (bool, optional): Whether to use ``mxfp6`` compression for weights. ``Defaults to False``.
:mxint8_kv_cache (bool, optional): Whether to use ``mxint8`` compression for KV cache. ``Defaults to False``.
:qnn_config (str): Path of QNN Config parameters file. ``Defaults to None.``
:kv_cache_batch_size (int): kv_cache_batch_size for Prefix Caching. ``Defaults to None.``
"""
if onnx_path is None and self.onnx_path is None:
self.export()
@@ -390,6 +406,9 @@ def _qnn_compile(
if specializations is not None:
compile_hash.update(to_hashable(specializations))

if custom_io is not None:
compile_hash.update(to_hashable(custom_io))

if qnn_config is not None:
qnn_config_values = load_json(qnn_config)
compile_hash.update(to_hashable(qnn_config_values))
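The hashing above folds the specializations, custom IO, and QNN config into the hash that names the QPC directory, so different parameters never reuse the same QPC. A standalone sketch of the idea — `to_hashable` here is a stand-in for QEfficient's helper, not its actual implementation:

```python
import hashlib
import json

def to_hashable(obj) -> bytes:
    # Stand-in for QEfficient's to_hashable: a stable byte serialization,
    # so the same parameters always hash to the same value.
    return json.dumps(obj, sort_keys=True).encode("utf-8")

def make_compile_hash(specializations, custom_io, qnn_config_values=None) -> str:
    compile_hash = hashlib.sha256()
    if specializations is not None:
        compile_hash.update(to_hashable(specializations))
    if custom_io is not None:
        compile_hash.update(to_hashable(custom_io))
    if qnn_config_values is not None:
        compile_hash.update(to_hashable(qnn_config_values))
    # A prefix of the digest serves as the QPC directory suffix.
    return compile_hash.hexdigest()[:16]
```

Any change to specializations, custom IO, or the QNN config yields a different suffix, which is why adding `custom_io` to the hash (as this diff does) is needed for correct QPC reuse.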
@@ -426,15 +445,12 @@ def _qnn_compile(
qpc_base_path=compile_dir,
num_cores=num_cores,
device_group=list(range(mdp_ts_num_devices)),
batch_size=batch_size,
prompt_len=prefill_seq_len,
ctx_len=ctx_len,
mxfp6=mxfp6_matmul,
mxint8=mxint8_kv_cache,
full_batch_size=full_batch_size,
qnn_config=qnn_config,
qnn_binary_dir=qpc_path,
kv_cache_batch_size=kv_cache_batch_size,
specializations=specializations,
custom_io=custom_io,
)

self.qpc_path = qpc_path
29 changes: 14 additions & 15 deletions QEfficient/compile/compile_helper.py
@@ -13,6 +13,7 @@
from typing import List, Optional, Tuple

from QEfficient.compile.qnn_compiler import compile as qnn_compile
from QEfficient.utils._utils import load_json, load_yaml
from QEfficient.utils.logging_utils import logger


@@ -180,36 +181,34 @@ def compile(
full_batch_size=full_batch_size,
)

# Select the customIO config based on the mx flag.
custom_io_file_name = "custom_io_int8.yaml" if mxint8 else "custom_io_fp16.yaml"

if custom_io_file_path is None:
custom_io_file_path = os.path.join(os.path.dirname(onnx_path), custom_io_file_name)

if not os.path.isfile(custom_io_file_path):
raise FileNotFoundError(
f"Custom IO file {custom_io_file_name} is not present at the expected path {custom_io_file_path}. Please pass the correct file path or rerun infer/export API"
)

if enable_qnn:
qpc_path = qnn_compile(
onnx_path=onnx_path,
qpc_base_path=qpc_path,
num_cores=num_cores,
batch_size=batch_size,
prompt_len=prompt_len,
ctx_len=ctx_len,
mxfp6=mxfp6,
mxint8=mxint8,
allow_mxint8_mdp_io=allow_mxint8_mdp_io,
aic_enable_depth_first=aic_enable_depth_first,
mos=mos,
device_group=device_group,
full_batch_size=full_batch_size,
qnn_config=qnn_config,
specializations=(load_json(specialization_json_path))["specializations"],
custom_io=load_yaml(custom_io_file_path),
)
logger.info(f"QNN Compiled QPC files can be found here: {qpc_path}")
else:
# Select the customIO config based on the mx flag.
custom_io_file_name = "custom_io_int8.yaml" if mxint8 else "custom_io_fp16.yaml"

if custom_io_file_path is None:
custom_io_file_path = os.path.join(os.path.dirname(onnx_path), custom_io_file_name)

if not os.path.isfile(custom_io_file_path):
raise FileNotFoundError(
f"Custom IO file {custom_io_file_name} is not present at the expected path {custom_io_file_path}. Please pass the correct file path or rerun infer/export API"
)

_, qpc_path = compile_kv_model_on_cloud_ai_100(
onnx_path=onnx_path,
specializations_json=specialization_json_path,
50 changes: 19 additions & 31 deletions QEfficient/compile/qnn_compiler.py
@@ -7,11 +7,14 @@

import os
import shutil
from typing import List, Optional
from typing import Dict, List, Optional

from QEfficient.utils._utils import create_json, execute_command, load_json
from QEfficient.utils.constants import QnnConstants
from QEfficient.utils.generate_qnn_network_specialization_config import fetch_nodes_info, generate_data_format_config
from QEfficient.utils.generate_qnn_network_specialization_config import (
generate_data_format_config,
generate_qnn_specialization,
)
from QEfficient.utils.logging_utils import logger


@@ -31,15 +34,13 @@ def __init__(
device_group: Optional[List[int]] = None,
compiler_enable_depth_first: bool = False,
compiler_max_out_channel_split: int = -1,
batch_size: int = 1,
prompt_len: int = 32,
ctx_len: int = 128,
compiler_mxfp6_matmul_weights: bool = True,
qnn_target: str = QnnConstants.TARGET,
qnn_config_path: Optional[str] = None,
qnn_binary_dir: Optional[str] = None,
mxint8: Optional[bool] = False,
compiler_mxint8_mdp_io: Optional[bool] = False,
prefill_only: Optional[bool] = False,
**kwargs,
) -> None:
self.onnx_path = onnx_path
@@ -48,9 +49,6 @@ def __init__(
self.device_group = device_group
self.compiler_enable_depth_first = compiler_enable_depth_first
self.compiler_max_out_channel_split = compiler_max_out_channel_split
self.batch_size = batch_size
self.prompt_len = prompt_len
self.ctx_len = ctx_len
self.compiler_mxfp6_matmul_weights = compiler_mxfp6_matmul_weights
self.qnn_config_path = qnn_config_path
self.qnn_binary_dir = qnn_binary_dir
@@ -59,6 +57,7 @@ def __init__(
self.custom_io_path = custom_io_path
self.dlc_model_path = os.path.join(qpc_base_path, f"{QnnConstants.MODEL_NAME}.dlc")
self.qnn_target = qnn_target
self.prefill_only = prefill_only
self.qnn_sdk_path = os.getenv(QnnConstants.QNN_SDK_PATH_ENV_VAR_NAME)
if not self.qnn_sdk_path:
raise EnvironmentError(
@@ -141,7 +140,7 @@ def create_qnn_compile_backend_json(self) -> str:
"compiler_hardware_version": QnnConstants.COMPILER_HARDWARE_VERSION,
"compiler_convert_to_FP16": QnnConstants.COMPILER_CONVERT_TO_FP16,
"compiler_retained_state": QnnConstants.COMPILER_RETAINED_STATE,
"graph_names": QnnConstants.GRAPH_NAMES,
"graph_names": QnnConstants.GRAPH_NAMES_PREFILL_ONLY if self.prefill_only else QnnConstants.GRAPH_NAMES,
"compiler_enable_depth_first": self.compiler_enable_depth_first,
"compiler_mxfp6_matmul_weights": self.compiler_mxfp6_matmul_weights,
"compiler_num_of_cores": self.num_cores,
@@ -327,16 +326,13 @@ def compile(
device_group: Optional[List[int]] = None,
aic_enable_depth_first: bool = False,
mos: int = -1,
batch_size: int = 1,
prompt_len: int = 32,
ctx_len: int = 128,
mxfp6: bool = True,
mxint8: bool = False,
allow_mxint8_mdp_io: Optional[bool] = False,
full_batch_size=None,
qnn_config: Optional[str] = None,
qnn_binary_dir: Optional[str] = None,
kv_cache_batch_size: Optional[int] = None,
custom_io: Optional[Dict[str, str]] = None,
specializations: Optional[List[Dict[str, int]]] = None,
**kwargs,
) -> str:
"""
@@ -352,16 +348,13 @@ def compile(
:device_group (List[int]): Used for finding the number of devices to compile for.
:aic_enable_depth_first (bool): Enables ``DFS`` with default memory size. ``Defaults to False.``
:mos (int): Effort level to reduce the on-chip memory. ``Defaults to -1.``
:batch_size (int): Batch size to compile the model for. ``Defaults to 1.``
:full_batch_size (int): Set full batch size to enable continuous batching mode. ``Default to None``
:prompt_len (int): Prompt length for the model to compile. ``Defaults to 32``
:ctx_len (int): Maximum context length to compile the model. ``Defaults to 128``
:mxfp6 (bool): Enable compilation for ``MXFP6`` precision. ``Defaults to True.``
:mxint8 (bool): Compress Present/Past KV to ``MXINT8`` using ``CustomIO`` config. ``Defaults to False.``
:allow_mxint8_mdp_io (bool): Allows MXINT8 compression of MDP IO traffic. ``Defaults to False.``
:qnn_config (str): Path to ``qnn_config.json`` file (formatted as a string). ``Defaults to None.``
:qnn_binary_dir (str): Path for saving qnn binaries.
:kv_cache_batch_size (int): kv_cache_batch_size for Prefix Caching. ``Defaults to None.``
:custom_io (dict): Custom IO to specify the input and outputs in different formats than default
:specializations (list): List of specializations to compile for

Returns:
:str: Path to compiled ``qpc`` package.
@@ -377,23 +370,20 @@
# TODO To make custom_io_config.yaml configurable as not all models need it.
custom_io_file_path = os.path.join(qpc_base_path, "custom_io_config.yaml")

kv_precision = "uint8" if mxint8 else "float16"
fetch_nodes_info(
generate_qnn_specialization(
onnx_graph_path=onnx_path,
batch_size=batch_size,
sequence_length=prompt_len,
context_length=ctx_len,
specializations=specializations,
custom_io=custom_io,
file_path=custom_io_file_path,
full_batch_size=full_batch_size,
kv_precision=kv_precision,
kv_cache_batch_size=kv_cache_batch_size,
)

if not os.path.isfile(custom_io_file_path):
raise FileNotFoundError(
f"file {custom_io_file_path} needs to exist in the qpc_base_path for Compilation. Please rerun infer/compile Api"
)

prefill_only = len(specializations) == 1

qnn_obj = QNN(
onnx_path=onnx_path,
qpc_base_path=qpc_base_path,
@@ -403,13 +393,11 @@
custom_io_path=custom_io_file_path,
compiler_enable_depth_first=aic_enable_depth_first,
compiler_max_out_channel_split=mos,
batch_size=batch_size,
prompt_len=prompt_len,
ctx_len=ctx_len,
compiler_mxfp6_matmul_weights=mxfp6,
qnn_binary_dir=qnn_binary_dir,
mxint8=mxint8,
compiler_mxint8_mdp_io=allow_mxint8_mdp_io,
prefill_only=prefill_only,
)

compiled_binary_path = qnn_obj.compile()
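The `prefill_only` flag above — a single specialization means only the prefill graph needs to be compiled — feeds the `graph_names` selection in `create_qnn_compile_backend_json`. A self-contained sketch of that decision, where the graph-name constants are illustrative assumptions rather than QEfficient's actual `QnnConstants` values:

```python
from typing import Dict, List

# Illustrative stand-ins for QnnConstants.GRAPH_NAMES and
# QnnConstants.GRAPH_NAMES_PREFILL_ONLY; actual values may differ.
GRAPH_NAMES = ["prefill", "decode"]
GRAPH_NAMES_PREFILL_ONLY = ["prefill"]

def select_graph_names(specializations: List[Dict[str, int]]) -> List[str]:
    # One specialization -> no separate decode graph, so compile prefill only.
    prefill_only = len(specializations) == 1
    return GRAPH_NAMES_PREFILL_ONLY if prefill_only else GRAPH_NAMES
```

With the usual two specializations (prefill and decode), both graphs are compiled; a single-specialization compile produces just the prefill graph.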
1 change: 1 addition & 0 deletions QEfficient/peft/auto.py
@@ -251,6 +251,7 @@ def compile(
custom_io=custom_io,
mdp_ts_num_devices=num_devices,
aic_num_cores=num_cores,
mxint8_kv_cache=mxint8_kv_cache,
**compiler_options,
)
