I just tested it, and when running qwen1.5b on the NPU, the inference speed is approximately 0.26 seconds per token. However, when running qwen1.5b purely on the CPU, the inference speed is 0.13 seconds per token. So, does this mean that NPU inference is actually slower than pure CPU inference?
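A possible way to make that comparison apples-to-apples is to exclude any one-time compilation/weight-packing cost and measure only steady-state decode speed. Below is a minimal timing sketch (plain torch/transformers, greedy decoding; the model path and prompt are placeholders, not from the original post) that reports seconds per generated token after a warm-up call. Running it once on the plain model and once on the compiled model should show whether the NPU path is really slower per token.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path -- point this at your own Qwen 1.5B copy.
model_id = "path/to/qwen2.5-1.5b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
# To time the NPU path instead, compile the model here first, e.g.:
# model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

inputs = tokenizer("Explain what an NPU is in one sentence.", return_tensors="pt")

with torch.no_grad():
    # Warm-up so any one-time setup cost is not counted in the measurement.
    model.generate(**inputs, max_new_tokens=8, do_sample=False)

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{elapsed / new_tokens:.3f} s per generated token ({new_tokens} tokens)")
```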
May I know how you solved the problem? Thanks! I cannot load it onto the NPU either.
Describe the bug
My CPU is an Ultra 7 258V, and the system is Windows 11 Home 24H2. I just tried running the Qwen2.5-7B-Instruct model for the first time using your example code. However, Task Manager shows that the model does not appear to be loaded into NPU memory (both NPU and GPU memory utilization stay at 0%); instead, it is loaded into system RAM. The subsequent inference is also very slow, roughly 1 second per token. My code and Task Manager screenshots are below. Is this the expected behavior, or does something in the example code need to be modified?
```python
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
from threading import Thread
import intel_npu_acceleration_library
import torch

model_id = "C:/all_project/all_llm_model/qwen2.5_7b_instruct/"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

print("Run inference")
query = input("user: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    max_new_tokens=1000,
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
)

_ = model.generate(**generation_kwargs)
```
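One low-tech sanity check here, under the assumption that compile() works by swapping the model's torch.nn.Linear layers for the library's own NPU-backed modules: after the compile() call, list which module classes the model actually contains. If everything is still a stock torch.nn class, the model is most likely still running entirely on the CPU, which would be consistent with the 0% NPU utilization in Task Manager. A minimal sketch:

```python
from collections import Counter

# After the compile() call above, list the module classes the model now contains.
# If everything is still a plain torch.nn module (Linear, Embedding, ...), the
# compile step has probably fallen back to CPU execution.
layer_types = Counter(type(m).__name__ for m in model.modules())
for name, count in layer_types.most_common(15):
    print(f"{count:5d}  {name}")
```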
Screenshots

Desktop (please complete the following information):