I just tested it, and when running qwen1.5b on the NPU, the inference speed is approximately 0.26 seconds per token. However, when running qwen1.5b purely on the CPU, the inference speed is 0.13 seconds per token. So, does this mean that NPU inference is actually slower than pure CPU inference?
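A possible way to make that comparison apples-to-apples is to exclude any one-time compilation/weight-packing cost and measure only steady-state decode speed. Below is a minimal timing sketch (plain torch/transformers, greedy decoding; the model path and prompt are placeholders, not from the original post) that reports seconds per generated token after a warm-up call. Running it once on the plain model and once on the compiled model should show whether the NPU path is really slower per token.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path -- point this at your own Qwen 1.5B copy.
model_id = "path/to/qwen2.5-1.5b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
# To time the NPU path instead, compile the model here first, e.g.:
# model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

inputs = tokenizer("Explain what an NPU is in one sentence.", return_tensors="pt")

with torch.no_grad():
    # Warm-up so any one-time setup cost is not counted in the measurement.
    model.generate(**inputs, max_new_tokens=8, do_sample=False)

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{elapsed / new_tokens:.3f} s per generated token ({new_tokens} tokens)")
```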
May I know how you solved the problem? Thanks! I cannot load it onto the NPU either.
Describe the bug
My CPU is an Ultra 7 258V, and the system is Windows 11 Home 24H2. I just tried running the Qwen2.5-7B-Instruct model for the first time using your example code. However, Task Manager shows that the model does not appear to be loaded into NPU memory (both NPU and GPU memory utilization stay at 0%); instead, it is loaded into system RAM. The subsequent inference is also very slow, roughly 1 second per token. My code and Task Manager screenshots are below. Is this the expected behavior, or does something in the example code need to be modified?
```python
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
from threading import Thread
import intel_npu_acceleration_library
import torch

model_id = "C:/all_project/all_llm_model/qwen2.5_7b_instruct/"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

print("Run inference")
query = input("user: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    max_new_tokens=1000,
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
)

_ = model.generate(**generation_kwargs)
```
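One low-tech sanity check here, under the assumption that compile() works by swapping the model's torch.nn.Linear layers for the library's own NPU-backed modules: after the compile() call, list which module classes the model actually contains. If everything is still a stock torch.nn class, the model is most likely still running entirely on the CPU, which would be consistent with the 0% NPU utilization in Task Manager. A minimal sketch:

```python
from collections import Counter

# After the compile() call above, list the module classes the model now contains.
# If everything is still a plain torch.nn module (Linear, Embedding, ...), the
# compile step has probably fallen back to CPU execution.
layer_types = Counter(type(m).__name__ for m in model.modules())
for name, count in layer_types.most_common(15):
    print(f"{count:5d}  {name}")
```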
Screenshots

Desktop (please complete the following information):