
It seems that the model is not loaded on NPU Memory #138

Open

BigYellowTiger opened this issue Nov 7, 2024 · 2 comments
Describe the bug
My CPU is an Ultra 7 258V and the system is Windows 11 Home 24H2. I just tried running the qwen2.5-7b-instruct model for the first time using your example code. However, the Task Manager shows that the model does not seem to be loaded into NPU memory (both NPU memory and GPU memory utilization stay at 0%); instead, it is loaded into RAM. The subsequent inference is also very slow, approximately 1 second per token. Below are my code and a Task Manager screenshot. Is this the expected behavior, or does something in the example code need to be modified?

```python
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
from threading import Thread
import intel_npu_acceleration_library
import torch

model_id = "C:/all_project/all_llm_model/qwen2.5_7b_instruct/"

# Load the local checkpoint and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
# Quantize to int8 and compile the model for the NPU
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

print("Run inference")

query = input("user: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    max_new_tokens=1000,
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
)
_ = model.generate(**generation_kwargs)
```
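
The roughly 1 second per token figure can be sanity-checked with a small timing helper. Below is a minimal sketch, assuming the model and tokenizer loaded in the snippet above; `seconds_per_token` is a hypothetical name, not part of the library or the original report.

```python
# Hypothetical timing helper (not from the original report): estimates seconds per
# generated token by timing model.generate() with greedy decoding.
import time

import torch


def seconds_per_token(model, tokenizer, prompt, max_new_tokens=64):
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(
            input_ids=input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for a repeatable measurement
        )
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - input_ids.shape[-1]
    return elapsed / max(new_tokens, 1)


print(f"{seconds_per_token(model, tokenizer, 'Hello, how are you?'):.2f} s/token")
```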

Screenshots

(Task Manager screenshot: NPU and GPU memory utilization at 0% while the model occupies system RAM)

Desktop (please complete the following information):

  • OS: Windows 11 Home 24H2
BigYellowTiger (Author) commented

I just tested it, and when running qwen1.5b on the NPU, the inference speed is approximately 0.26 seconds per token. However, when running qwen1.5b purely on the CPU, the inference speed is 0.13 seconds per token. So, does this mean that NPU inference is actually slower than pure CPU inference?
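
For a side-by-side check along these lines, one option is to time the same prompt on an uncompiled copy of the model and on the NPU-compiled copy. This is a minimal sketch, assuming the same local model path as in the issue body; all variable names here are illustrative and not from the original thread.

```python
import copy
import time

import intel_npu_acceleration_library
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "C:/all_project/all_llm_model/qwen2.5_7b_instruct/"  # path from the issue body

tokenizer = AutoTokenizer.from_pretrained(model_id)
cpu_model = AutoModelForCausalLM.from_pretrained(model_id).eval()
# Compile a deep copy for the NPU so the CPU baseline stays unquantized
npu_model = intel_npu_acceleration_library.compile(copy.deepcopy(cpu_model), dtype=torch.int8)

input_ids = tokenizer("Hello, how are you?", return_tensors="pt")["input_ids"]

for name, m in [("CPU", cpu_model), ("NPU", npu_model)]:
    start = time.perf_counter()
    with torch.no_grad():
        out = m.generate(input_ids=input_ids, max_new_tokens=32, do_sample=False)
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - input_ids.shape[-1]
    print(f"{name}: {elapsed / max(new_tokens, 1):.2f} s/token")
```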

RedZh commented Dec 3, 2024

> I just tested it, and when running qwen1.5b on the NPU, the inference speed is approximately 0.26 seconds per token. However, when running qwen1.5b purely on the CPU, the inference speed is 0.13 seconds per token. So, does this mean that NPU inference is actually slower than pure CPU inference?

May I ask how you solved the problem? Thanks! I cannot load the model onto the NPU either.
