New SOTA model DeepSeek-R1-Qwen won't load #6679
Comments
Support for the DeepSeek-R1-Qwen tokenizer was only added to llama.cpp a few hours ago (ggerganov/llama.cpp@ec7f3ac). The llama.cpp side of things will need to be updated. |
error loading model: error loading model vocabulary: unknown pre-tokenizer type:
I installed all the latest packages and am still getting this error. I even installed that version of llama.cpp directly (ec7f3ac), but that doesn't fix text-generation-webui. I know I'm not doing something correctly here, but specific steps would help. Is it more than just installing that commit? How would I install it?
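For what it's worth: text-generation-webui loads GGUF files through the llama-cpp-python bindings rather than a standalone llama.cpp build, so installing that llama.cpp commit by itself has no effect. A sketch of how to inspect which bindings are actually installed, run from inside the webui's own environment (cmd_linux.sh / cmd_windows.bat):

```
# list the installed llama-cpp-python wheels and their pinned version
pip list | grep -i llama          # use `findstr` instead of `grep` on Windows
python -c "import llama_cpp; print(llama_cpp.__version__)"
```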
|
I am pretty sure that by default llama-cpp-python is installed from a pre-built wheel (as per https://github.com/oobabooga/text-generation-webui/blob/main/requirements.txt), so the wheel itself would need to be rebuilt (https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels) and the requirements updated. |
Alternatively, you can activate the installed conda environment and reinstall llama-cpp-python manually from https://pypi.org/project/llama-cpp-python/0.3.6/ with all the appropriate build flags. |
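A minimal sketch of that manual reinstall for a CUDA build, run from inside the webui's conda environment (the exact CMAKE_ARGS depend on your hardware):

```
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.3.6 --force-reinstall --no-cache-dir
```

Note that, as discussed further down, the 0.3.6 source release still bundles an older llama.cpp, so rebuilding it alone may not pick up the new pre-tokenizer.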
Launched a build action on my fork of https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). It will take a while to finish. Currently for CUDA only. I hope the maintainers notice the issue soon and update llama-cpp-python in text-generation-webui, but until then the output of my fork's action should be sufficient. |
I'm confused how they even create these GGUFs when llama.cpp hasn't been updated yet, since it holds the quantize tool.
|
Judging by the changes in the converter, I assume they simply add tokenizer_pre from the new model themselves and proceed with the conversion without any issues. |
Yeah, it's a simple fix and it's easy to recompile if you're running locally, but less so in a complex assembly of dependencies that text-generation-webui has. |
So it sounds like, qwen (pun intended) building llama-cpp-python, we need to link it to a llama.cpp build within the same folder as llama-cpp-python.
|
I am almost done building llama-cpp-python for CUDA and Tensorcores (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). So if your system is similar to mine and requires those specific packages to run llama.cpp inference in the UI, you can just go to your cloned repository and point the requirements at my fork's wheels:

```
sed -i 's#https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/#https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/#g' *.txt
```
|
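A sketch of how you might then apply the change, assuming the fork's release assets keep the same filenames (run inside the webui environment):

```
# re-resolve the requirements so pip picks up the fork's wheel URLs
# (this re-installs everything pinned in the file, which can take a while)
pip install -r requirements.txt --force-reinstall --no-cache-dir
```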
Unfortunately llama-cpp-python hasn't been receiving frequent updates, and easy-llama (a possible alternative) is not ready yet. To update llama-cpp-python manually, I use these commands:
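A sketch of one way to do that manual update, assuming a CUDA build from source (not necessarily the exact commands referred to above):

```
# inside the webui environment (cmd_linux.sh / cmd_windows.bat)
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --upgrade --no-cache-dir
```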
|
Someone filed the issue at the llama repository. |
Yeah, I've realized the wheel-building process does not actually use fresh source code; rather, each llama-cpp-python version corresponds to a specific llama.cpp version. My bad. |
It's possible to fork llama-cpp-python and change abetlen/llama-cpp-python to yourusername/llama-cpp-python in the workflows, but in this particular update, there have been updates to the llama.cpp internals that require updates in the Python bindings definitions. That's the difficulty with maintaining llama.cpp bindings -- the internals change all the time. |
As a possible workaround, you can always download the safetensors (non-quantized original) version of the distilled models using text-generation-webui, under the Models tab in the 'Download model or LoRA' section on the right/bottom. Then load the model with Transformers and use the Q8 or Q4 options if the model is too large for your graphics card. On my 4070 Ti it worked with Q8. It's a bit slower than other models of the same size, about 5 tokens per second. It's a very verbose model though, so you'd better give it something to reason about; it's not suitable for roleplay, for example, or for a regular chat conversation. |
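For reference, a rough command-line equivalent of that workaround; the model name and flags are assumptions based on recent text-generation-webui versions, and the UI checkboxes do the same thing:

```
# download the unquantized distilled model, then start the UI with the
# Transformers loader and on-the-fly 8-bit (or 4-bit) quantization
python download-model.py deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
python server.py --model deepseek-ai_DeepSeek-R1-Distill-Qwen-32B --loader transformers --load-in-8bit
```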
Kind of a sidebar comment, but is it correct that the EXL2 quants are working? I'm running EXL2 quants of the distills and they seem to be working perfectly with <think> </think> tags. There is, however, a small bug I've noticed with the "Start reply with" option. R1 seems to always begin its response with <think>, so if you put something in the "Start reply with" box, it overrides the model's initial <think> for some reason and then produces a buggy response. However, if you prefix it with <think>, then it injects correctly. E.g. Start reply with: "hello my name is zeus" vs. Start reply with: "<think> hello my name is zeus"; in the first case, R1 sometimes repeats itself randomly. |
still waiting on the llama-cpp update... |
I find that the EXL2 versions work without issue for me, if you need something in the meantime. |
I've installed llama-cpp-python from local source as suggested (Windows 11, CUDA 12.6), following these steps:
But now the library itself is failing (see logs); any idea?
|
It may work if you install llama-cpp-python from this repository instead: abetlen/llama-cpp-python#1901, i.e. https://github.com/JamePeng/llama-cpp-python/tree/main
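A sketch of one way to install from that fork with CUDA enabled (run inside the webui environment; if the build complains about a missing vendor/llama.cpp, clone the fork manually and clone llama.cpp into its vendor/ folder, as shown in a later comment):

```
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install git+https://github.com/JamePeng/llama-cpp-python.git --no-cache-dir
```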
Ollama released a version of this model. I'll make do with that until text-generation-webui updates.
|
Thanks, @oobabooga. It sort of worked but ended up failing later: it almost finished loading the model, but then caused a memory access violation. Anyway, it no longer loads any GGUF with that version (it always throws the memory error); I'm sharing the log for reference. For now, I'm successfully using the Transformers version, and it works fine. They don't get installed when building from source.
|
Tested this PR locally; tried running the llama-cpp-python OpenAI-compatible server and got a segmentation fault on launch. I think there are still some bugs to fix. |
I'm having the same issue |
@ljm625 :) Try again! I have fixed it! |
|
I'm on Linux and was getting linker errors. I tracked it down to the build using my system's compiler and linker instead of the conda ones. I first had to get a 12.1 version of the CUDA toolkit and matching compilers:

```
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit
conda install gcc_linux-64==11.2.0 gxx_linux-64==11.2.0
cd ./installer_files/env/bin
ln -s x86_64-conda-linux-gnu-g++ g++
ln -s x86_64-conda-linux-gnu-ld ld
ln -s x86_64-conda-linux-gnu-gcc gcc
```

Then the build worked for me:

```
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install . --no-cache-dir --verbose
```
|
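To confirm a rebuilt wheel actually has GPU support before loading a model, one quick check (assuming the bindings expose llama_supports_gpu_offload, as recent versions do):

```
python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"
```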
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores

Does anyone know of a way to get this to work on a Windows-based machine? Does anyone know if an update to oobabooga is incoming? |
You have to activate the conda environment first. Run "conda env list", then "conda activate" followed by the conda environment from the text-generation-webui install. After that you can use the commands above. You also need git installed. You may need compilers installed as well, not sure.... |
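A sketch of those activation steps, assuming the default one-click-installer layout (the environment lives under installer_files/env):

```
conda env list
conda activate /path/to/text-generation-webui/installer_files/env
# or simply run the bundled helper script, which opens a shell in the right
# environment: cmd_windows.bat on Windows, cmd_linux.sh on Linux
```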
Hello @maddog7667, if you want to compile the CUDA version of llama_cpp_python in a Windows environment, some preparation is needed first:
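Roughly the following (a sketch assembled from the steps mentioned elsewhere in this thread; exact component versions and the package source are assumptions):

```
:: 1. Install the MSVC Build Tools (the "Desktop development with C++" workload).
:: 2. Install the NVIDIA CUDA Toolkit, including its Visual Studio integration.
:: 3. Open cmd_windows.bat from the text-generation-webui folder, then:
set "CMAKE_ARGS=-DGGML_CUDA=on"
set "FORCE_CMAKE=1"
pip install llama-cpp-python --upgrade --no-cache-dir --verbose
```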
|
I'm having a problem where the installation/compile process spawns hundreds of ninja threads, causing the OS to kill all processes. Is there a way to limit this? I searched and tried multiple flags, but without success. I'm unable to install because of that. |
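One thing that may help (a sketch; CMAKE_BUILD_PARALLEL_LEVEL is a standard CMake environment variable that caps the number of parallel build jobs):

```
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 CMAKE_BUILD_PARALLEL_LEVEL=4 pip install . --no-cache-dir --verbose
```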
Same. It kills my device when building.
|
Hi, thank you @JamePeng. I had the same problem on W11 and fixed it like this: installed the MSVC build tools, opened cmd_windows.bat, and ran:
EDIT: Celebrated too soon :-( The model now loads, but it uses the CPU instead of the GPU... |
abetlen has adapted llama-cpp-python to the new version of llama.cpp, which is good. Maybe you can try again!! |
Same problem here. The build doesn't seem to see the "-DGGML_CUDA=on" flag, since in the console I don't see the CUDA and tensorcores build being launched, nor any errors about something being missing. |
@cerega66 Are there any logs or errors during the compilation process with the "-DGGML_CUDA=on" flag ? |
No. llama-cpp-python builds fine, but I can only use the CPU. I looked through the entire log and found no mention of CUDA or tensorcores. This is the short log:
I can compile with --verbose if needed. |
@cerega66 Or you can try compiling the Vulkan version with CMAKE_ARGS="-DGGML_VULKAN=on". |
@JamePeng Thanks for your help. I found the problem. I replaced the lines:
With the line:
And now I get an error about CUDA being missing. But since I use the version that comes with my environment, I run cmd_windows.bat. I managed to install CUDA via:
But now it complains about the absence of VS 15 2017:
I have MSVS on my PC, but I don't know how to point conda to it, or how to install my own copy inside conda. |
@cerega66 Thank you! I totally forgot that setting the variable separately, outside the batch script, does not work.... Now I was successful with this (note: the build takes very, very long...): installed the MSVC build tools and CUDA toolkit, opened cmd_windows.bat, and ran:
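Something along these lines, with the variables set in the same cmd_windows.bat session that runs pip (the package source is an assumption):

```
set "CMAKE_ARGS=-DGGML_CUDA=on"
set "FORCE_CMAKE=1"
pip install git+https://github.com/JamePeng/llama-cpp-python.git --no-cache-dir --verbose
```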
And now it's finally working through GPU!
|
I was able to get my model to load successfully (thanks for the work, all who are contributing), but attempting inference still throws faults. On return self.model.tokenize(string) I get: TypeError: Llama.tokenize() missing 1 required positional argument: 'text' |
Describe the bug
Hi, I tried running the new DeepSeek model but get the following errors. I'm not sure if this requires added support for the pre-tokenizer?
Model location: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF
Is there an existing issue for this?
Reproduction
Screenshot
No response
Logs
System Info