New SOTA model DeepSeek-R1-Qwen won't load #6679
Comments
Support for the DeepSeek-R1-Qwen tokenizer was only added to llama.cpp a few hours ago (ggerganov/llama.cpp@ec7f3ac). The llama.cpp side of things will need to be updated. |
error loading model: error loading model vocabulary: unknown pre-tokenizer type:
I installed all the latest packages and am still getting this error. I even installed that version of llama.cpp directly (ec7f3ac), but that doesn't fix text-generation-webui. I know I'm not doing something correctly here, but specific steps would help. Is it more than just installing that commit? How would I install it?
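For what it's worth: text-generation-webui loads GGUF files through the llama-cpp-python bindings rather than a standalone llama.cpp build, so installing that llama.cpp commit by itself has no effect. A sketch of how to inspect which bindings are actually installed, run from inside the webui's own environment (cmd_linux.sh / cmd_windows.bat):

```
# list the installed llama-cpp-python wheels and their pinned version
pip list | grep -i llama          # use `findstr` instead of `grep` on Windows
python -c "import llama_cpp; print(llama_cpp.__version__)"
```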
|
I am pretty sure that by default llama-cpp-python is installed from a pre-built wheel (as per https://github.com/oobabooga/text-generation-webui/blob/main/requirements.txt), so the wheel itself would need to be rebuilt (https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels) and the requirements updated. |
Alternatively, you can activate the installed conda environment and reinstall llama-cpp-python manually from https://pypi.org/project/llama-cpp-python/0.3.6/ with all the appropriate build flags. |
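A minimal sketch of that manual reinstall for a CUDA build, run from inside the webui's conda environment (the exact CMAKE_ARGS depend on your hardware):

```
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.3.6 --force-reinstall --no-cache-dir
```

Note that, as discussed further down, the 0.3.6 source release still bundles an older llama.cpp, so rebuilding it alone may not pick up the new pre-tokenizer.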
Launched a build action on my fork of https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). It will take a while to finish. Currently for CUDA only. I hope the maintainers notice the issue soon and update llama-cpp-python in text-generation-webui, but until then the output of my fork's action should be sufficient. |
I'm confused how they even create these GGUFs when llama.cpp hasn't been updated yet, since it holds the quantize tool.
|
Judging by the changes in the converter, I assume they simply add tokenizer_pre from the new model themselves and proceed with the conversion without any issues. |
Yeah, it's a simple fix and it's easy to recompile if you're running locally, but less so in a complex assembly of dependencies that text-generation-webui has. |
So it sounds like, qwen (pun intended) building llama-cpp-python, we need to link it to a llama.cpp build within the same folder as llama-cpp-python.
|
I am almost done building llama-cpp-python for CUDA and Tensorcores (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). So if your system is similar to mine and requires those specific packages to run llama.cpp inference in the UI, you can just go to your cloned repository and point the requirements at my fork's wheels:

```
sed -i 's#https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/#https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/#g' *.txt
```
|
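A sketch of how you might then apply the change, assuming the fork's release assets keep the same filenames (run inside the webui environment):

```
# re-resolve the requirements so pip picks up the fork's wheel URLs
# (this re-installs everything pinned in the file, which can take a while)
pip install -r requirements.txt --force-reinstall --no-cache-dir
```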
Unfortunately llama-cpp-python hasn't been receiving frequent updates, and easy-llama (a possible alternative) is not ready yet. To update llama-cpp-python manually, I use these commands:
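A sketch of one way to do that manual update, assuming a CUDA build from source (not necessarily the exact commands referred to above):

```
# inside the webui environment (cmd_linux.sh / cmd_windows.bat)
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --upgrade --no-cache-dir
```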
|
Someone filed the issue at the llama repository. |
Yeah, I've realized the wheel-building process does not actually use fresh source code; rather, each llama-cpp-python version corresponds to a specific llama.cpp version. My bad. |
It's possible to fork llama-cpp-python and change abetlen/llama-cpp-python to yourusername/llama-cpp-python in the workflows, but in this particular update, there have been updates to the llama.cpp internals that require updates in the Python bindings definitions. That's the difficulty with maintaining llama.cpp bindings -- the internals change all the time. |
As a possible workaround, you can always download the safetensors (non-quantized original) version of the distilled models using text-generation-webui, under the Models tab in the 'Download model or LoRA' section on the right/bottom. Then load the model with Transformers and use the Q8 or Q4 options if the model is too large for your graphics card. On my 4070 Ti it worked with Q8. It's a bit slower than other models of the same size, about 5 tokens per second. It's a very verbose model though, so you'd better give it something to reason about; it's not suitable for roleplay, for example, or for a regular chat conversation. |
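For reference, a rough command-line equivalent of that workaround; the model name and flags are assumptions based on recent text-generation-webui versions, and the UI checkboxes do the same thing:

```
# download the unquantized distilled model, then start the UI with the
# Transformers loader and on-the-fly 8-bit (or 4-bit) quantization
python download-model.py deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
python server.py --model deepseek-ai_DeepSeek-R1-Distill-Qwen-32B --loader transformers --load-in-8bit
```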
Kind of a sidebar comment, but is it correct that the EXL2 quants are working? I'm running EXL2 quants of the distills and they seem to be working perfectly with <think> </think> tags. There is, however, a small bug I've noticed with the "Start reply with" option. R1 seems to always begin its response with <think>, so if you put something in the "Start reply with" box, it overrides the model's initial <think> for some reason and then produces a buggy response. However, if you prefix it with <think>, then it injects correctly. E.g. Start reply with: "hello my name is zeus" vs. Start reply with: "<think> hello my name is zeus"; in the first case, R1 sometimes repeats itself randomly. |
still waiting on the llama-cpp update... |
I find that the EXL2 versions work without issue for me, if you need something in the meantime. |
I've installed llama-cpp-python from local source as suggested (Windows 11, CUDA 12.6), following these steps:
But now the library itself is failing (see logs); any idea?
|
It may work if you install llama-cpp-python from this repository instead: abetlen/llama-cpp-python#1901, i.e. https://github.com/JamePeng/llama-cpp-python/tree/main
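A sketch of one way to install from that fork with CUDA enabled (run inside the webui environment; if the build complains about a missing vendor/llama.cpp, clone the fork manually and clone llama.cpp into its vendor/ folder, as shown in a later comment):

```
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install git+https://github.com/JamePeng/llama-cpp-python.git --no-cache-dir
```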
Ollama released a version of this model. I'll make do with that until text-generation-webui updates.
|
Thanks, @oobabooga. It sort of worked but ended up failing later: it almost finished loading the model, but then caused a memory access violation. Anyway, it no longer loads any GGUF with that version (it always throws the memory error); I'm sharing the log for reference. For now, I'm successfully using the Transformers version, and it works fine. They don't get installed when building from source.
|
Tested this PR locally; tried running the llama-cpp-python OpenAI-compatible server and got a segmentation fault on launch. I think there are still some bugs to fix. |
I'm having the same issue |
@ljm625 :) Try again! I have fixed it! |
|
I'm on Linux and was getting linker errors. I tracked it down to the build using my system's compiler and linker instead of the conda ones. I first had to get a 12.1 version of the CUDA toolkit and matching compilers:

```
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit
conda install gcc_linux-64==11.2.0 gxx_linux-64==11.2.0
cd ./installer_files/env/bin
ln -s x86_64-conda-linux-gnu-g++ g++
ln -s x86_64-conda-linux-gnu-ld ld
ln -s x86_64-conda-linux-gnu-gcc gcc
```

Then the build worked for me:

```
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install . --no-cache-dir --verbose
```
|
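To confirm a rebuilt wheel actually has GPU support before loading a model, one quick check (assuming the bindings expose llama_supports_gpu_offload, as recent versions do):

```
python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"
```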
pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores

Does anyone know of a way to get this to work on a Windows-based machine? Does anyone know if an update to oobabooga is incoming? |
You have to activate the conda environment first. Run "conda env list", then "conda activate" followed by the conda environment from the text-generation-webui install. After that you can use the commands above. You also need git installed. You may need compilers installed as well, not sure.... |
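A sketch of those activation steps, assuming the default one-click-installer layout (the environment lives under installer_files/env):

```
conda env list
conda activate /path/to/text-generation-webui/installer_files/env
# or simply run the bundled helper script, which opens a shell in the right
# environment: cmd_windows.bat on Windows, cmd_linux.sh on Linux
```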
Hello @maddog7667, if you want to compile the CUDA version of llama_cpp_python in a Windows environment, some preparation is needed first:
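Roughly the following (a sketch assembled from the steps mentioned elsewhere in this thread; exact component versions and the package source are assumptions):

```
:: 1. Install the MSVC Build Tools (the "Desktop development with C++" workload).
:: 2. Install the NVIDIA CUDA Toolkit, including its Visual Studio integration.
:: 3. Open cmd_windows.bat from the text-generation-webui folder, then:
set "CMAKE_ARGS=-DGGML_CUDA=on"
set "FORCE_CMAKE=1"
pip install llama-cpp-python --upgrade --no-cache-dir --verbose
```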
|
I'm having a problem where the installation/compile process spawns hundreds of ninja threads, causing the OS to kill all processes. Is there a way to limit this? I searched and tried multiple flags, but without success. I'm unable to install because of that. |
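One thing that may help (a sketch; CMAKE_BUILD_PARALLEL_LEVEL is a standard CMake environment variable that caps the number of parallel build jobs):

```
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 CMAKE_BUILD_PARALLEL_LEVEL=4 pip install . --no-cache-dir --verbose
```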
Same. It kills my device when building.
|
Hi, thank you @JamePeng. I had the same problem on W11 and fixed it like this: installed the MSVC build tools, opened cmd_windows.bat, and ran:
EDIT: Celebrated too soon :-( The model now loads, but it uses the CPU instead of the GPU... |
abetlen has adapted llama-cpp-python to the new version of llama.cpp, which is good. Maybe you can try again!! |
Same problem here. The build doesn't seem to see the "-DGGML_CUDA=on" flag, since in the console I don't see the CUDA and tensorcores build being launched, nor any errors about something being missing. |
@cerega66 Are there any logs or errors during the compilation process with the "-DGGML_CUDA=on" flag ? |
No. llama-cpp-python builds fine, but I can only use the CPU. I looked through the entire log and found no mention of CUDA or tensorcores. This is the short log:
I can compile with --verbose if needed. |
@cerega66 Or you can try compiling the Vulkan version with CMAKE_ARGS="-DGGML_VULKAN=on". |
@JamePeng Thanks for your help. I found the problem. I replaced the lines:
With the line:
And now I get an error about CUDA being missing. But since I use the version that comes with my environment, I run cmd_windows.bat. I managed to install CUDA via:
But now it complains about the absence of VS 15 2017:
I have MSVS on my PC, but I don't know how to point conda to it, or how to install my own copy inside conda. |
@cerega66 Thank you! I totally forgot that setting the variable separately, outside the batch script, does not work.... Now I was successful with this (note: the build takes very, very long...): installed the MSVC build tools and CUDA toolkit, opened cmd_windows.bat, and ran:
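Something along these lines, with the variables set in the same cmd_windows.bat session that runs pip (the package source is an assumption):

```
set "CMAKE_ARGS=-DGGML_CUDA=on"
set "FORCE_CMAKE=1"
pip install git+https://github.com/JamePeng/llama-cpp-python.git --no-cache-dir --verbose
```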
And now it's finally working through GPU!
|
I was able to get my model to load successfully (thanks for the work, all who are contributing), but attempting inference still throws faults. On return self.model.tokenize(string) I get: TypeError: Llama.tokenize() missing 1 required positional argument: 'text' |
Describe the bug
Hi, I tried running the new DeepSeek model but get the following errors. I'm not sure if this requires added support for the pre-tokenizer?
Model location: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF
Is there an existing issue for this?
Reproduction
Screenshot
No response
Logs
System Info