New SOTA model DeepSeek-R1-Qwen won't load #6679

Open
1 task done
Alchete opened this issue Jan 20, 2025 · 42 comments
Labels
bug Something isn't working

Comments

@Alchete

Alchete commented Jan 20, 2025

Describe the bug

Hi, I tried running the new DeepSeek model but I get the following errors. I'm not sure whether this needs added support for the new pre-tokenizer.

Model location: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

  1. Download model file
  2. Load it on the "Models" tab

Screenshot

No response

Logs

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen'
llama_model_load_from_file: failed to load model
12:25:23-174171 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "E:\StableDiffusion\text-generation-webui-2.3\modules\ui_model_menu.py", line 214, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\StableDiffusion\text-generation-webui-2.3\modules\models.py", line 90, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\StableDiffusion\text-generation-webui-2.3\modules\models.py", line 280, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\StableDiffusion\text-generation-webui-2.3\modules\llamacpp_model.py", line 111, in from_pretrained
    result.model = Llama(**params)
                   ^^^^^^^^^^^^^^^
  File "E:\StableDiffusion\text-generation-webui-2.3\installer_files\env\Lib\site-packages\llama_cpp_cuda\llama.py", line 369, in __init__
    internals.LlamaModel(
  File "E:\StableDiffusion\text-generation-webui-2.3\installer_files\env\Lib\site-packages\llama_cpp_cuda\_internals.py", line 56, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models\DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf

Exception ignored in: <function LlamaCppModel.__del__ at 0x0000020F31EA4F40>
Traceback (most recent call last):
  File "E:\StableDiffusion\text-generation-webui-2.3\modules\llamacpp_model.py", line 62, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'

System Info

Nvidia 4090
@Alchete Alchete added the bug Something isn't working label Jan 20, 2025
@hpnyaggerman

hpnyaggerman commented Jan 20, 2025

Support for Deepseek-R1-Qwen tokenizer was only added to llama.cpp a few hours ago (ggerganov/llama.cpp@ec7f3ac). Will need to update the llama.cpp side of things.
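For anyone checking whether their current install already has the fix, a minimal sketch that bypasses the webui and loads the GGUF with the bindings directly (run inside the webui's environment; the model path is just the one from the log above, adjust as needed):

python -c "from llama_cpp import Llama; Llama(model_path='models/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf', n_ctx=512)"

If the bundled llama.cpp predates that commit, this fails with the same 'unknown pre-tokenizer type' error; once it is new enough, the model loads.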

@thistleknot

error loading model: error loading model vocabulary: unknown pre-tokenizer type:

I installed all the latest packages and am still getting errors. I even specifically installed that version of llama.cpp directly (ec7f3ac), but that doesn't fix things for text-generation-webui.

I know I'm doing something wrong here, but specific steps would help.

Is it more than just installing that commit?

How I built it:

        cd build
        cmake .. -DGGML_CUDA=ON -DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DBUILD_SHARED_LIBS=ON
        cmake --build . --config Release

@hpnyaggerman

error loading model: error loading model vocabulary: unknown pre-tokenizer type:

I installed all the latest packages, and still getting errors. I even specifically installed that version of llama.cpp directly (ec7f3ac), but that doesn't resolve text-generation-webui.

I know I'm not doing something correct here, but specific steps would help.

Is it more than just installing that commit?

how I'd install

        cd build
        cmake .. -DGGML_CUDA=ON -DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DBUILD_SHARED_LIBS=ON
        cmake --build . --config Release

I am pretty sure that by default llama-cpp-python is installed from a pre-built wheel (as per https://github.com/oobabooga/text-generation-webui/blob/main/requirements.txt), so the wheels would need to be rebuilt (https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels) and the requirements updated.

@hpnyaggerman

Alternatively, you can activate the installed conda environment and reinstall llama-cpp-python manually from https://pypi.org/project/llama-cpp-python/0.3.6/ with all the appropriate build flags.
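For reference, a minimal sketch of that manual reinstall from inside the activated environment (assuming a Linux shell and a working CUDA toolchain; 0.3.6 is simply the version the link above points to, and it only helps once its bundled llama.cpp includes the new pre-tokenizer):

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python==0.3.6 --no-cache-dir --force-reinstall --verbose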

@hpnyaggerman

hpnyaggerman commented Jan 21, 2025

Launched a build action on my fork of https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). It will take a while to finish; currently for CUDA only. I hope the maintainers notice the issue soon and update llama-cpp-python in text-generation-webui, but until then the output of my fork's action should be sufficient.

@thistleknot

thistleknot commented Jan 21, 2025 via email

@Alkohole

"Im confused how they even create these ggufs without llama.cpp being even
updated yet as it holds quantize"

Judging by the changes in the converter, I assume they simply add tokenizer_pre from the new model themselves and proceed with the conversion without any issues.

@hpnyaggerman

"Im confused how they even create these ggufs without llama.cpp being even
updated yet as it holds quantize"

Judging by the changes in the converter, I assume they simply add tokenizer_pre from the new model themselves and proceed with the conversion without any issues.

Yeah, it's a simple fix and it's easy to recompile if you're running locally, but less so within the complex web of dependencies that text-generation-webui has.

@thistleknot

thistleknot commented Jan 21, 2025 via email

@hpnyaggerman

So it sounds like qwen (pun intended) building llama-cpp-python, we need to link to a llama.cpp build within the same folder as llama-cpp-python

I am almost done building llama-cpp-python for CUDA and tensorcores (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). So if your system is similar to mine and needs those specific packages to infer with llama.cpp in the UI, you can go to your cloned repository and run

sed -i 's/https:\/\/github\.com\/oobabooga\/llama\-cpp\-python\-cuBLAS\-wheels\//https:\/\/github\.com\/hpnyaggerman\/llama\-cpp\-python\-cuBLAS\-wheels\//g' *.txt

and then update the UI (you might have to delete the installer_files directory contents) until the problem is solved in the repository itself.

@thistleknot

Launched a build action on my fork of https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). Will take a while to finish. Currently for CUDA only. I hope the maintainers notice the issue soon and update llama-cpp-python in text-generation-webui, but until then the output of my forks action should be sufficient.

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen'
llama_model_load_from_file: failed to load model
06:59:15-873168 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "/home/user/text-generation-webui/modules/ui_model_menu.py", line 214, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/home/user/text-generation-webui/modules/models.py", line 90, in load_model
    output = load_func_map[loader](model_name)
  File "/home/user/text-generation-webui/modules/models.py", line 280, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "/home/user/text-generation-webui/modules/llamacpp_model.py", line 111, in from_pretrained
    result.model = Llama(**params)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/llama_cpp_cuda_tensorcores/llama.py", line 369, in __init__
    internals.LlamaModel(
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/llama_cpp_cuda_tensorcores/_internals.py", line 56, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models/Qwen/DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf

Exception ignored in: <function LlamaCppModel.__del__ at 0x7f4f4421dc60>
Traceback (most recent call last):
  File "/home/user/text-generation-webui/modules/llamacpp_model.py", line 62, in __del__
    del self.model
AttributeError: model

1004  sed -i 's/https:\/\/github\.com\/oobabooga\/llama\-cpp\-python\-cuBLAS\-wheels\//https:\/\/github\.com\/hpnyaggerman\/llama\-cpp\-python\-cuBLAS\-wheels\//g' *.txt
1005  pip install -r requirements.txt

@oobabooga
Owner

Unfortunately llama-cpp-python hasn't been receiving frequent updates, and easy-llama (a possible alternative) is not ready yet. To update llama-cpp-python manually, I use these commands:

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
CMAKE_ARGS="-DGGML_CUDA=on" pip install . --verbose

@Alchete
Author

Alchete commented Jan 22, 2025

Someone filed the issue at the llama repository.

@hpnyaggerman

Launched a build action on my fork of https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels (https://github.com/hpnyaggerman/llama-cpp-python-cuBLAS-wheels/actions). Will take a while to finish. Currently for CUDA only. I hope the maintainers notice the issue soon and update llama-cpp-python in text-generation-webui, but until then the output of my forks action should be sufficient.

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen'
llama_model_load_from_file: failed to load model
06:59:15-873168 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "/home/user/text-generation-webui/modules/ui_model_menu.py", line 214, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/home/user/text-generation-webui/modules/models.py", line 90, in load_model
    output = load_func_map[loader](model_name)
  File "/home/user/text-generation-webui/modules/models.py", line 280, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "/home/user/text-generation-webui/modules/llamacpp_model.py", line 111, in from_pretrained
    result.model = Llama(**params)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/llama_cpp_cuda_tensorcores/llama.py", line 369, in __init__
    internals.LlamaModel(
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/llama_cpp_cuda_tensorcores/_internals.py", line 56, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models/Qwen/DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf

Exception ignored in: <function LlamaCppModel.__del__ at 0x7f4f4421dc60>
Traceback (most recent call last):
  File "/home/user/text-generation-webui/modules/llamacpp_model.py", line 62, in __del__
    del self.model
AttributeError: model

1004  sed -i 's/https:\/\/github\.com\/oobabooga\/llama\-cpp\-python\-cuBLAS\-wheels\//https:\/\/github\.com\/hpnyaggerman\/llama\-cpp\-python\-cuBLAS\-wheels\//g' *.txt
1005  pip install -r requirements.txt

Yeah, I've realized the wheel-building process does not actually use fresh llama.cpp source; rather, each llama-cpp-python version is pinned to a specific llama.cpp version. My bad.

@oobabooga
Owner

It's possible to fork llama-cpp-python and change abetlen/llama-cpp-python to yourusername/llama-cpp-python in the workflows, but in this particular update, there have been updates to the llama.cpp internals that require updates in the Python bindings definitions. That's the difficulty with maintaining llama.cpp bindings -- the internals change all the time.

@imqqmi

imqqmi commented Jan 22, 2025

As a possible workaround, you can always download the safetensors (non-quantized original) version of the distilled models using text-generation-webui, under the Models tab, in the 'Download model or LoRA' section on the right/bottom.
I.e. paste this in the download input field:
deepseek-ai/DeepSeek-R1-Distill-Llama-8B:main

Then load the model with the Transformers loader, and use the Q8 or Q4 options if the model is too large for your graphics card.

On my 4070 Ti it worked with Q8. It's a bit slower than other models of the same size, about 5 tokens per second. It's a very verbose model though, so you'd better give it something to reason about; it's not suitable for roleplay, for example, or a regular chat conversation.
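For anyone who prefers the command line, the same download can be done with the repo's download-model.py script (a sketch, assuming it is run from the text-generation-webui root with its environment active):

python download-model.py deepseek-ai/DeepSeek-R1-Distill-Llama-8B --branch main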

@MushroomHunting

MushroomHunting commented Jan 23, 2025

Kind of a sidebar comment, but is it correct that the EXL2 quants are working? I'm running EXL2 quants of the distills and they seem to be working perfectly with <think> </think> tags.

There is, however, a small bug I've noticed with the "Start reply with" option. R1 seems to always begin its response with

<think> ...

so if you put something in the "Start reply with" box, it overrides the model's initial response for some reason and then produces a buggy reply. However, if you prefix it with <think>, then it injects correctly.

e.g.

Start reply with: "hello my name is zeus"
results in: "hello my name is zeus ... </think> ..."

Start reply with: "<think> hello my name is zeus"
results in: "<think> hello my name is zeus .... </think> ..."

In the first case, R1 sometimes repeats itself randomly.

@JonMike12341234

still waiting on the llama-cpp update...

@RSAStudioGames

I find that the EXL2 versions work without issue for me, if you need something in the meantime.

@vggrodrigues

I've installed llama-cpp-python from source as suggested (Windows 11, CUDA 12.6), following these steps:

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
CMAKE_ARGS="-DGGML_CUDA=on" pip install. --verbose

But now the lib itself is failing (see logs); any ideas?

12:42:54-819405 INFO     Loading "DeepSeek-R1-Distill-Qwen-14B-Q4_K_L.gguf"
12:42:55-130417 INFO     llama.cpp weights detected: "models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_L.gguf"
12:42:55-132662 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "repositories\llm\text-generation-webui\modules\ui_model_menu.py", line 214, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "repositories\llm\text-generation-webui\modules\models.py", line 90, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "repositories\llm\text-generation-webui\modules\models.py", line 280, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\v_vin\repositories\llm\text-generation-webui\modules\llamacpp_model.py", line 67, in from_pretrained
    Llama = llama_cpp_lib().Llama
            ^^^^^^^^^^^^^^^
  File "repositories\llm\text-generation-webui\modules\llama_cpp_python_hijack.py", line 46, in llama_cpp_lib
    return_lib = importlib.import_module(lib_name)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "repositories\llm\text-generation-webui\installer_files\env\Lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "repositories\llm\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp\__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "repositories\llm\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp\llama_cpp.py", line 1283, in <module>
    @ctypes_function("llama_rope_type", [llama_model_p_ctypes], ctypes.c_int)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "repositories\llm\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp\_ctypes_extensions.py", line 113, in decorator
    func = getattr(lib, name)
           ^^^^^^^^^^^^^^^^^^
  File "repositories\llm\text-generation-webui\installer_files\env\Lib\ctypes\__init__.py", line 389, in __getattr__
    func = self.__getitem__(name)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "repositories\llm\text-generation-webui\installer_files\env\Lib\ctypes\__init__.py", line 394, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: function 'llama_rope_type' not found

@oobabooga
Owner

It may work if you install llama-cpp-python from this repository instead:

abetlen/llama-cpp-python#1901

i.e. https://github.com/JamePeng/llama-cpp-python/tree/main

@thistleknot

thistleknot commented Jan 24, 2025 via email

@vggrodrigues

Thanks, @oobabooga. It sort of worked but ended up failing later: it almost finished loading the model, but then hit a memory access violation. With that version it no longer loads any GGUF (it always throws the memory error). I'm sharing the log for reference.

For now, I'm successfully using the Transformers version, and it works fine.
I'll revert llama-cpp-python and wait for the fixes.
By the way, a random question: what are the packages llama_cpp_python_cuda and llama_cpp_python_cuda_tensorcores? They don't get installed when building from source.

13:44:13-725701 ERROR    Failed to load the model.
  File "modules\ui_model_menu.py", line 214, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "modules\models.py", line 90, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "modules\models.py", line 280, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "modules\llamacpp_model.py", line 111, in from_pretrained
    result.model = Llama(**params)
                   ^^^^^^^^^^^^^^^
  File "installer_files\env\Lib\site-packages\llama_cpp\llama.py", line 447, in __init__
    self._n_vocab = self.n_vocab()
                    ^^^^^^^^^^^^^^
  File "installer_files\env\Lib\site-packages\llama_cpp\llama.py", line 2174, in n_vocab
    return self._model.n_vocab()
           ^^^^^^^^^^^^^^^^^^^^^
  File "installer_files\env\Lib\site-packages\llama_cpp\_internals.py", line 91, in n_vocab
    return llama_cpp.llama_vocab_n_tokens(self.vocab)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: exception: access violation reading 0x000000009C1FD990

Exception ignored in: <function Llama.__del__ at 0x000001B1A395BD80>
Traceback (most recent call last):
  File "installer_files\env\Lib\site-packages\llama_cpp\llama.py", line 2201, in __del__
    self.close()
  File "installer_files\env\Lib\site-packages\llama_cpp\llama.py", line 2198, in close
    self._stack.close()
  File "installer_files\env\Lib\contextlib.py", line 609, in close
    self.__exit__(None, None, None)
  File "installer_files\env\Lib\contextlib.py", line 601, in __exit__
    raise exc_details[1]
  File "installer_files\env\Lib\contextlib.py", line 586, in __exit__
    if cb(*exc_details):
       ^^^^^^^^^^^^^^^^
  File "installer_files\env\Lib\contextlib.py", line 360, in __exit__
    self.thing.close()
  File "installer_files\env\Lib\site-packages\llama_cpp\_internals.py", line 82, in close
    self._exit_stack.close()
  File "installer_files\env\Lib\contextlib.py", line 609, in close
    self.__exit__(None, None, None)
  File "installer_files\env\Lib\contextlib.py", line 601, in __exit__
    raise exc_details[1]
  File "installer_files\env\Lib\contextlib.py", line 586, in __exit__
    if cb(*exc_details):
       ^^^^^^^^^^^^^^^^
  File "installer_files\env\Lib\contextlib.py", line 469, in _exit_wrapper
    callback(*args, **kwds)
  File "installer_files\env\Lib\site-packages\llama_cpp\_internals.py", line 76, in free_model
    llama_cpp.llama_model_free(self.vocab)
OSError: exception: access violation reading 0xFFFFFFFF9C1FF410
Exception ignored in: <function LlamaCppModel.__del__ at 0x000001B1A399B560>
Traceback (most recent call last):
  File "modules\llamacpp_model.py", line 62, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'

@ljm625

ljm625 commented Jan 25, 2025

It may work if you install llama-cpp-python from this repository instead:

abetlen/llama-cpp-python#1901

ie https://github.com/JamePeng/llama-cpp-python/tree/main

Tested this PR locally; tried running the llama-cpp-python OpenAI-compatible server and got a segmentation fault on launch. I think there are still some bugs to be fixed.

@SGL647

SGL647 commented Jan 27, 2025

I'm having the same issue

@JamePeng

It may work if you install llama-cpp-python from this repository instead:
abetlen/llama-cpp-python#1901
ie https://github.com/JamePeng/llama-cpp-python/tree/main

Tested this PR locally; tried running the llama-cpp-python OpenAI-compatible server and got a segmentation fault on launch. I think there are still some bugs to be fixed.

@ljm625 :) Try again! I have fixed it!
https://github.com/JamePeng/llama-cpp-python/tree/main

@dominikbayerl

Successfully loaded DeepSeek-R1-Distill-Qwen-32B-GGUF using https://github.com/JamePeng/llama-cpp-python/tree/main

@n9Mtq4

n9Mtq4 commented Jan 28, 2025

I'm on Linux and was getting linker errors. I tracked it down to the build using my system's nvcc (12.6) while trying to link against CUDA 12.1 installed in the conda environment.

I first had to get a 12.1 version of nvcc and gcc < 12 installed in the environment.

conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit 
conda install gcc_linux-64==11.2.0 gxx_linux-64==11.2.0
cd ./installer_files/env/bin
ln -s x86_64-conda-linux-gnu-g++ g++
ln -s x86_64-conda-linux-gnu-ld ld
ln -s x86_64-conda-linux-gnu-gcc gcc

Then the build worked for me.

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install . --no-cache-dir --verbose

@maddog7667

maddog7667 commented Jan 28, 2025

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install . --no-cache-dir --verbose

Does anyone know of a way to get this to work on a Windows-based machine? Does anyone know if an update to oobabooga is incoming?

@hugodopradofernandes

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install . --no-cache-dir --verbose

Does anyone know of a way to get this to work on a Windows-based machine? Does anyone know if an update to oobabooga is incoming?

You have to activate the conda environment first. Run "conda env list", then run "conda activate" followed by the path of the text-generation-webui conda environment.

After that you can use the commands above. You also need git installed. You may need compilers installed as well; not sure...
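A minimal sketch of that activation step (assuming the default one-click install layout, where the environment lives under installer_files/env inside the text-generation-webui folder; on Windows, running cmd_windows.bat does the activation for you):

conda env list
conda activate /path/to/text-generation-webui/installer_files/env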

@JamePeng

Hello @maddog7667, if you want to compile the CUDA version of llama_cpp_python in a Windows environment, some preparation is needed first (a consolidated sketch follows the list):

  1. MSVC build tools with the C++ desktop development components (includes CMake): https://download.visualstudio.microsoft.com/download/pr/9e5046bb-ab15-4a45-9546-cbabed333482/e44275c738c3b146c1acbf6fadd059ff9567ce97113cc584886cdc6985bfe538/vs_BuildTools.exe
  2. CUDA toolkit and cuDNN: https://www.nvidia.com/content/cuda/cuda-toolkit.html
  3. git clone the project and open a PowerShell console in the llama_cpp_python folder for the next steps
  4. Enter $env:CMAKE_ARGS = "-DGGML_CUDA=on" in PowerShell to set the CMake environment variable
  5. Enter pip uninstall -y llama_cpp_python in PowerShell and confirm the old pip wheel has been removed
  6. With PowerShell still in the llama_cpp_python folder, enter pip install .
  7. Compilation may take a little while; just wait for it to complete
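Putting steps 3-6 together, a minimal PowerShell sketch (my consolidation, assuming the MSVC build tools and CUDA toolkit from steps 1-2 are already installed; the --recursive flag is an assumption on my part, to pull in the vendor/llama.cpp submodule):

git clone --recursive https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip uninstall -y llama_cpp_python
pip install . --verbose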

@hugodopradofernandes

I'm having a problem where the installation/compile process spawns hundreds of ninja threads, causing the OS to kill all processes. Is there a way to limit this? I searched and tried multiple flags, but without success. I'm unable to install because of that.

@thistleknot

thistleknot commented Jan 28, 2025 via email

@tnovak007

tnovak007 commented Jan 28, 2025

Hi,

thank you @JamePeng.

I had the same problem on W11 and fixed it like this:

Installed MSVC build tools and CUDA toolkit

Opened cmd_windows.bat and ran:

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
set CMAKE_ARGS = "-DGGML_CUDA=on"
set FORCE_CMAKE=1
pip install . --no-cache-dir --verbose

EDIT: Celebrated too soon :-( The model now loads, but it uses the CPU instead of the GPU...
Same issue here: abetlen/llama-cpp-python#1901

@JamePeng

abetlen has adapted llama-cpp-python to the new version of llama.cpp, which is good. Maybe you can try again!

@cerega66

Hi,

thank you @JamePeng.

I had the same problem, on W11, fixed it like this:

Installed MSVC build tools and CUDA toolkit

Opened cmd_windows.bat and run:

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/JamePeng/llama-cpp-python.git
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
set CMAKE_ARGS = "-DGGML_CUDA=on"
set FORCE_CMAKE=1
pip install . --no-cache-dir --verbose

EDIT: Celebrated too soon :-( Model is now loaded, but use CPU instead GPU... Same issue here: abetlen/llama-cpp-python#1901

Same problem here. The build doesn't seem to see the "-DGGML_CUDA=on" flag: in the console I don't see the CUDA or tensorcores build being launched, nor any errors about anything missing.
I tried building both with "git clone https://github.com/abetlen/llama-cpp-python" and with "git clone https://github.com/JamePeng/llama-cpp-python.git".

@JamePeng

@cerega66 Are there any logs or errors during the compilation process with the "-DGGML_CUDA=on" flag?

@cerega66

@cerega66 Are there any logs or errors during the compilation process with the "-DGGML_CUDA=on" flag ?

No. llama-cpp-python builds fine, but I can only use the CPU. I looked through the entire log and found no mention of CUDA or tensorcores.

This is a short log:

(R:\one-click-installers-main\text-generation-webui-2.0\installer_files\env) R:\one-click-installers-main\text-generation-webui-2.0>pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
Found existing installation: llama_cpp_python 0.3.6+cpuavx2
Uninstalling llama_cpp_python-0.3.6+cpuavx2:
  Successfully uninstalled llama_cpp_python-0.3.6+cpuavx2
Found existing installation: llama_cpp_python_cuda 0.3.6+cu121
Uninstalling llama_cpp_python_cuda-0.3.6+cu121:
  Successfully uninstalled llama_cpp_python_cuda-0.3.6+cu121
Found existing installation: llama_cpp_python_cuda_tensorcores 0.3.6+cu121
Uninstalling llama_cpp_python_cuda_tensorcores-0.3.6+cu121:
  Successfully uninstalled llama_cpp_python_cuda_tensorcores-0.3.6+cu121

(R:\one-click-installers-main\text-generation-webui-2.0\installer_files\env) R:\one-click-installers-main\text-generation-webui-2.0>cd llama-cpp-python

(R:\one-click-installers-main\text-generation-webui-2.0\installer_files\env) R:\one-click-installers-main\text-generation-webui-2.0\llama-cpp-python>set CMAKE_ARGS = "-DGGML_CUDA=on"

(R:\one-click-installers-main\text-generation-webui-2.0\installer_files\env) R:\one-click-installers-main\text-generation-webui-2.0\llama-cpp-python>set FORCE_CMAKE=1

(R:\one-click-installers-main\text-generation-webui-2.0\installer_files\env) R:\one-click-installers-main\text-generation-webui-2.0\llama-cpp-python>pip install .
Processing r:\one-click-installers-main\text-generation-webui-2.0\llama-cpp-python
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=4.5.0 in r:\one-click-installers-main\text-generation-webui-2.0\installer_files\env\lib\site-packages (from llama_cpp_python==0.3.7) (4.12.2)
Requirement already satisfied: numpy>=1.20.0 in r:\one-click-installers-main\text-generation-webui-2.0\installer_files\env\lib\site-packages (from llama_cpp_python==0.3.7) (1.26.4)
Requirement already satisfied: diskcache>=5.6.1 in r:\one-click-installers-main\text-generation-webui-2.0\installer_files\env\lib\site-packages (from llama_cpp_python==0.3.7) (5.6.3)
Requirement already satisfied: jinja2>=2.11.3 in r:\one-click-installers-main\text-generation-webui-2.0\installer_files\env\lib\site-packages (from llama_cpp_python==0.3.7) (3.1.5)
Requirement already satisfied: MarkupSafe>=2.0 in r:\one-click-installers-main\text-generation-webui-2.0\installer_files\env\lib\site-packages (from jinja2>=2.11.3->llama_cpp_python==0.3.7) (2.1.5)
Building wheels for collected packages: llama_cpp_python
  Building wheel for llama_cpp_python (pyproject.toml) ... done
  Created wheel for llama_cpp_python: filename=llama_cpp_python-0.3.7-cp311-cp311-win_amd64.whl size=3791897 sha256=596c50b7627a9c7cf75f01053b5567c69800b0e2f9c5cd775619d878c7ac7911
  Stored in directory: c:\users\cerega66\appdata\local\pip\cache\wheels\01\62\ca\c996c6379065a8c398e77702c542f9b9c9dcabc7326def63fc
Successfully built llama_cpp_python
Installing collected packages: llama_cpp_python
Successfully installed llama_cpp_python-0.3.7

(R:\one-click-installers-main\text-generation-webui-2.0\installer_files\env) R:\one-click-installers-main\text-generation-webui-2.0\llama-cpp-python>pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
Found existing installation: llama_cpp_python 0.3.7
Uninstalling llama_cpp_python-0.3.7:
  Successfully uninstalled llama_cpp_python-0.3.7
WARNING: Skipping llama_cpp_python_cuda as it is not installed.
WARNING: Skipping llama_cpp_python_cuda_tensorcores as it is not installed.

I can recompile with --verbose if that would help.

@JamePeng

@cerega66 Or you can try to compile the Vulkan version with CMAKE_ARGS="-DGGML_VULKAN=on".
You may need to install the Vulkan SDK: https://sdk.lunarg.com/sdk/download/1.4.304.0/windows/VulkanSDK-1.4.304.0-Installer.exe
At the moment, during the Spring Festival, there is no computer with an Nvidia card in my hometown; I only have an AMD card, so I can only compile the Vulkan version to use the GPU.

[Image attachment]

@cerega66

@JamePeng Thanks for your help. I found the problem. I replaced the lines:

set CMAKE_ARGS = "-DGGML_CUDA=on"
set FORCE_CMAKE=1
pip install . --no-cache-dir --verbose

With the line:

set FORCE_CMAKE=1 && set CMAKE_ARGS=-DGGML_CUDA=on && pip install . --no-cache-dir --verbose

And now I get an error about CUDA being missing. Since I use the version with the bundled environment, I run cmd_windows.bat. I managed to install CUDA via:

conda install cuda-toolkit

But now it complains about the absence of VS 15 2017:

  -- Building for: Visual Studio 15 2017 Win64
  CMake Error at CMakeLists.txt:3 (project):
    Generator

      Visual Studio 15 2017 Win64

    could not find any instance of Visual Studio.

I have MSVS on my PC, but I don't know how to point conda to it or how to install my own copy inside conda.

@tnovak007

tnovak007 commented Jan 29, 2025

@cerega66 Thank you! I totally forgot that setting the variable as a separate command like that doesn't work (see the quick illustration below)...
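My reading of why the first form fails (an assumption on my part, not something confirmed in this thread): cmd.exe treats everything before the first "=" as the variable name, so the spaces put the value into a variable literally named "CMAKE_ARGS " and the build never sees the flag.

rem the space before "=" makes the variable name "CMAKE_ARGS " (with a trailing space),
rem so the build never sees -DGGML_CUDA=on
set CMAKE_ARGS = "-DGGML_CUDA=on"
rem no spaces around "=" sets the variable the build actually reads
set CMAKE_ARGS=-DGGML_CUDA=on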

Now I was successful with this (note: the build takes very, very long...):

Installed MSVC build tools and CUDA toolkit

Opened cmd_windows.bat and ran:

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python/vendor
git clone https://github.com/ggerganov/llama.cpp
cd ..
set FORCE_CMAKE=1 && set CMAKE_ARGS=-DGGML_CUDA=on && pip install . --no-cache-dir --verbose
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\llava.dll
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\llava.lib
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\llama.dll
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\llama.lib
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml.dll
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml.lib
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml-cuda.dll
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml-cuda.lib
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml-cpu.dll
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml-cpu.lib
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml-base.dll
D:\A\CHAT\text-generation-webui-main\llama-cpp-python\llama_cpp\lib\ggml-base.lib

And now it's finally working through GPU!

load_tensors: layer   0 assigned to device CUDA0
load_tensors: layer   1 assigned to device CUDA0
load_tensors: layer   2 assigned to device CUDA0
load_tensors: layer   3 assigned to device CUDA0
load_tensors: layer   4 assigned to device CUDA0
load_tensors: layer   5 assigned to device CUDA0
load_tensors: layer   6 assigned to device CUDA0
load_tensors: layer   7 assigned to device CUDA0
load_tensors: layer   8 assigned to device CUDA0
load_tensors: layer   9 assigned to device CUDA0
load_tensors: layer  10 assigned to device CUDA0
load_tensors: layer  11 assigned to device CUDA0
load_tensors: layer  12 assigned to device CUDA0
load_tensors: layer  13 assigned to device CUDA0

@JousterL

I was able to get my model to load successfully (thanks for the work, everyone who is contributing), but attempting to run inference still throws errors.

TypeError: Llama.tokenize() missing 1 required positional argument: 'text'

Or, if I manually edit modules/llamacpp_model.py to replace, on line 122,

return self.model.tokenize(string)

with

return self.model.tokenize(text=string)

I get

TypeError: Llama.tokenize() missing 1 required positional argument: 'vocab'
