
Commit be20e7f

Reorganize documentation pages (ggml-org#8325)
* re-organize docs
* add link among docs
* add link to build docs
* fix style
* de-duplicate sections
1 parent 7ed03b8 commit be20e7f

14 files changed: +626 −603 lines

.gitignore

+1
@@ -47,6 +47,7 @@ build*
 !build-info.cpp.in
 !build-info.sh
 !build.zig
+!docs/build.md
 /libllama.so
 /llama-*
 android-ndk-*

README.md

+69 −599
Large diffs are not rendered by default.

docs/android.md

+56
@@ -0,0 +1,56 @@
# Android

## Build on Android using Termux

[Termux](https://github.com/termux/termux-app#installation) lets you build and run `llama.cpp` on an Android device (no root required).

```
apt update && apt upgrade -y
apt install git make cmake
```

It's recommended to move your model into the `~/` directory for best performance:

```
cd storage/downloads
mv model.gguf ~/
```

[Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.
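For reference, a minimal sketch of those steps inside Termux, using the CMake flow from the build docs (extra flags omitted; see the linked instructions for the full set of options):

```
# inside Termux, in your home directory
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```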
## Building the Project using Android NDK

Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.

Execute the following commands on your computer to avoid downloading the NDK to your mobile device. Alternatively, you can also do this in Termux:

```
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make
```

Install [Termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (on Android 11+, run the command twice).

Finally, copy the built `llama` binaries and the model file to your device storage. Because file permissions on the Android sdcard cannot be changed, copy the executable files to the `/data/data/com.termux/files/home/bin` path and then run the following commands in Termux to add executable permissions. This assumes you have pushed the built executable files to the `/sdcard/llama.cpp/bin` path using `adb push`.
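For reference, a hypothetical `adb push` invocation, run from your computer and assuming the NDK build above left the executables in `build-android/bin`:

```
# on your computer, from the llama.cpp source directory
adb shell mkdir -p /sdcard/llama.cpp
adb push build-android/bin /sdcard/llama.cpp/
```

Then, in Termux: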
```
$ cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
$ cd /data/data/com.termux/files/home/bin
$ chmod +x ./*
```

Download the model [llama-2-7b-chat.Q4_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf), push it to `/sdcard/llama.cpp/`, then move it to `/data/data/com.termux/files/home/model/`:

```
$ mv /sdcard/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf /data/data/com.termux/files/home/model/
```

Now you can start chatting:

```
$ cd /data/data/com.termux/files/home/bin
$ ./llama-cli -m ../model/llama-2-7b-chat.Q4_K_M.gguf -n 128 -cml
```

Here's a demo of an interactive session running on a Pixel 5 phone:

https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
File renamed without changes.
File renamed without changes.

docs/build.md

+288
Large diffs are not rendered by default.

docs/HOWTO-add-model.md renamed to docs/development/HOWTO-add-model.md

+1 −1
@@ -1,4 +1,4 @@
-## Add a new model architecture to `llama.cpp`
+# Add a new model architecture to `llama.cpp`

 Adding a model requires few steps:
File renamed without changes.

docs/docker.md

+86
@@ -0,0 +1,86 @@
# Docker

## Prerequisites

* Docker must be installed and running on your system.
* Create a folder to store big models & intermediate files (e.g. /llama/models).

## Images

We have three Docker images available for this project:

1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and quantize them to 4 bits. (platforms: `linux/amd64`, `linux/arm64`)
2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
3. `ghcr.io/ggerganov/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)

Additionally, the following images are available, similar to the above:

- `ghcr.io/ggerganov/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggerganov/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggerganov/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
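Any of these images can be pulled ahead of time if you prefer (`docker run` will also pull them on demand), for example:

```bash
docker pull ghcr.io/ggerganov/llama.cpp:light
```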
The GPU-enabled images are not currently tested by CI beyond being built. They are built exactly as defined by the Dockerfiles in [.devops/](.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](.github/workflows/docker.yml). If you need different settings (for example, a different CUDA or ROCm library), you'll need to build the images locally for now.
## Usage

The easiest way to download the models, convert them to ggml and optimize them is with the `--all-in-one` command, which uses the full Docker image.

Replace `/path/to/models` below with the actual path where you downloaded the models.

```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
```

On completion, you are ready to play!

```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

or with a light image:

```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

or with a server image:

```bash
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
```

## Docker With CUDA

Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU-enabled cloud, `cuBLAS` should be accessible inside the container.
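A quick way to check that the toolkit is working is to run `nvidia-smi` in a throwaway container; once the toolkit is configured, the host driver utilities are made available inside the container:

```bash
# should print the same GPU table you see on the host
docker run --rm --gpus all ubuntu nvidia-smi
```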
## Building Docker locally

```bash
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
docker build -t local/llama.cpp:light-cuda -f .devops/llama-cli-cuda.Dockerfile .
docker build -t local/llama.cpp:server-cuda -f .devops/llama-server-cuda.Dockerfile .
```

You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host and the GPU architecture; an example override is shown after the list of defaults.

The defaults are:

- `CUDA_VERSION` set to `11.7.1`
- `CUDA_DOCKER_ARCH` set to `all`
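For example, a hypothetical override of both defaults when building the light image locally (the version and architecture values are only illustrative; pick ones that match an available `nvidia/cuda` base image and your GPU):

```bash
docker build -t local/llama.cpp:light-cuda \
  --build-arg CUDA_VERSION=12.2.0 \
  --build-arg CUDA_DOCKER_ARCH=compute_86 \
  -f .devops/llama-cli-cuda.Dockerfile .
```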
The resulting images are essentially the same as the non-CUDA images:

1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and quantize them to 4 bits.
2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.

## Usage (with CUDA)

After building locally, usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.

```bash
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
```

docs/install.md

+39
@@ -0,0 +1,39 @@
# Install pre-built version of llama.cpp

## Homebrew

On Mac and Linux, the Homebrew package manager can be used via

```sh
brew install llama.cpp
```

The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668

## Nix

On Mac and Linux, the Nix package manager can be used.

For flake-enabled installs:

```sh
nix profile install nixpkgs#llama-cpp
```

For non-flake-enabled installs:

```sh
nix-env --file '<nixpkgs>' --install --attr llama-cpp
```

This expression is automatically updated within the [nixpkgs repo](https://github.com/NixOS/nixpkgs/blob/nixos-24.05/pkgs/by-name/ll/llama-cpp/package.nix#L164).

## Flox

On Mac and Linux, Flox can be used to install llama.cpp within a Flox environment via

```sh
flox install llama-cpp
```

Flox follows the nixpkgs build of llama.cpp.
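Whichever method you use, the `llama-cli` and `llama-server` binaries should end up on your `PATH`. A quick sanity check (assuming the packaged build exposes the `--version` flag):

```sh
llama-cli --version
```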

examples/quantize/README.md

+86 −3
@@ -4,7 +4,89 @@ You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-
Note: It is synced from llama.cpp `main` every 6 hours.

-## Llama 2 7B
+Example usage:

```bash
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model

# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json

# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert_hf_to_gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```

Run the quantized model:

```bash
# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
```

When running the larger models, make sure you have enough disk space to store all the intermediate files.

## Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

| Model | Original size | Quantized size (Q4_0) |
|------:|--------------:|----------------------:|
| 7B | 13 GB | 3.9 GB |
| 13B | 24 GB | 7.8 GB |
| 30B | 60 GB | 19.5 GB |
| 65B | 120 GB | 38.5 GB |

## Quantization

Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

*(outdated)*

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
| 7B | perplexity | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
| 7B | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
| 7B | ms/tok @ 4th | 127 | 55 | 54 | 76 | 83 | 72 |
| 7B | ms/tok @ 8th | 122 | 43 | 45 | 52 | 56 | 67 |
| 7B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
| 13B | perplexity | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
| 13B | file size | 25.0G | 6.8G | 7.6G | 8.3G | 9.1G | 13G |
| 13B | ms/tok @ 4th | - | 103 | 105 | 148 | 160 | 131 |
| 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 |
| 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |

- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
- recent k-quants improvements and new i-quants
  - [#2707](https://github.com/ggerganov/llama.cpp/pull/2707)
  - [#2807](https://github.com/ggerganov/llama.cpp/pull/2807)
  - [#4773 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4773)
  - [#4856 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4856)
  - [#4861 - importance matrix](https://github.com/ggerganov/llama.cpp/pull/4861)
  - [#4872 - MoE models](https://github.com/ggerganov/llama.cpp/pull/4872)
  - [#4897 - 2-bit quantization](https://github.com/ggerganov/llama.cpp/pull/4897)
  - [#4930 - imatrix for all k-quants](https://github.com/ggerganov/llama.cpp/pull/4930)
  - [#4951 - imatrix on the GPU](https://github.com/ggerganov/llama.cpp/pull/4957)
  - [#4969 - imatrix for legacy quants](https://github.com/ggerganov/llama.cpp/pull/4969)
  - [#4996 - k-quants tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
  - [#5060 - Q3_K_XS](https://github.com/ggerganov/llama.cpp/pull/5060)
  - [#5196 - 3-bit i-quants](https://github.com/ggerganov/llama.cpp/pull/5196)
  - [quantization tuning](https://github.com/ggerganov/llama.cpp/pull/5320), [another one](https://github.com/ggerganov/llama.cpp/pull/5334), and [another one](https://github.com/ggerganov/llama.cpp/pull/5361)
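The exact set of quantization types available depends on the build; running `llama-quantize` with no arguments prints its usage text, which includes the full list of supported types:

```bash
# prints usage, including the list of supported quantization types
./llama-quantize
```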
**Llama 2 7B**

| Quantization | Bits per Weight (BPW) |
|--------------|-----------------------|

@@ -18,7 +100,8 @@ Note: It is synced from llama.cpp `main` every 6 hours.

| Q5_K_M | 5.68 |
| Q6_K | 6.56 |

-## Llama 2 13B
+**Llama 2 13B**

Quantization | Bits per Weight (BPW)
-- | --
Q2_K | 3.34

@@ -31,7 +114,7 @@ Q5_K_S | 5.51

Q5_K_M | 5.67
Q6_K | 6.56

-# Llama 2 70B
+**Llama 2 70B**

Quantization | Bits per Weight (BPW)
-- | --