Local LLM-assisted text completion.
- Auto-suggest on cursor movement in `Insert` mode
- Toggle the suggestion manually by pressing `Ctrl+F`
- Accept a suggestion with `Tab`
- Accept the first line of a suggestion with `Shift+Tab`
- Control max text generation time (see the config sketch after this list)
- Configure scope of context around the cursor
- Ring context with chunks from open and edited files and yanked text
- Supports very large contexts even on low-end hardware via smart context reuse
- Display performance stats
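The generation time budget and the context scope are controlled through the `g:llama_config` dictionary in your `.vimrc`. Below is a minimal sketch; the option names (`t_max_prompt_ms`, `t_max_predict_ms`, `n_prefix`, `n_suffix`, `ring_n_chunks`) are assumed from the plugin's defaults and may differ between versions, so verify them with `:help llama`:

```vim
" Minimal sketch -- option names assumed, verify with :help llama
"   t_max_prompt_ms    max time budget (ms) for processing the prompt
"   t_max_predict_ms   max time budget (ms) for generating the suggestion
"   n_prefix/n_suffix  lines before/after the cursor sent as local context
"   ring_n_chunks      max number of extra-context chunks kept in the ring buffer
let g:llama_config = {
    \ 't_max_prompt_ms':  500,
    \ 't_max_predict_ms': 1000,
    \ 'n_prefix':         256,
    \ 'n_suffix':         64,
    \ 'ring_n_chunks':    16,
    \ }
```

Depending on the plugin version, keys you leave out may or may not fall back to their defaults, so it is safest to start from the full default dictionary listed in the help file.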
 
With vim-plug:

```vim
Plug 'ggml-org/llama.vim'
```

With Vundle:

```bash
cd ~/.vim/bundle
git clone https://github.com/ggml-org/llama.vim
```

Then add `Plugin 'llama.vim'` to your `.vimrc` in the `vundle#begin()` section.
The plugin requires a llama.cpp server instance to be running at `g:llama_config.endpoint`.
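If the server does not run on the default local port, point the plugin at it explicitly. The URL below is illustrative, assuming the `/infill` route on port 8012 used in the server commands further down; combine the key with any other options you set in `g:llama_config`:

```vim
" Illustrative endpoint -- adjust host and port to where llama-server listens
let g:llama_config = { 'endpoint': 'http://127.0.0.1:8012/infill' }
```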
You can install it with Homebrew:

```bash
brew install llama.cpp
```

Alternatively, build llama.cpp from source or use the latest binaries: https://github.com/ggerganov/llama.cpp/releases
Here are recommended settings, depending on the amount of VRAM that you have:
- More than 16GB VRAM:

  ```bash
  llama-server \
      -hf ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 \
      --ctx-size 0 --cache-reuse 256
  ```

- Less than 16GB VRAM:

  ```bash
  llama-server \
      -hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 \
      --ctx-size 0 --cache-reuse 256
  ```

- Less than 8GB VRAM:

  ```bash
  llama-server \
      -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 \
      --ctx-size 0 --cache-reuse 256
  ```
Use :help llama for more details.
The plugin requires FIM-compatible models: HF collection
The orange text is the generated suggestion. The green text contains performance stats for the FIM request: the currently used context is 15186 tokens and the maximum is 32768. There are 30 chunks in the ring buffer with extra context (out of 64). So far, 1 chunk has been evicted in the current session and there are 0 chunks in queue. The newly computed prompt tokens for this request were 260 and the generated tokens were 25. It took 1245 ms to generate this suggestion after entering the letter c on the current line.
The demo video `llama.vim-0-lq.mp4` demonstrates that the global context is accumulated and maintained across different files, and showcases the overall latency when working in a large codebase.
The plugin aims to be very simple and lightweight and at the same time to provide high-quality and performant local FIM completions, even on consumer-grade hardware. Read more on how this is achieved in the following links:
- Initial implementation and technical description: ggml-org/llama.cpp#9787
- Classic Vim support: ggml-org/llama.cpp#9995
 
