Run GGUF models in your Android app with ease!
This is an Android binding for llama.cpp written in Kotlin, designed for native Android applications. This project is inspired by (forked from) cui-llama.rn and llama.cpp (inference of LLaMA models in pure C/C++), but is specifically tailored for Android development in Kotlin.
This is a very early alpha version and the API may change in the future.
- Helper class for initialization and context management
- Native Kotlin bindings for llama.cpp
- Support for stopping prompt processing between batches
- Vocabulary-only mode for tokenizer functionality
- Synchronous tokenizer functions
- Context Shift support (from kobold.cpp)
- XTC sampling implementation (sketched after this list)
- Progress callback support
- CPU feature detection (i8mm and dotprod flags)
- Seamless integration with Android development workflow
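For reference, XTC ("Exclude Top Choices") sampling removes the most probable candidates whenever several of them cross a probability threshold, steering generation toward less predictable wording. Below is a minimal sketch of the idea only; it is not this library's internal code, and the parameter names are assumptions:

```kotlin
import kotlin.random.Random

// Illustration of the XTC idea: when at least two candidates exceed the
// threshold, drop every such candidate except the least probable of them.
// `probs` is assumed sorted in descending order of probability.
fun xtcFilter(
    probs: List<Double>,
    threshold: Double = 0.1,     // hypothetical default
    probability: Double = 0.5,   // chance of applying XTC at all
    rng: Random = Random.Default,
): List<Double> {
    if (rng.nextDouble() >= probability) return probs
    val above = probs.count { it >= threshold }
    // keep the last (least probable) above-threshold token and everything below
    return if (above >= 2) probs.drop(above - 1) else probs
}
```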
Add the following to your project's build.gradle:

```groovy
dependencies {
    implementation 'io.github.ljcamargo:llamacpp-kotlin:0.1.0'
}
```
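If your module uses the Kotlin Gradle DSL (build.gradle.kts) instead, the equivalent declaration is:

```kotlin
dependencies {
    implementation("io.github.ljcamargo:llamacpp-kotlin:0.1.0")
}
```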
You'll need a GGUF model file to use this library. You can:
- Download pre-converted GGUF models from HuggingFace
- Convert your own models following the llama.cpp quantization guide
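If you prefer to ship a model inside your APK, a minimal sketch (assuming a hypothetical model.gguf placed in your app's assets) copies it to internal storage so its path can be passed to the loader:

```kotlin
import android.content.Context
import java.io.File

// Copies a GGUF model bundled in assets to internal storage and returns its path.
// "model.gguf" is a placeholder asset name; adjust it to your own file.
fun copyModelFromAssets(context: Context, assetName: String = "model.gguf"): String {
    val outFile = File(context.filesDir, assetName)
    if (!outFile.exists()) {
        context.assets.open(assetName).use { input ->
            outFile.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return outFile.absolutePath
}
```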
Check this example ViewModel, which uses the LlamaHelper class, for basic usage:
```kotlin
import android.util.Log
import androidx.lifecycle.ViewModel
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.onCompletion
import kotlinx.coroutines.flow.onStart
import kotlinx.coroutines.launch
// plus the LlamaHelper import from this library

class MainViewModel : ViewModel() {
    private val viewModelJob = SupervisorJob()
    private val scope = CoroutineScope(Dispatchers.IO + viewModelJob)
    private val llamaHelper by lazy { LlamaHelper(scope) }

    val text = MutableStateFlow("")

    // load the model into memory
    suspend fun loadModel() {
        llamaHelper.load(
            path = "/sdcard/Download/llama.ggmlv3.q4_0.bin",
            contextLength = 2048,
        )
    }

    // the model must be loaded before submitting or an exception will be thrown
    suspend fun submit(prompt: String) {
        // the collector must be set before calling predict; collection is
        // launched concurrently because collect suspends until the flow
        // completes and would otherwise block the predict call below
        scope.launch {
            llamaHelper.setCollector()
                .onStart {
                    Log.i("MainViewModel", "prediction started")
                    // prediction started, prepare your UI
                    // the first token will arrive after a few seconds of warmup
                    text.emit("")
                }
                .onCompletion {
                    Log.i("MainViewModel", "prediction ended")
                    // onCompletion is triggered when prediction finishes or is aborted
                    llamaHelper.unsetCollector() // unset the collector
                }
                .collect { chunk ->
                    Log.i("MainViewModel", "prediction $chunk")
                    // collect chunks of text as they arrive; for example,
                    // emit them to a StateFlow observed by your UI
                    text.value += chunk
                }
        }
        llamaHelper.predict(
            prompt = prompt,
            partialCompletion = true,
        )
    }

    // you can abort a model load or a prediction in progress
    fun abort() {
        Log.i("MainViewModel", "aborting")
        llamaHelper.abort()
    }

    // don't forget to release resources when the ViewModel is destroyed
    override fun onCleared() {
        super.onCleared()
        llamaHelper.abort()
        llamaHelper.release()
    }
}
```
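To observe the streamed text from your UI, collect the text StateFlow. Here is a minimal sketch from an Activity, assuming the ViewModel above:

```kotlin
import android.os.Bundle
import android.widget.TextView
import androidx.activity.viewModels
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.Lifecycle
import androidx.lifecycle.lifecycleScope
import androidx.lifecycle.repeatOnLifecycle
import kotlinx.coroutines.launch

class MainActivity : AppCompatActivity() {
    private val viewModel: MainViewModel by viewModels()

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        val output = TextView(this)
        setContentView(output)

        // load the model once, then render streamed chunks as they arrive
        lifecycleScope.launch {
            viewModel.loadModel()
            repeatOnLifecycle(Lifecycle.State.STARTED) {
                viewModel.text.collect { output.text = it }
            }
        }
    }
}
```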
You can also use LlamaContext.kt directly to manage multiple contexts or to access other advanced features.
- The library currently supports the arm64-v8a and x86_64 ABIs (see the runtime check sketched after this list)
- 64-bit platforms are recommended for better memory allocation
- CPU feature detection helps optimize performance based on device capabilities
- Batch processing can be interrupted, which is crucial for mobile devices with limited processing power
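As a minimal sketch, you can check the device's ABI from app code before loading a model; the CPU-flag helper below is only an illustration of how such detection can be done (the library performs its own detection natively):

```kotlin
import android.os.Build
import java.io.File

// Returns true if the device reports one of the ABIs this library ships natives for.
fun isSupportedAbi(): Boolean =
    Build.SUPPORTED_ABIS.any { it == "arm64-v8a" || it == "x86_64" }

// Rough illustration of CPU feature detection: on arm64, /proc/cpuinfo lists
// "asimddp" for the dot-product extension and "i8mm" for the int8
// matrix-multiply extension.
fun cpuFlags(): Set<String> =
    File("/proc/cpuinfo").readLines()
        .firstOrNull { it.startsWith("Features") }
        ?.substringAfter(":")
        ?.trim()
        ?.split(" ")
        ?.toSet()
        ?: emptySet()
```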
Contributions are welcome! Please feel free to submit a Pull Request.
MIT
This project builds upon the work of several excellent projects:
- llama.cpp by Georgi Gerganov
- cui-llama.rn
- llama.rn