Commit 5703f8d

Add examples

1 parent 470bbd4 commit 5703f8d

9 files changed: +223 −4 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -26,6 +26,8 @@
 *.a
 *.lib
 
+build
+
 # Executables
 *.exe
 *.out

README.md

Lines changed: 82 additions & 1 deletion

@@ -1,3 +1,84 @@
 # tokenizers-cpp
 
-Cross platform universal tokenizer binding to HF and sentencepiece
+This project provides a cross-platform C++ tokenizer binding library that can be universally deployed.
+It wraps and binds the [HuggingFace tokenizers library](https://github.com/huggingface/tokenizers)
+and [sentencepiece](https://github.com/google/sentencepiece) and provides a minimal common interface in C++.
+
+The main goal of the project is to enable tokenizer deployment for language model applications
+on native platforms with minimal dependencies, and to remove some of the barriers of
+cross-language bindings. This project is developed in part with, and
+used in, [MLC LLM](https://github.com/mlc-ai/mlc-llm). We have tested the following platforms:
+
+- iOS
+- Android
+- Windows
+- Linux
+- Web browser
+
+## Getting Started
+
+The easiest way is to add this project as a submodule and then
+include it via `add_subdirectory` in your CMake project.
+You also need to turn on C++17 support.
+
+- First, make sure you have Rust installed.
+- If you are cross-compiling, make sure you install the necessary Rust target.
+  For example, run `rustup target add aarch64-apple-ios` to install the iOS target.
+- You can then link the library.
+
+See the [example](example) folder for an example CMake project.
+
+### Example Code
+
+```c++
+// - dist/tokenizer.json
+void HuggingFaceTokenizerExample() {
+  // Read blob from file.
+  auto blob = LoadBytesFromFile("dist/tokenizer.json");
+  // Note: all the current factory APIs take an in-memory blob as input.
+  // This gives some flexibility on how these blobs can be read.
+  auto tok = Tokenizer::FromBlobJSON(blob);
+  std::string prompt = "What is the capital of Canada?";
+  // Call Encode to turn the prompt into token ids.
+  std::vector<int> ids = tok->Encode(prompt);
+  // Call Decode to turn the ids back into a string.
+  std::string decoded_prompt = tok->Decode(ids);
+}
+
+void SentencePieceTokenizerExample() {
+  // Read blob from file.
+  auto blob = LoadBytesFromFile("dist/tokenizer.model");
+  // Note: all the current factory APIs take an in-memory blob as input.
+  // This gives some flexibility on how these blobs can be read.
+  auto tok = Tokenizer::FromBlobSentencePiece(blob);
+  std::string prompt = "What is the capital of Canada?";
+  // Call Encode to turn the prompt into token ids.
+  std::vector<int> ids = tok->Encode(prompt);
+  // Call Decode to turn the ids back into a string.
+  std::string decoded_prompt = tok->Decode(ids);
+}
+```
+
+### Extra Details
+
+Currently, the project generates three static libraries:
+- `libtokenizers_c.a`: the C binding to the tokenizers Rust library
+- `libsentencepiece.a`: the sentencepiece static library
+- `libtokenizers_cpp.a`: the C++ binding implementation
+
+If you are using an IDE, you can likely first use cmake to generate
+these libraries and add them to your development environment.
+If you are using cmake, `target_link_libraries(yourlib tokenizers_cpp)`
+will automatically link in the other two libraries.
+You can also check out [MLC LLM](https://github.com/mlc-ai/mlc-llm)
+as an example of a complete LLM chat application integration.
+
+## JavaScript Support
+
+We use emscripten to expose tokenizers-cpp to WASM and JavaScript.
+Check out [web](web) for more details.
+
+## Acknowledgements
+
+This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on.
+It is based on the sentencepiece and tokenizers libraries.

example/.gitignore

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+build
+dist

example/CMakeLists.txt

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
+
+# Example cmake project
+cmake_minimum_required(VERSION 3.18)
+project(tokenizers_cpp_example C CXX)
+
+include(CheckCXXCompilerFlag)
+if(NOT MSVC)
+  check_cxx_compiler_flag("-std=c++17" SUPPORT_CXX17)
+  set(CMAKE_CXX_FLAGS "-std=c++17 ${CMAKE_CXX_FLAGS}")
+  set(CMAKE_CUDA_STANDARD 17)
+else()
+  check_cxx_compiler_flag("/std:c++17" SUPPORT_CXX17)
+  set(CMAKE_CXX_FLAGS "/std:c++17 ${CMAKE_CXX_FLAGS}")
+  set(CMAKE_CUDA_STANDARD 17)
+endif()
+
+# include tokenizers-cpp as a subdirectory
+set(TOKENZIER_CPP_PATH ..)
+add_subdirectory(${TOKENZIER_CPP_PATH} tokenizers EXCLUDE_FROM_ALL)
+
+add_executable(example example.cc)
+
+target_include_directories(example PRIVATE ${TOKENZIER_CPP_PATH}/include)
+
+# You can link tokenizers_cpp; it will automatically link the tokenizers_c
+# and sentencepiece libraries.
+target_link_libraries(example PRIVATE tokenizers_cpp)
+
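The per-compiler flag handling above works, but modern target-based CMake can express the same C++17 requirement more compactly. A sketch of an equivalent `CMakeLists.txt` using `target_compile_features` (same targets and paths as above; untested against this repository):

```cmake
cmake_minimum_required(VERSION 3.18)
project(tokenizers_cpp_example C CXX)

# Pull in tokenizers-cpp as a subdirectory, as in the example project.
set(TOKENZIER_CPP_PATH ..)
add_subdirectory(${TOKENZIER_CPP_PATH} tokenizers EXCLUDE_FROM_ALL)

add_executable(example example.cc)
# Request C++17 on the target itself instead of editing CMAKE_CXX_FLAGS;
# CMake picks -std=c++17 or /std:c++17 depending on the compiler.
target_compile_features(example PRIVATE cxx_std_17)
target_include_directories(example PRIVATE ${TOKENZIER_CPP_PATH}/include)
target_link_libraries(example PRIVATE tokenizers_cpp)
```

This keeps the requirement scoped to the `example` target rather than leaking flags into every target in the build.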

example/README.md

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+# Example Project
+
+This is an example cmake project that uses tokenizers-cpp.
+
+
+```bash
+./build_and_run.sh
+```

example/build_and_run.sh

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
+#!/bin/bash
+
+# build
+mkdir -p build
+cd build
+cmake ..
+make -j8
+cd ..
+# get example files
+
+mkdir -p dist
+cd dist
+if [ ! -f "tokenizer.model" ]; then
+  wget https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer.model
+fi
+if [ ! -f "tokenizer.json" ]; then
+  wget https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1/resolve/main/tokenizer.json
+fi
+cd ..
+
+# run
+echo "---Running example----"
+./build/example
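The script downloads each tokenizer file only if it is not already present, so re-runs skip the fetch. That pattern can be factored into a small helper; a sketch (the helper name `fetch_if_missing` is my own, not part of the commit):

```shell
#!/bin/bash
# Sketch: reusable "download only if missing" helper, mirroring the
# if [ ! -f ... ]; then wget ...; fi pattern in build_and_run.sh.
fetch_if_missing() {
  local file="$1" url="$2"
  if [ ! -f "$file" ]; then
    # Remove the partial file if the download fails, so a re-run retries it.
    wget -q -O "$file" "$url" || { echo "download failed: $url" >&2; rm -f "$file"; return 1; }
  fi
}

# Files that already exist are left untouched, so this call does not hit
# the network (the file below simulates an already-downloaded artifact).
touch tokenizer.json.stamp
fetch_if_missing tokenizer.json.stamp "https://example.com/tokenizer.json"
```

A failed download returns nonzero instead of leaving a truncated file behind, which the original script does not guard against.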

example/example.cc

Lines changed: 76 additions & 0 deletions

@@ -0,0 +1,76 @@
+#include <tokenizers_cpp.h>
+
+#include <fstream>
+#include <iostream>
+#include <string>
+
+using tokenizers::Tokenizer;
+
+std::string LoadBytesFromFile(const std::string& path) {
+  std::ifstream fs(path, std::ios::in | std::ios::binary);
+  if (fs.fail()) {
+    std::cerr << "Cannot open " << path << std::endl;
+    exit(1);
+  }
+  std::string data;
+  fs.seekg(0, std::ios::end);
+  size_t size = static_cast<size_t>(fs.tellg());
+  fs.seekg(0, std::ios::beg);
+  data.resize(size);
+  fs.read(data.data(), size);
+  return data;
+}
+
+void PrintEncodeResult(const std::vector<int>& ids) {
+  std::cout << "tokens=[";
+  for (size_t i = 0; i < ids.size(); ++i) {
+    if (i != 0) std::cout << ", ";
+    std::cout << ids[i];
+  }
+  std::cout << "]" << std::endl;
+}
+
+// SentencePiece tokenizer
+// - dist/tokenizer.model
+void SentencePieceTokenizerExample() {
+  // Read blob from file.
+  auto blob = LoadBytesFromFile("dist/tokenizer.model");
+  // Note: all the current factory APIs take an in-memory blob as input.
+  // This gives some flexibility on how these blobs can be read.
+  auto tok = Tokenizer::FromBlobSentencePiece(blob);
+  std::string prompt = "What is the capital of Canada?";
+  // Call Encode to turn the prompt into token ids.
+  std::vector<int> ids = tok->Encode(prompt);
+  // Call Decode to turn the ids back into a string.
+  std::string decoded_prompt = tok->Decode(ids);
+
+  // Print the encoded result.
+  std::cout << "SentencePiece tokenizer: " << std::endl;
+  PrintEncodeResult(ids);
+  std::cout << "decode=\"" << decoded_prompt << "\"" << std::endl;
+}
+
+// HF tokenizer
+// - dist/tokenizer.json
+void HuggingFaceTokenizerExample() {
+  // Read blob from file.
+  auto blob = LoadBytesFromFile("dist/tokenizer.json");
+  // Note: all the current factory APIs take an in-memory blob as input.
+  // This gives some flexibility on how these blobs can be read.
+  auto tok = Tokenizer::FromBlobJSON(blob);
+  std::string prompt = "What is the capital of Canada?";
+  // Call Encode to turn the prompt into token ids.
+  std::vector<int> ids = tok->Encode(prompt);
+  // Call Decode to turn the ids back into a string.
+  std::string decoded_prompt = tok->Decode(ids);
+
+  // Print the encoded result.
+  std::cout << "HF tokenizer: " << std::endl;
+  PrintEncodeResult(ids);
+  std::cout << "decode=\"" << decoded_prompt << "\"" << std::endl;
+}
+
+int main(int argc, char* argv[]) {
+  SentencePieceTokenizerExample();
+  HuggingFaceTokenizerExample();
+}
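`LoadBytesFromFile` in the example terminates the process with `exit(1)` when a file is missing, which is fine for a demo but awkward for library-style code. A sketch of a variant that reports failure to the caller via `std::optional` instead (the name `TryLoadBytesFromFile` is my own; it is not part of this commit):

```cpp
#include <fstream>
#include <optional>
#include <string>

// Variant of the example's LoadBytesFromFile that returns std::nullopt on
// failure instead of exiting, leaving error handling to the caller.
std::optional<std::string> TryLoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (fs.fail()) return std::nullopt;
  std::string data;
  fs.seekg(0, std::ios::end);
  data.resize(static_cast<size_t>(fs.tellg()));
  fs.seekg(0, std::ios::beg);
  fs.read(data.data(), static_cast<std::streamsize>(data.size()));
  return data;
}
```

A caller can then fall back gracefully, e.g. `if (auto blob = TryLoadBytesFromFile("dist/tokenizer.json")) { /* use *blob */ }`.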

src/tokenizers_c.h renamed to include/tokenizers_c.h

Lines changed: 1 addition & 1 deletion

@@ -37,4 +37,4 @@ void tokenizers_free(TokenizerHandle handle);
 #ifdef __cplusplus
 }
 #endif
-#endif  // TOKENIZERS_C_H_
+#endif  // TOKENIZERS_C_H_

src/huggingface_tokenizer.cc

Lines changed: 1 addition & 2 deletions

@@ -4,10 +4,9 @@
  * \file huggingface_tokenizer.cc
  * \brief Huggingface tokenizer
  */
+#include <tokenizers_c.h>
 #include <tokenizers_cpp.h>
 
-#include "tokenizers_c.h"
-
 namespace tokenizers {
 /*!
  * \brief A simple c++ header of tokenizer via C API.

0 commit comments