Commit 5376991

Add note about tiktoken prechunking
1 parent dc4338d commit 5376991

File tree

1 file changed: +6 -1 lines changed

crates/bpe/README.md

Lines changed: 6 additions & 1 deletion
@@ -183,7 +183,12 @@ On average it is about ~4x faster, since the short-cuts usually pay off.
 
 ## Benchmarks
 
-We ran several benchmarks to compare performance of different encoders and the [tiktoken-rs](https://crates.io/crates/tiktoken-rs) library (a wrapper around OpenAI's tiktoken implementation):
+We ran several benchmarks to compare the performance of our different encoders with a tiktoken implementation.
+For the tiktoken implementation we used the [tiktoken-rs](https://crates.io/crates/tiktoken-rs) library, a wrapper around OpenAI's tiktoken implementation.
+Note that tiktoken does not run BPE on the full input text.
+Instead, it splits the input into large chunks using a regex and runs BPE on the individual chunks.
+We have not tried to determine whether that approach is compatible with our BPE implementation.
+We benchmarked the following scenarios:
 
 - The first measures encoding runtime for our different encoders and the tiktoken Rust implementation.
   This shows a ~3.5x performance improvement for our fastest correct encoder compared to the tiktoken library.
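The prechunking the note describes can be illustrated with a small sketch. This is a minimal illustration only: the splitting pattern below is far simpler than tiktoken's real regex (which uses look-around and would need the fancy-regex crate), and `encode_chunk` is a hypothetical stand-in for any BPE encoder, not an API from this crate or from tiktoken-rs.

```rust
use regex::Regex; // assumes the `regex` crate as a dependency

/// Hypothetical stand-in for a real BPE encoder working on one chunk.
fn encode_chunk(chunk: &str) -> Vec<u32> {
    // Placeholder: a real implementation would apply BPE merges here.
    chunk.bytes().map(u32::from).collect()
}

/// Splits the input into chunks with a regex, then encodes each chunk
/// independently, mirroring tiktoken's prechunking approach.
fn encode_with_prechunking(text: &str) -> Vec<u32> {
    // Much simplified stand-in for tiktoken's splitting pattern:
    // words with an optional leading space, punctuation runs, or whitespace.
    let splitter = Regex::new(r" ?\w+|[^\s\w]+|\s+").unwrap();
    let mut tokens = Vec::new();
    for chunk in splitter.find_iter(text) {
        // Because merges never cross chunk boundaries, each chunk can be
        // encoded on its own (and even in parallel).
        tokens.extend(encode_chunk(chunk.as_str()));
    }
    tokens
}
```

One caveat the note hints at: a chunked encoding matches an unchunked one only if no BPE merge would ever span a chunk boundary, which depends on the splitting pattern and the token set together.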
