
Commit d9b2bee

Text improvements and fixes
Co-authored-by: Alexander Neubeck <[email protected]>
1 parent 0fdb60f commit d9b2bee


crates/bpe/README.md

Lines changed: 14 additions & 14 deletions
@@ -185,37 +185,37 @@ On average it is about ~4x faster, since the short-cuts usually pay off.

We ran several benchmarks to compare performance between different encoders and with the tiktoken library:

- The first measures encoding runtime for our different encoders and the tiktoken Rust implementation.
  This shows a ~3.5x performance improvement for our fastest correct encoder compared to the tiktoken library.

- The second measures incremental encoding runtime, where the text is built up byte-by-byte.
  This mode is not available in tiktoken, which only supports counting/encoding a complete text.

- The third measures interval counting runtime, where tokens of sub-slices of a fixed text are counted.
  The data structure we built specifically for this purpose can answer those interval counting requests in typically constant time after the initial linear preprocessing of the text.
  This mode is not available in tiktoken, which only supports counting/encoding a complete text.

All benchmarks were run single-threaded on a MacBook Pro M1.

### Encoding

Encoding is computing the tokens for a given text.
This benchmark compares several encoders:

- The backtracking encoder uses the backtracking algorithm with memoisation on top of a string matching automaton.
- The heap encoder uses a priority heap and a bitmask representing token positions to implement the traditional BPE algorithm.
- The table encoder implements the raw dynamic programming algorithm proposed above.
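To make the heap encoder's starting point concrete, below is a minimal sketch of the traditional BPE loop it accelerates. This is illustrative only, not the crate's implementation: `bpe_encode` and its two-entry rank table are inventions for this example, and the linear rescan of all adjacent pairs here is exactly what the real heap encoder replaces with a priority heap of candidate merges plus a bitmask of live token positions.

```rust
use std::collections::HashMap;

// Illustrative sketch of the traditional BPE encoding loop (not the crate's
// actual heap encoder). `ranks` maps a merged byte sequence to its merge
// priority; lower ranks merge first.
fn bpe_encode(text: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<Vec<u8>> {
    // Start from single-byte tokens.
    let mut parts: Vec<Vec<u8>> = text.iter().map(|&b| vec![b]).collect();
    loop {
        // Rescan all adjacent pairs for the lowest-ranked mergeable one.
        // (A priority heap avoids repeating this scan on every iteration.)
        let mut best: Option<(u32, usize)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let cand = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
            if let Some(&r) = ranks.get(&cand) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        // No mergeable pair left: the current parts are the final tokens.
        let Some((_, i)) = best else { return parts };
        let right = parts.remove(i + 1);
        parts[i].extend_from_slice(&right);
    }
}

fn main() {
    // Toy rank table: "lo" merges before "low".
    let ranks: HashMap<Vec<u8>, u32> =
        [(b"lo".to_vec(), 0), (b"low".to_vec(), 1)].into_iter().collect();
    assert_eq!(bpe_encode(b"low", &ranks), vec![b"low".to_vec()]);
}
```

The repeated rescan makes this sketch quadratic in the worst case; keeping the candidate merges in a heap and lazily discarding entries invalidated by earlier merges removes that overhead.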

Two additional encoders are included that are faster but deviate from the original BPE encoding strategy:

- The greedy encoder picks the left-longest token.
- The minimal encoder computes an encoding with the minimal number of tokens.
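The greedy strategy can be sketched as follows. This is a stand-in, not the crate's API: `greedy_encode` and the plain set lookup are inventions for this example, whereas the actual greedy encoder drives a string matching automaton to find the left-longest token.

```rust
use std::collections::HashSet;

// Illustrative greedy tokenizer: at each position, take the longest
// vocabulary token that is a prefix of the remaining input.
fn greedy_encode(text: &[u8], vocab: &HashSet<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        // Try longest matches first; every single byte is assumed to be in
        // the vocabulary, so the loop always makes progress.
        let len = (1..=text.len() - pos)
            .rev()
            .find(|&l| vocab.contains(&text[pos..pos + l]))
            .expect("every single byte must be in the vocabulary");
        tokens.push(text[pos..pos + len].to_vec());
        pos += len;
    }
    tokens
}

fn main() {
    let vocab: HashSet<Vec<u8>> =
        ["a", "b", "c", "ab"].iter().map(|s| s.as_bytes().to_vec()).collect();
    // "ab" is the longest match at position 0, then "c" remains.
    assert_eq!(greedy_encode(b"abc", &vocab), vec![b"ab".to_vec(), b"c".to_vec()]);
}
```

Because the longest local match can preclude a better global split, the output can differ from exact BPE tokens, which is why this encoder deviates from the original encoding strategy.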

The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
(All encodings were computed from scratch for each slice.)

The graph below shows encoding runtime vs slice length.
All encoders (except the heap encoder) show the expected linear runtime complexity.
The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
The fully dynamic programming solution and the heap implementation are still quite competitive to tiktoken (especially for smaller inputs).
If the requirement of correct BPE output can be relaxed, then the greedy approach or the minimal encoding approach are the clear winners.
@@ -224,7 +224,7 @@ If the requirement of correct BPE output can be relaxed, then the Greedy approac

### Incremental encoding

Incremental encoding tokenizes a text while appending bytes. This type of algorithm is interesting for use cases where a certain token budget must not be exceeded.
This benchmark uses two encoders:

- The backtracking encoder, which retokenizes the text from scratch every time it changes.
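The token-budget use case can be sketched as follows. Everything here is hypothetical scaffolding: `count_tokens` is a placeholder tokenizer (one token per 4 bytes), and the retokenize-on-every-byte loop mirrors the from-scratch baseline; a real incremental encoder would update the count as bytes arrive instead.

```rust
// Placeholder tokenizer for illustration: pretend every 4 bytes form a token.
fn count_tokens(text: &[u8]) -> usize {
    (text.len() + 3) / 4
}

// Append bytes from `input` as long as the token budget is not exceeded.
// Recounts from scratch after every byte, like the backtracking baseline.
fn take_within_budget(input: &[u8], budget: usize) -> Vec<u8> {
    let mut out = Vec::new();
    for &b in input {
        out.push(b);
        if count_tokens(&out) > budget {
            out.pop(); // adding this byte would exceed the budget
            break;
        }
    }
    out
}

fn main() {
    // With a budget of 2 placeholder tokens (8 bytes), only 8 bytes fit.
    assert_eq!(take_within_budget(b"0123456789", 2).len(), 8);
}
```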
@@ -251,7 +251,7 @@ This benchmark uses two encoders:

- The interval encoder encodes the original text once and reuses that encoding to count tokens for intervals of the original text.
  The initial encoding time for the interval encoder is comparable to that of the backtracking encoder.
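The preprocess-once, query-cheaply pattern behind the interval encoder can be sketched like this. It is only a schematic stand-in: `IntervalCounter` and its prefix-sum array are inventions for this example, and the crate's actual data structure additionally handles the fact that re-tokenizing a slice can shift token boundaries near the slice ends.

```rust
// Schematic interval token counting: linear preprocessing over the token
// boundaries of one full encoding, then constant-time counting per query.
struct IntervalCounter {
    // prefix[i] = number of tokens of the full encoding ending at offset <= i
    prefix: Vec<usize>,
}

impl IntervalCounter {
    // `token_ends` holds the byte offset at which each token ends.
    fn new(text_len: usize, token_ends: &[usize]) -> Self {
        let mut prefix = vec![0; text_len + 1];
        for &e in token_ends {
            prefix[e] += 1;
        }
        for i in 1..=text_len {
            prefix[i] += prefix[i - 1];
        }
        IntervalCounter { prefix }
    }

    // Number of tokens whose end offset falls in (start, end]: O(1) per query.
    fn count(&self, start: usize, end: usize) -> usize {
        self.prefix[end] - self.prefix[start]
    }
}

fn main() {
    // Three tokens covering [0,2), [2,5), and [5,10).
    let c = IntervalCounter::new(10, &[2, 5, 10]);
    assert_eq!(c.count(0, 10), 3);
    assert_eq!(c.count(2, 5), 1);
}
```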

The benchmark measured the runtime of counting o200k tokens on slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text.

The graph below shows counting runtime vs slice length.
The runtime of the backtracking encoder grows with the length of the slice.
