crates/bpe/README.md
On average it is about ~4x faster, since the short-cuts usually pay off.

We ran several benchmarks to compare performance between different encoders and with the tiktoken library:

- The first measures encoding runtime for our different encoders and the tiktoken Rust implementation.
  This shows a ~3.5x performance improvement for our fastest correct encoder compared to the tiktoken library.
- The second measures incremental encoding runtime, where the text is built up byte-by-byte.
  This mode is not available in tiktoken, which only supports counting/encoding a complete text.
- The third measures interval counting runtime, where tokens of sub-slices of a fixed text are counted.
  The data structure we built specifically for this purpose can answer those interval counting requests in typically constant time after an initial linear preprocessing of the text.
  This mode is not available in tiktoken, which only supports counting/encoding a complete text.
198
-
All benchmarks were run on a MacBook Pro M1.
198
+
All benchmarks were run single-threaded on a MacBook Pro M1.
199
199
200
200
### Encoding

Encoding is computing the tokens for a given text.
This benchmark compares several encoders:

- The backtracking encoder uses the backtracking algorithm with memoisation on top of a string matching automaton.
- The heap encoder uses a priority heap, with a bitmask representing token positions, to implement the traditional BPE algorithm.
- The table encoder implements the raw dynamic programming algorithm proposed above.
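
As a reference point for what these encoders compute, here is a minimal, unoptimised sketch of the traditional BPE merge loop (illustrative only, not this crate's API; the heap encoder replaces the repeated linear scan below with a priority heap over candidate merges):

```rust
use std::collections::HashMap;

// Naive sketch of the traditional BPE merge loop: start from single bytes and
// repeatedly apply the adjacent merge with the lowest rank until none applies.
// `ranks` maps the byte sequence of a merged pair to its priority (lower = earlier).
fn bpe_encode(text: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<Vec<u8>> {
    let mut parts: Vec<Vec<u8>> = text.iter().map(|&b| vec![b]).collect();
    loop {
        // Linear scan for the adjacent pair whose concatenation has the lowest rank.
        // (The heap encoder avoids this rescan by keeping candidates in a priority heap.)
        let mut best: Option<(usize, u32)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let mut cand = parts[i].clone();
            cand.extend_from_slice(&parts[i + 1]);
            if let Some(&r) = ranks.get(&cand) {
                if best.map_or(true, |(_, b)| r < b) {
                    best = Some((i, r));
                }
            }
        }
        let Some((i, _)) = best else { return parts };
        let right = parts.remove(i + 1);
        parts[i].extend_from_slice(&right);
    }
}
```

For example, with ranks for `"ab"` (0) and `"abc"` (1), encoding `"abcab"` first merges both `"ab"` pairs and then `"ab" + "c"`, yielding the tokens `["abc", "ab"]`.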

Two additional encoders are included that are faster but deviate from the original BPE encoding strategy:

- The greedy encoder picks the left-longest token.
- The minimal encoder computes an encoding with the minimal number of tokens.
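
The greedy strategy can be sketched as a left-to-right longest-match loop (a simplification, not this crate's implementation; it assumes every single byte is itself a token so the loop always makes progress):

```rust
use std::collections::HashSet;

// Sketch of the greedy strategy: repeatedly take the longest vocabulary token
// that is a prefix of the remaining input, falling back to a single byte.
// The result can differ from the correct BPE encoding.
fn greedy_encode<'a>(text: &'a [u8], vocab: &HashSet<Vec<u8>>) -> Vec<&'a [u8]> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let len = (1..=rest.len())
            .rev()
            .find(|&l| vocab.contains(&rest[..l])) // longest matching prefix
            .unwrap_or(1); // fall back to a single byte
        out.push(&rest[..len]);
        rest = &rest[len..];
    }
    out
}
```

For example, `greedy_encode(b"xabc", &vocab)` with a vocabulary containing `"ab"` and `"abc"` yields `["x", "abc"]`, which is not necessarily the token sequence that correct BPE would produce.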

The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
(All encodings were computed from scratch for each slice.)

The graph below shows encoding runtime vs slice length.
All encoders (except the heap encoder) show the expected linear runtime complexity.
The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
The fully dynamic programming solution and the heap implementation are still quite competitive with tiktoken (especially for smaller inputs).
If the requirement of correct BPE output can be relaxed, then the greedy approach or the minimal encoding approach are the clear winners.

### Incremental encoding

Incremental encoding tokenizes a text while appending bytes. This type of algorithm is interesting for use cases where a certain token budget must not be exceeded.

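The token-budget use case can be sketched as follows. This is illustrative only: the tokenizer is a stand-in closure (here a naive whitespace word counter), not this crate's API, and an incremental encoder makes each counting call cheap by reusing state from the previous call instead of retokenizing from scratch:

```rust
// Sketch of enforcing a token budget while appending bytes. `count_tokens` is a
// stand-in for a real tokenizer. This version assumes the token count never
// decreases as bytes are appended (true for a word counter; real BPE needs the
// incremental encoder's bookkeeping, since appended bytes can merge tokens).
fn take_within_budget(input: &[u8], budget: usize, count_tokens: impl Fn(&[u8]) -> usize) -> usize {
    let mut len = 0;
    for end in 1..=input.len() {
        if count_tokens(&input[..end]) > budget {
            break; // appending this byte would exceed the budget
        }
        len = end;
    }
    len // number of bytes that fit within the token budget
}
```

With a word-counting stand-in, `take_within_budget(b"one two three", 2, word_count)` returns `8`, i.e. the prefix `"one two "`.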
This benchmark uses two encoders:

- The backtracking encoder, which retokenizes the text from scratch every time it changes.

[…]

- The interval encoder encodes the original text once and reuses that encoding to count tokens for intervals of the original text.
  The initial encoding time for the interval encoder is comparable to that of the backtracking encoder.
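
The idea behind such a structure can be sketched with a prefix-count array (a simplification: it counts only tokens of the original encoding, whereas the crate's data structure also accounts for tokens that change when a slice is retokenized at its boundaries):

```rust
// Sketch of constant-time interval token counting after linear preprocessing.
// `token_ends` lists the byte offset at which each token of the original
// encoding ends. `prefix[i]` = number of tokens ending at or before offset i,
// so a query needs only two array lookups.
struct IntervalCounter {
    prefix: Vec<usize>,
}

impl IntervalCounter {
    fn new(text_len: usize, token_ends: &[usize]) -> Self {
        let mut prefix = vec![0usize; text_len + 1];
        for &end in token_ends {
            prefix[end] += 1; // one token ends at this offset
        }
        for i in 1..=text_len {
            prefix[i] += prefix[i - 1]; // cumulative count up to offset i
        }
        IntervalCounter { prefix }
    }

    /// Number of tokens of the original encoding that end inside `start..=end`.
    fn count(&self, start: usize, end: usize) -> usize {
        self.prefix[end] - self.prefix[start]
    }
}
```

For a 10-byte text whose tokens end at offsets 3, 5, and 10, `count(0, 10)` is `3` and `count(3, 5)` is `1`, each answered with two lookups regardless of interval length.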

The benchmark measured the runtime of counting o200k tokens on slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text.

The graph below shows counting runtime vs slice length.
The runtime of the backtracking encoder grows with the length of the slice.