crates/bpe/README.md (8 additions & 12 deletions)
```diff
@@ -211,7 +211,7 @@ Two additional encoders are included that are faster but deviate from the origin
 - The greedy encoder picks the left-longest token.
 - The minimal encoder computes an encoding with the minimal number of tokens.
 
-The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 1000 from a random 20000 token original text using the o200k token set.
+The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
 (All encodings were computed from scratch for each slice.)
 
 The graph below shows encoding runtime vs slice length.
```
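The greedy rule mentioned in the diff (always pick the left-longest matching token) can be sketched in a few lines of Rust. This is an illustrative toy, not the `bpe` crate's API; the function name, the token set, and the per-character fallback are all made up for the example:

```rust
// Toy sketch of greedy left-longest tokenization (NOT the bpe crate's API).
// At each position we take the longest token that matches the remaining input.
fn greedy_encode(text: &str, tokens: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Longest token that is a prefix of the remaining input.
        let best = tokens
            .iter()
            .filter(|t| rest.starts_with(**t))
            .max_by_key(|t| t.len());
        match best {
            Some(t) => {
                out.push((*t).to_string());
                rest = &rest[t.len()..];
            }
            None => {
                // Fall back to a single character (real BPE falls back to bytes).
                let ch_len = rest.chars().next().unwrap().len_utf8();
                out.push(rest[..ch_len].to_string());
                rest = &rest[ch_len..];
            }
        }
    }
    out
}

fn main() {
    let tokens = ["ab", "abc", "c", "d"];
    // "abc" is consumed as the single left-longest token, not as "ab" + "c".
    assert_eq!(greedy_encode("abcd", &tokens), vec!["abc", "d"]);
}
```

Note that the left-longest choice is what makes this encoder fast but able to deviate from true BPE output, which may merge differently.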
```diff
@@ -224,20 +224,16 @@ If the requirement of correct BPE output can be relaxed, then the Greedy approac
 
 ### Incremental encoding
 
-Incremental encoding tokenizes a text while appending bytes. This type of algorithm is interesting for use cases where a certain token budget must not be exceeded.
-This benchmark uses two encoders:
-
-- The backtracking encoder, which retokenizes the text froms cratch every time it changes.
-- The appending encoder, which supports incremental encoding when bytes are added.
+Incremental encoding tokenizes a text while appending bytes.
+This type of algorithm is interesting for use cases where a certain token budget must not be exceeded.
+This benchmark shows the runtime for the appending encoder when a text is encoded byte-by-byte.
+For comparison we show the runtime of the backtracking encoder when it encodes the whole text at once.
 
-The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 1000 from a random 20000 token original using the o200k token set.
-The backtracking encoder encoded the final text in one go.
-The appending encoder got the text bytes on by one.
+The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original using the o200k token set.
 
 The graph below shows encoding runtime vs slice length.
-Runtime of both encoders grows similarly with slice length.
-The incremental encoder shows a constant factor overhead.
-Note that this is still a huge win for incremental use cases, which would otherwise require retokenization after each append, resulting in a quadratic slowdown.
+The overall runtime of byte-by-byte incremental encoder for encoding the full text is comparable to the runtime of the backtracking encoder, with only a constant factor overhead.
+Note that this is a huge win for incremental use cases, which would otherwise require retokenization after each append, resulting in a quadratic slowdown.
```
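The token-budget use case described in this hunk can be sketched as follows. Everything here is hypothetical: a toy whitespace tokenizer stands in for BPE, and the function names are invented for the example. The point is that this naive approach re-encodes the whole prefix after every append, which is exactly the quadratic behavior an appending encoder avoids:

```rust
// Toy stand-in for a tokenizer: one token per whitespace-separated word.
// (A real use case would call a BPE encoder here.)
fn encode(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Hypothetical helper: longest prefix (on char boundaries) whose token
/// count stays within `budget`. Retokenizing the prefix on every step,
/// as done here, is the quadratic pattern incremental encoding avoids.
fn longest_prefix_within_budget(text: &str, budget: usize) -> &str {
    let mut best = "";
    let boundaries = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()));
    for i in boundaries {
        let prefix = &text[..i];
        if encode(prefix).len() <= budget {
            best = prefix;
        } else {
            break; // for this toy tokenizer, token count only grows with the prefix
        }
    }
    best
}

fn main() {
    let text = "one two three four";
    assert_eq!(longest_prefix_within_budget(text, 2), "one two ");
}
```

With a real BPE tokenizer the token count is not strictly monotone in the prefix length, which is part of why a dedicated appending encoder is needed rather than this early-exit trick.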