Commit dc4338d (1 parent: 8c9e05b)

Rephrase incremental benchmark description
crates/bpe/README.md

Lines changed: 8 additions & 12 deletions
```diff
@@ -211,7 +211,7 @@ Two additional encoders are included that are faster but deviate from the origin
 - The greedy encoder picks the left-longest token.
 - The minimal encoder computes an encoding with the minimal number of tokens.
 
-The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 1000 from a random 20000 token original text using the o200k token set.
+The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
 (All encodings were computed from scratch for each slice.)
 
 The graph below shows encoding runtime vs slice length.
```
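To make the measured setup concrete, here is a minimal sketch of such a slice benchmark in Rust. It is illustrative only: `encode` is a hypothetical placeholder rather than the crate's backtracking encoder, and the data is synthetic rather than the random 20000 token o200k text used by the real benchmark.

```rust
use std::time::Instant;

/// Hypothetical placeholder; the real benchmark would call the crate's
/// backtracking encoder with the o200k token set here.
fn encode(bytes: &[u8]) -> Vec<u32> {
    // Stand-in logic: one "token" per byte, just to make the loop runnable.
    bytes.iter().map(|&b| b as u32).collect()
}

fn main() {
    // Synthetic stand-in for the random 20000 token original text.
    let text: Vec<u8> = (0..200_000).map(|i| (i % 251) as u8).collect();

    // Encode each slice from scratch, as the benchmark description states.
    for &len in &[10usize, 100, 1_000, 10_000] {
        let start = Instant::now();
        let tokens = encode(&text[..len]);
        println!("len {:>6}: {:>6} tokens in {:?}", len, tokens.len(), start.elapsed());
    }
}
```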
```diff
@@ -224,20 +224,16 @@ If the requirement of correct BPE output can be relaxed, then the Greedy approac
 
 ### Incremental encoding
 
-Incremental encoding tokenizes a text while appending bytes. This type of algorithm is interesting for use cases where a certain token budget must not be exceeded.
-This benchmark uses two encoders:
-
-- The backtracking encoder, which retokenizes the text froms cratch every time it changes.
-- The appending encoder, which supports incremental encoding when bytes are added.
+Incremental encoding tokenizes a text while appending bytes.
+This type of algorithm is interesting for use cases where a certain token budget must not be exceeded.
+This benchmark shows the runtime for the appending encoder when a text is encoded byte-by-byte.
+For comparison we show the runtime of the backtracking encoder when it encodes the whole text at once.
 
-The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 1000 from a random 20000 token original using the o200k token set.
-The backtracking encoder encoded the final text in one go.
-The appending encoder got the text bytes on by one.
+The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original using the o200k token set.
 
 The graph below shows encoding runtime vs slice length.
-Runtime of both encoders grows similarly with slice length.
-The incremental encoder shows a constant factor overhead.
-Note that this is still a huge win for incremental use cases, which would otherwise require retokenization after each append, resulting in a quadratic slowdown.
+The overall runtime of byte-by-byte incremental encoder for encoding the full text is comparable to the runtime of the backtracking encoder, with only a constant factor overhead.
+Note that this is a huge win for incremental use cases, which would otherwise require retokenization after each append, resulting in a quadratic slowdown.
 
 ![appending runtime comparison](./benches/result/appending-o200k.svg)
 
```
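The quadratic-slowdown argument in the new wording is easy to see in code. The sketch below is a toy model, not this crate's API: `ToyAppendingEncoder` is a hypothetical stand-in for an appending encoder that keeps tokenization state between pushes, so each appended byte costs O(1), whereas re-encoding the whole prefix after every append would do O(n) work per byte and O(n^2) work overall. The token budget check mirrors the use case named in the diff.

```rust
/// Toy stand-in for an appending encoder. A real encoder maintains BPE
/// state across pushes; here a trivial "one token per 4 bytes" rule keeps
/// the example self-contained. Names are illustrative, not the crate's API.
struct ToyAppendingEncoder {
    bytes: Vec<u8>,
    token_count: usize,
}

impl ToyAppendingEncoder {
    fn new() -> Self {
        Self { bytes: Vec::new(), token_count: 0 }
    }

    /// Append one byte and update the running token count incrementally in
    /// O(1), instead of retokenizing the whole prefix (which would make
    /// processing n bytes cost O(n^2) in total).
    fn push(&mut self, byte: u8) {
        self.bytes.push(byte);
        self.token_count = (self.bytes.len() + 3) / 4;
    }

    fn token_count(&self) -> usize {
        self.token_count
    }
}

fn main() {
    let text = b"incremental encoding keeps a running tokenization state";
    let budget = 10; // token budget that must not be exceeded

    let mut enc = ToyAppendingEncoder::new();
    let mut accepted = 0;
    for &byte in text {
        enc.push(byte);
        if enc.token_count() > budget {
            break; // this byte pushed the count past the budget, so stop
        }
        accepted += 1;
    }
    println!("kept {accepted} bytes within a budget of {budget} tokens");
}
```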