@@ -4,6 +4,7 @@ The main purpose of this library is to provide fast and correct token counting f
4
4
As a by-product, it can also be used to efficiently encode those chunks if desired.
5
5
6
6
For chunking the following operations are of interest:
7
+
7
8
1) Split text after exactly n tokens at a character boundary.
8
9
1) Count tokens for sub-ranges of a text.
9
10
1) Incrementally count tokens while appending text to a chunk.
@@ -29,6 +30,7 @@ This library presents novel algorithms to compute BPE encodings which address th
29
30
## Prior Art
30
31
31
32
There are mostly three strategies for BPE encoding.
33
+
32
34
1) Trivial solution. Search brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production.
33
35
2) Heap based. Set up a heap with the frequencies. This reduces the linear search time to a logarithmic factor. If done properly, the overall complexity drops to `O(n log n)`.
34
36
3) Split the input into sections of a maximum size first and then process each section individually. In theory this reduces the complexity to `O(n)` if the section size is small enough, but in general it produces different results. To produce the "correct" encoding, one would need to choose split points at token boundaries, which is in general impossible without already having the encoded text.
@@ -89,38 +91,38 @@ If BPE wants to make a different merge decision when it sees the full input, the
89
91
90
92
Given a valid encoding sequence `e_0..e_i` and a valid encoding tuple `e_i e_j`, then `e_0..e_i e_j` is also a valid encoding sequence.
91
93
92
-
93
94
## Novel Algorithm
94
95
95
96
At first glance, it seems impossible to achieve `O(n)` complexity while preserving the encoding output of the original BPE algorithm, since the original BPE algorithm needs to first scan the full input before it can make any encoding decision.
96
-
For instance, the sequence `abab` would be encoded as `ab ab` when the dictionary contains the tokens `a b ab ba bc abc babc ababc` ordered by frequency. But appending a single character `ababc` would result in a pretty different tokenization: `ab a bc`. So without looking ahead it seems impossible to properly tokenize the text.
97
+
For instance, the sequence `abac` would be encoded as `ab ac` when the dictionary contains the tokens `a b c ab cb ac` ordered by frequency. But appending a single character `abacb` would result in a pretty different tokenization: `ab a cb`. So without looking ahead it seems impossible to properly tokenize the text.
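
The example can be checked with a minimal sketch of the trivial strategy from the Prior Art section. It assumes that a token's position in the dictionary is its merge priority (earlier means merged first) and that every single character is itself a token. The function name `naive_bpe` is hypothetical and not part of this library.

```rust
use std::collections::HashMap;

/// Naive BPE: repeatedly merge the adjacent pair whose concatenation has the
/// lowest rank (earliest dictionary position). Worst case O(n^2).
/// Assumption: every character of `text` is itself a dictionary token.
fn naive_bpe(text: &str, dictionary: &[&str]) -> Vec<String> {
    // Rank of a token = its position in the frequency-ordered dictionary.
    let rank: HashMap<&str, usize> = dictionary
        .iter()
        .enumerate()
        .map(|(i, t)| (*t, i))
        .collect();
    // Start with one part per character.
    let mut parts: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the leftmost adjacent pair with the lowest rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..parts.len().saturating_sub(1) {
            let merged = format!("{}{}", parts[i], parts[i + 1]);
            if let Some(&r) = rank.get(merged.as_str()) {
                if best.map_or(true, |(best_rank, _)| r < best_rank) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            // Merge the chosen pair into a single part and rescan.
            Some((_, i)) => {
                let right = parts.remove(i + 1);
                parts[i].push_str(&right);
            }
            // No mergeable pair is left: the encoding is final.
            None => return parts,
        }
    }
}
```

With the dictionary `["a", "b", "c", "ab", "cb", "ac"]`, this sketch encodes `abac` as `ab ac` and `abacb` as `ab a cb`, so the appended `b` changes a token that had already been produced.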
98
+
99
+
The solution is to track the encodings of ALL text prefixes. For our example `abacb` we would get:
97
100
98
-
The solution is to track the encodings of ALL text prefixes. For our example `ababc` we would get:
99
101
-`a` ------> `a`
100
102
-`ab` -----> `ab`
101
103
-`aba` ----> `ab a`
102
-
-`abab` ---> `ab ab`
103
-
-`ababc` --> `ab a bc`
104
+
-`abac` ---> `ab ac`
105
+
-`abacb` --> `ab a cb`
104
106
105
107
This can be done much more efficiently thanks to Corollary IIa, since now only the last token of every prefix has to be remembered:
106
108
107
109
-`a` ------> `a`
108
110
-`ab` -----> `ab`
109
111
-`aba` ----> `a`
110
-
-`abab` ---> `ab`
111
-
-`ababc` --> `bc`
112
+
-`abac` ---> `ac`
113
+
-`abacb` --> `cb`
112
114
113
115
In order to reconstruct the full encoding for a specific prefix, one simply starts with the last token of that prefix, shortens the prefix by the extracted token and looks up the token associated with the shortened prefix and so on until the beginning of the text is reached.
114
116
115
-
For our example prefix `ababc`, this procedure executes the following steps and determines the correct encoding in reverse order:
117
+
For our example prefix `abacb`, this procedure executes the following steps and determines the correct encoding in reverse order:
116
118
117
-
-`ababc` -> `bc`
119
+
-`abacb` -> `cb`
118
120
-`aba` ---> `a`
119
121
-`ab` ----> `ab`
120
122
-`<empty>`
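
The backward walk described above can be sketched as follows. This is an illustration only: the table layout (entry `i` holds the last token of the prefix of length `i + 1` bytes) and the name `reconstruct` are assumptions, not the library's API.

```rust
/// Rebuilds the encoding of the prefix of length `prefix_len` bytes from a
/// table that stores only the last token of every prefix.
/// Assumption: `last_tokens[i]` is the last token of the prefix of length
/// `i + 1`, and no token is empty.
fn reconstruct<'a>(last_tokens: &[&'a str], prefix_len: usize) -> Vec<&'a str> {
    let mut tokens = Vec::new();
    let mut end = prefix_len;
    while end > 0 {
        // Last token of the current prefix.
        let token = last_tokens[end - 1];
        tokens.push(token);
        // Shorten the prefix by the extracted token.
        end -= token.len();
    }
    // Tokens were collected back to front.
    tokens.reverse();
    tokens
}
```

For the table `["a", "ab", "a", "ac", "cb"]` belonging to `abacb`, calling `reconstruct(&table, 5)` visits `cb`, `a`, `ab` and returns `ab a cb`.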
121
123
122
124
The actual challenge is to determine this last token efficiently for every prefix.
123
-
The prefix `abab` could for instance end with either the token `b` or `ab`, but only `ab` leads to a valid encoding sequence.
125
+
The prefix `abac` could for instance end with either the token `c` or `ac`, but only `ac` leads to a valid encoding sequence.
124
126
But Corollary IIa tells us that **one and only one** last token can be the correct one and Corollary IIIa shows us how to find it:
125
127
We only have to check whether a possible next token is "compatible" with its previous token, i.e. whether the two tokens form a valid encoding sequence.
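
To make the idea concrete, here is a deliberately slow sketch that fills the table of last tokens for the toy dictionary. It decides compatibility by re-encoding the concatenation of two tokens with a naive reference BPE, which is an assumption made for clarity and not how the library performs the check. All function names are hypothetical, and the code assumes ASCII input in which every character is a token.

```rust
use std::collections::HashMap;

/// Reference BPE: repeatedly merge the leftmost adjacent pair of lowest rank.
fn bpe(text: &str, rank: &HashMap<&str, usize>) -> Vec<String> {
    let mut parts: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        // (rank, index) tuples order by rank first, then by position.
        let best = (0..parts.len().saturating_sub(1))
            .filter_map(|i| {
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                rank.get(merged.as_str()).map(|&r| (r, i))
            })
            .min();
        match best {
            Some((_, i)) => {
                let right = parts.remove(i + 1);
                parts[i].push_str(&right);
            }
            None => return parts,
        }
    }
}

/// Two tokens are compatible if encoding their concatenation yields exactly
/// these two tokens again.
fn is_compatible(left: &str, right: &str, rank: &HashMap<&str, usize>) -> bool {
    let encoded = bpe(&format!("{}{}", left, right), rank);
    encoded.len() == 2 && encoded[0] == left && encoded[1] == right
}

/// For every prefix of `text`, finds the token that ends the prefix and is
/// compatible with the last token of the remaining, shorter prefix.
fn last_tokens<'a>(text: &str, dictionary: &[&'a str]) -> Vec<&'a str> {
    let rank: HashMap<&str, usize> = dictionary
        .iter()
        .enumerate()
        .map(|(i, t)| (*t, i))
        .collect();
    let mut table: Vec<&'a str> = Vec::new();
    for end in 1..=text.len() {
        let prefix = &text[..end];
        let token = dictionary
            .iter()
            .copied()
            .find(|t| {
                if !prefix.ends_with(*t) {
                    return false;
                }
                let start = end - t.len();
                // A token covering the whole prefix has no predecessor to check.
                start == 0 || is_compatible(table[start - 1], *t, &rank)
            })
            .expect("every prefix must have a valid last token");
        table.push(token);
    }
    table
}
```

For `abacb` and the dictionary `["a", "b", "c", "ab", "cb", "ac"]`, the sketch produces the table `a`, `ab`, `a`, `ac`, `cb`. At the last position it rejects `b` because `ac` followed by `b` re-encodes to `a cb`.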
126
128
@@ -136,6 +138,7 @@ Once that happens the reencoding will be different and the algorithm can stop.
136
138
The actual implementation needs essentially at most 14 lookups for the most complex cases to determine whether two tokens are compatible or not.
137
139
138
140
Putting all these pieces together leads to the following algorithmic sketch:
141
+
139
142
```rust
140
143
let last_tokens = vec![];
141
144
for pos in 0..text.len() {
@@ -166,6 +169,7 @@ The main observation is that often the greedy heuristic picks already the correc
166
169
In the cases where it doesn't, the algorithm has to somehow backtrack to the next tokenization until it converges to the correct solution.
167
170
168
171
Our backtracking implementation solves the enumeration problem as follows:
172
+
169
173
1) If the current tokenization sequence is valid, then append the longest matching token to the right.
170
174
2) Otherwise, replace the rightmost token with the next longest prefix token.
171
175
3) If there is no such token, then remove that token and go back to step 2.
@@ -179,18 +183,96 @@ On average it is about ~4 faster, since the short-cuts usually pay off.
179
183
180
184
## Benchmarks
181
185
182
-
We compared our implementations with the tiktoken implementation on a MacBook Pro on a random input sequence:
183
-
184
-
| Algorithm | Runtime | correct BPE output |
185
-
| ------------ | -------- | ---------- |
186
-
| Greedy | 100 µs | ✘ |
187
-
| Minimal | 300 µs | ✘ |
188
-
| Backtracking | 400 µs | ✔ |
189
-
| Dynamic Programming | 1300 µs | ✔ |
190
-
| TikToken | 1500 µs | ✘ |
191
-
| Heap | 1900 µs | ✔ |
192
-
193
-
As can be seen, our Backtracking implementation beats the TikToken Rust implementation by ~4x.
194
-
And even the fully dynamic programming solution is faster with a more consistent runtime.
195
-
The tuned heap implementation is still quite competitive to TikToken (especially for smaller inputs).
196
-
If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
186
+
We ran several benchmarks to compare the performance of different encoders and a tiktoken implementation.
187
+
For the tiktoken implementation we used the [tiktoken-rs](https://crates.io/crates/tiktoken-rs) library, a wrapper around OpenAI's tiktoken implementation.
188
+
Note that tiktoken does not run BPE on the full input text.
189
+
Instead it splits it into large chunks using a regex and runs BPE on the individual chunks.
190
+
We have not tried to see if that approach is compatible with our BPE implementation.
191
+
We benchmarked the following scenarios:
192
+
193
+
- The first measures encoding runtime for our different encoders and the tiktoken Rust implementation.
194
+
This shows a ~3.5x performance improvement for our fastest correct encoder compared to the tiktoken library.
195
+
196
+
- The second measures incremental encoding runtime, where the text is built up byte-by-byte.
197
+
This mode is not available in tiktoken, which only supports counting/encoding a complete text.
198
+
199
+
- The third measures interval counting runtime, where tokens of sub-slices of a fixed text are counted.
200
+
The data structure we built specifically for this purpose can typically answer those interval counting requests in constant time after the initial linear preprocessing of the text.
201
+
This mode is not available in tiktoken, which only supports counting/encoding a complete text.
202
+
203
+
All benchmarks were run single-threaded on a MacBook Pro M1.
204
+
205
+
### Encoding
206
+
207
+
Encoding is computing the tokens for a given text.
208
+
This benchmark compares several encoders:
209
+
210
+
- The backtracking encoder uses the backtracking algorithm with memoization on top of a string matching automaton.
211
+
- The heap encoder uses a priority heap and a bitmask to represent token positions to implement the traditional BPE algorithm.
212
+
- The table encoder implements the raw dynamic programming algorithm proposed above.
213
+
214
+
Two additional encoders are included that are faster but deviate from the original BPE encoding strategy:
215
+
216
+
- The greedy encoder picks the left-longest token.
217
+
- The minimal encoder computes an encoding with the minimal number of tokens.
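
To illustrate how a faster encoder can deviate from BPE, here is a minimal sketch of the greedy (left-longest) strategy. It scans the dictionary linearly for clarity, whereas a real implementation would use a string matching structure. The name `greedy_encode` is hypothetical, and the code assumes that every character of the input is covered by some non-empty token.

```rust
/// Greedy tokenization: at every position take the longest dictionary token
/// that matches the remaining text.
/// Assumption: tokens are non-empty and every character is itself a token.
fn greedy_encode<'a>(text: &'a str, dictionary: &[&str]) -> Vec<&'a str> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        let rest = &text[pos..];
        // Length of the longest token that is a prefix of the remaining text.
        let len = dictionary
            .iter()
            .filter(|t| rest.starts_with(**t))
            .map(|t| t.len())
            .max()
            .expect("every character must be covered by some token");
        tokens.push(&rest[..len]);
        pos += len;
    }
    tokens
}
```

For the earlier example `abacb` with the dictionary `["a", "b", "c", "ab", "cb", "ac"]`, this sketch returns `ab ac b`, while BPE produces `ab a cb`. Both use three tokens, but only the latter matches the original BPE output.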
218
+
219
+
The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 taken from a random 20000-token text using the o200k token set.
220
+
(All encodings were computed from scratch for each slice.)
221
+
222
+
The graph below shows encoding runtime vs slice length.
223
+
All encoders (except the heap encoder) show the expected linear runtime complexity.
224
+
The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
225
+
The fully dynamic programming solution and the heap implementation are still quite competitive with tiktoken (especially for smaller inputs).
226
+
If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.

### Incremental encoding

Incremental encoding tokenizes a text while appending bytes.
233
+
This type of algorithm is interesting for use cases where a certain token budget must not be exceeded.
234
+
This benchmark shows the runtime for the appending encoder when a text is encoded byte-by-byte.
235
+
For comparison we show the runtime of the backtracking encoder when it encodes the whole text at once.
236
+
237
+
The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 taken from a random 20000-token text using the o200k token set.
238
+
239
+
The graph below shows encoding runtime vs slice length.
240
+
The overall runtime of the byte-by-byte incremental encoder for encoding the full text is comparable to the runtime of the backtracking encoder, with only a constant factor overhead.
241
+
Note that this is a huge win for incremental use cases, which would otherwise require retokenization after each append, resulting in a quadratic slowdown.