
Commit 0f70122

docs: improve section on special tokens
1 parent 4f64377 commit 0f70122

1 file changed: README.md (+14 −4 lines)

Diff for: README.md (+14 −4)

@@ -195,10 +195,12 @@ Note: if you're using `gpt-3.5-*` or `gpt-4-*` and don't see the model you're lo

## API

-### `encode(text: string): number[]`
+### `encode(text: string, encodeOptions?: EncodeOptions): number[]`

Encodes the given text into a sequence of tokens. Use this method when you need to transform a piece of text into the token format that the GPT models can process.

+The optional `encodeOptions` parameter allows you to specify special token handling (see [special tokens](#special-tokens)).
+
Example:

```typescript
@@ -327,7 +329,11 @@ async function processTokens(asyncTokensIterator) {
## Special tokens

There are a few special tokens that are used by the GPT models.
-Not all models support all of these tokens.
+Note that not all models support all of these tokens.
+
+By default, **all special tokens are disallowed**.
+
+The `encode`, `encodeGenerator` and `countTokens` functions accept an `EncodeOptions` parameter to customize special token handling:

### Custom Allowed Sets

@@ -349,12 +355,14 @@ import {

const inputText = `Some Text ${EndOfPrompt}`
const allowedSpecialTokens = new Set([EndOfPrompt])
-const encoded = encode(inputText, allowedSpecialTokens)
+const encoded = encode(inputText, { allowedSpecialTokens })
const expectedEncoded = [8538, 2991, 220, 100276]

expect(encoded).toBe(expectedEncoded)
```

+You may also use a special shorthand for either disallowing or allowing all special tokens, by passing in the string `'all'`, e.g. `{ allowedSpecial: 'all' }`.
+
### Custom Disallowed Sets

Similarly, you can specify custom sets of disallowed special tokens when encoding text. Pass a `Set`
@@ -366,11 +374,13 @@ import { encode, EndOfText } from 'gpt-tokenizer'
const inputText = `Some Text ${EndOfText}`
const disallowedSpecial = new Set([EndOfText])
// throws an error:
-const encoded = encode(inputText, undefined, disallowedSpecial)
+const encoded = encode(inputText, { disallowedSpecial })
```

In this example, an Error is thrown, because the input text contains a disallowed special token.

+If both `allowedSpecialTokens` and `disallowedSpecial` are provided, `disallowedSpecial` takes precedence.
+
## Performance Optimization

### LRU Merge Cache
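
As a quick illustration of the signature change in the first hunk, the sketch below shows the new options-object call shape. It is illustrative only, not part of the commit; it assumes the `encode` and `EndOfPrompt` exports used in the README snippets above and the `allowedSpecialTokens` property name shown in the diff.

```typescript
// Illustrative sketch of the new options-object call shape (not part of the commit).
import { encode, EndOfPrompt } from 'gpt-tokenizer'

// Default behaviour per the added note: all special tokens are disallowed,
// so plain text encodes as before.
const plain = encode('Some Text')

// Opting in to a single special token via the new options object
// (property name as used in the diff's example).
const withEndOfPrompt = encode(`Some Text ${EndOfPrompt}`, {
  allowedSpecialTokens: new Set([EndOfPrompt]),
})

console.log(plain, withEndOfPrompt)
```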
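
The `'all'` shorthand and the note that `countTokens` accepts the same options can be sketched as follows. This is an assumption-laden example: it assumes `countTokens` is exported from the package root like `encode`, and it uses the `allowedSpecial` property name from the prose above (the Set-based example uses `allowedSpecialTokens`, so the exact name should be checked against the published `EncodeOptions` type).

```typescript
// Illustrative sketch of the `'all'` shorthand described in the added paragraph.
// Property name and countTokens export are assumptions; verify against the typings.
import { countTokens, encode, EndOfText } from 'gpt-tokenizer'

const text = `Prompt body ${EndOfText}`

// Allow every special token at once via the shorthand string.
const tokens = encode(text, { allowedSpecial: 'all' })

// The added paragraph says `countTokens` accepts the same options object.
const tokenCount = countTokens(text, { allowedSpecial: 'all' })

console.log(tokens, tokenCount)
```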
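
Finally, a sketch of the disallowed-token behaviour and the precedence rule added in the last hunk. The try/catch wrapper and the commented combined-options call are assumptions for illustration, not part of the diff.

```typescript
// Illustrative sketch of the disallowed-token behaviour shown in the last hunk.
import { encode, EndOfText } from 'gpt-tokenizer'

const inputText = `Some Text ${EndOfText}`

try {
  // EndOfText is explicitly disallowed, so encode is expected to throw.
  encode(inputText, { disallowedSpecial: new Set([EndOfText]) })
} catch (error) {
  console.error('Disallowed special token rejected:', error)
}

// Per the added note, `disallowedSpecial` takes precedence when both options
// are provided, so a call like this should also throw (illustrative only):
// encode(inputText, {
//   allowedSpecialTokens: new Set([EndOfText]),
//   disallowedSpecial: new Set([EndOfText]),
// })
```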

0 commit comments
