Add FxHash and ShortStringOptimization. #1733
base: main
Conversation
Sounds good! Node bindings are really not necessary!
Do you want this to be reviewed now?
Not yet. I'm still working on a few things:
I think this is ready for review. Also, these are the distributions for the benchmark runs (blue is this PR, red is HEAD). Besides being a bit faster, it also seems to be more consistent (at least on my machine).
Hi! Thanks for the big contribution. Would you mind splitting the PR into (at least) two, one for each optimisation? This would make reviewing easier.
This PR contains performance optimizations (a spreadsheet of benchmark comparisons is linked below). The two main optimizations are:
- Switching to FxHash instead of the standard library's default hasher. FxHash is provided by the rustc_hash crate, which the compiler itself uses internally.
- Switching from String to CompactString, provided by the compact_str crate, which stores strings of up to 24 bytes inline (short-string optimization); many tokens fit within that limit. A minimal sketch of both swaps follows this list.
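As a rough, self-contained sketch of what the two swaps look like (the vocab map and token values here are illustrative stand-ins, not the actual tokenizer internals; assumes rustc_hash and compact_str as dependencies):

```rust
use compact_str::CompactString;
use rustc_hash::FxHashMap;

fn main() {
    // FxHashMap is std's HashMap parameterized with the fast,
    // non-cryptographic Fx hasher from the rustc_hash crate.
    let mut vocab: FxHashMap<CompactString, u32> = FxHashMap::default();

    // CompactString keeps strings of up to 24 bytes inline (on 64-bit targets),
    // so short tokens like these avoid a heap allocation entirely.
    vocab.insert(CompactString::new("hello"), 0);
    vocab.insert(CompactString::new(" world"), 1);

    // Lookups by &str still work because CompactString implements Borrow<str>.
    assert_eq!(vocab.get("hello").copied(), Some(0));
}
```

The trade-off is that FxHash, unlike the standard library's SipHash-based default, offers no HashDoS resistance; it simply hashes faster.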
Progress:
This PR is not fully complete. At this point in time, the following tasks have been completed:
The results of benches/test_tiktoken.py are basically the same (within a margin of error) as HEAD. I think this is caused by unnecessary type conversions; a hypothetical illustration of such a conversion follows.
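Purely as an illustration of the kind of conversion meant here (the helper name and its placement are assumptions, not the actual binding code):

```rust
use compact_str::CompactString;

// Hypothetical helper, not the actual binding code: copying the token into a
// fresh String right before handing it to Python allocates on the heap even
// when the CompactString was stored inline, discarding the optimization at
// exactly the point the Python benchmark measures.
fn token_for_python(token: &CompactString) -> String {
    token.to_string()
}

fn main() {
    let token = CompactString::new("hello");
    let s = token_for_python(&token); // one extra allocation per returned token
    assert_eq!(s, "hello");
}
```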
Benchmarks:
The benchmarks are at the following link: Tokenizer Benchmarks. All benchmarks were run on a MacBook Pro with an M2 Pro chip and 16 GB of memory.
Additionally, the BPE Train vocabulary (huge) benchmark is one that I added (not committed, as that would warrant its own PR). This benchmark uses the One Billion Word Challenge dataset, which clocks in at 4.15 GB.
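The added benchmark itself is not shown in this PR; the following is only a sketch of what a criterion harness over such a corpus could look like, with train_bpe_from_file as a stand-in for the real training entry point and an assumed local path to the corpus:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Stand-in for the real training entry point; it only reads the corpus and
// counts whitespace-separated tokens so the sketch stays self-contained.
fn train_bpe_from_file(path: &str) -> usize {
    std::fs::read_to_string(path)
        .map(|text| text.split_whitespace().count())
        .unwrap_or(0)
}

fn bpe_train_huge(c: &mut Criterion) {
    let mut group = c.benchmark_group("bpe-train");
    // A ~4 GB corpus makes every iteration expensive, so keep the sample count
    // at criterion's minimum.
    group.sample_size(10);
    group.bench_function("BPE Train vocabulary (huge)", |b| {
        b.iter(|| train_bpe_from_file("data/one-billion-words.txt"))
    });
    group.finish();
}

criterion_group!(benches, bpe_train_huge);
criterion_main!(benches);
```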