
Add FxHash and ShortStringOptimization. #1733

Open · wants to merge 12 commits into base: main
Conversation


@MeetThePatel MeetThePatel commented Feb 10, 2025

This PR contains performance optimizations (spreadsheet of benchmark comparisons linked below). The two main optimizations are:

  • Switching to FxHash instead of the standard library's default hasher. FxHash is provided by the rustc_hash crate, which the compiler itself uses internally.

  • Switching from String to CompactString, provided by the compact_str crate, which stores strings of up to 24 bytes inline (short-string optimization), a size many tokens fit within.
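Neither change appears in the diff shown here, but the mechanism behind the hasher swap can be sketched in plain std Rust. `rustc_hash::FxHashMap` is essentially a type alias that plugs a faster (non-DoS-resistant) `Hasher` into `std::collections::HashMap`; the toy FNV-1a hasher below is only a stand-in for FxHash to show the shape of the swap, not the PR's actual code. The 24-byte inline capacity of `CompactString` is also no accident: `String` itself is three words (24 bytes) on 64-bit targets, and `CompactString` reuses that same footprint for inline data.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy FNV-1a hasher standing in for FxHash. rustc_hash exposes
// FxHashMap<K, V> as an alias just like `FastMap` below, only with
// its own (much faster) Hasher implementation.
#[derive(Default)]
struct Fnv1a(u64);

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        const OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
        const PRIME: u64 = 0x0000_0100_0000_01b3;
        let mut h = if self.0 == 0 { OFFSET } else { self.0 };
        for &b in bytes {
            h ^= u64::from(b);
            h = h.wrapping_mul(PRIME);
        }
        self.0 = h;
    }
}

// Swapping the hasher is purely a type-level change; call sites keep
// the familiar HashMap API.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut vocab: FastMap<String, u32> = FastMap::default();
    vocab.insert("hello".to_string(), 42);
    assert_eq!(vocab.get("hello"), Some(&42));

    // Why 24 bytes for the inline case: String is ptr + len + cap,
    // i.e. three 8-byte words on 64-bit targets, and CompactString
    // packs inline bytes into that same footprint.
    assert_eq!(std::mem::size_of::<String>(), 24);
}
```

Because the swap is confined to the type alias, the rest of the crate sees no API change, which is what makes this kind of optimization low-risk to land.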

Progress:

This PR is not fully complete. At this point in time, the following tasks have been completed:

  • Convert base crate.
    • Tests are passing, and benchmarks were run (linked below).
  • Convert Python bindings.
    • The same tests that pass on HEAD pass on this branch. There seem to be 2 failing tests on HEAD, but this branch fails exactly the same tests, and in the same manner (i.e. identical output).
    • There seems to be an issue with my pyo3 implementation. The benchmarks for the base crate outperform HEAD by a decent margin, but benches/test_tiktoken.py is basically the same as HEAD (within the margin of error). I think this is caused by unnecessary type conversions.
  • Convert Node bindings.
  • Cleanup.
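As a purely hypothetical illustration of the conversion concern above (not the PR's actual code): if the binding layer keeps materializing owned strings at the Python–Rust boundary while the core works with borrowed ones, every call pays an allocation the core itself no longer needs, which would erase the base-crate speedup in the Python benchmarks.

```rust
use std::collections::HashMap;

// Core lookup: borrows the token, no allocation.
fn core_lookup(vocab: &HashMap<String, u32>, tok: &str) -> Option<u32> {
    vocab.get(tok).copied()
}

// Hypothetical wasteful wrapper, as a binding layer might accidentally
// do: allocates an owned String per call, only to borrow it again.
fn wrapper_lookup(vocab: &HashMap<String, u32>, tok: &str) -> Option<u32> {
    let owned: String = tok.to_string(); // unnecessary conversion
    core_lookup(vocab, &owned)
}

fn main() {
    let mut vocab = HashMap::new();
    vocab.insert("hello".to_string(), 42u32);
    // Both return the same answer; only the allocation behavior differs.
    assert_eq!(core_lookup(&vocab, "hello"), Some(42));
    assert_eq!(wrapper_lookup(&vocab, "hello"), Some(42));
}
```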

Benchmarks:

The benchmarks are at the following link: Tokenizer Benchmarks. All benchmarks were run on a MacBook Pro with an M2 Pro chip and 16 GB of memory.

Additionally, the BPE Train vocabulary (huge) benchmark is one that I added (not committed, as that would warrant its own PR). This benchmark uses the One Billion Word Challenge dataset, which clocks in at 4.15 GB.

@ArthurZucker
Collaborator

Sounds good! Node bindings are really not necessary!

@ArthurZucker
Collaborator

Do you want this to be reviewed now?

@MeetThePatel
Author

Not yet. I'm still working on a few things:

  • See whether there actually is a bottleneck in the Python-Rust interface (regarding extra type conversions), or whether the benchmarks simply aren't pushing the crate hard enough to show a meaningful difference. It should be ~10-15% faster (according to the base crate BPE encode benchmark).
  • Clean up code quality.

@MeetThePatel
Author

I think this is ready for review.

Also, these are the distributions for the benchmark runs (blue is this PR, red is HEAD). Besides being a bit faster, it also seems more consistent (at least on my machine).

HEAD vs PR base crate benchmarks.pdf

@MeetThePatel MeetThePatel marked this pull request as ready for review February 11, 2025 22:55