
Add FxHash and ShortStringOptimization. #1733

Open · wants to merge 12 commits into base: main
Conversation


@MeetThePatel MeetThePatel commented Feb 10, 2025

This PR contains performance optimizations (spreadsheet of benchmark comparisons linked below). The two main optimizations are:

  • Switching to FxHash instead of the standard library's default hasher. FxHash is provided by the rustc_hash crate, which the compiler itself uses internally.

  • Switching from String to CompactString, provided by the compact_str crate, which stores strings of up to 24 bytes inline (short-string optimization), a size many tokens fit within.
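Neither change appears in the diff shown here, but the mechanism behind the hasher swap can be sketched in plain std Rust. `rustc_hash::FxHashMap` is essentially a type alias that plugs a faster (non-DoS-resistant) `Hasher` into `std::collections::HashMap`; the toy FNV-1a hasher below is only a stand-in for FxHash to show the shape of the swap, not the PR's actual code. The 24-byte inline capacity of `CompactString` is also no accident: `String` itself is three words (24 bytes) on 64-bit targets, and `CompactString` reuses that same footprint for inline data.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy FNV-1a hasher standing in for FxHash. rustc_hash exposes
// FxHashMap<K, V> as an alias just like `FastMap` below, only with
// its own (much faster) Hasher implementation.
#[derive(Default)]
struct Fnv1a(u64);

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        const OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
        const PRIME: u64 = 0x0000_0100_0000_01b3;
        let mut h = if self.0 == 0 { OFFSET } else { self.0 };
        for &b in bytes {
            h ^= u64::from(b);
            h = h.wrapping_mul(PRIME);
        }
        self.0 = h;
    }
}

// Swapping the hasher is purely a type-level change; call sites keep
// the familiar HashMap API.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut vocab: FastMap<String, u32> = FastMap::default();
    vocab.insert("hello".to_string(), 42);
    assert_eq!(vocab.get("hello"), Some(&42));

    // Why 24 bytes for the inline case: String is ptr + len + cap,
    // i.e. three 8-byte words on 64-bit targets, and CompactString
    // packs inline bytes into that same footprint.
    assert_eq!(std::mem::size_of::<String>(), 24);
}
```

Because the swap is confined to the type alias, the rest of the crate sees no API change, which is what makes this kind of optimization low-risk to land.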

Progress:

This PR is not fully complete. At this point in time, the following tasks have been completed:

  • Convert base crate.
    • Tests are passing, and benchmarks were run (linked below).
  • Convert Python bindings.
    • The same tests that pass on HEAD pass on this branch. There seem to be 2 failing tests on HEAD, but this branch fails exactly the same tests, and in the same manner (i.e. identical output).
    • There seems to be an issue with my pyo3 implementation. The benchmarks for the base crate outperform HEAD by a decent margin, but benches/test_tiktoken.py is basically the same as HEAD (within the margin of error). I think this is caused by unnecessary type conversions.
  • Convert Node bindings.
  • Cleanup.
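As a purely hypothetical illustration of the conversion concern above (not the PR's actual code): if the binding layer keeps materializing owned strings at the Python–Rust boundary while the core works with borrowed ones, every call pays an allocation the core itself no longer needs, which would erase the base-crate speedup in the Python benchmarks.

```rust
use std::collections::HashMap;

// Core lookup: borrows the token, no allocation.
fn core_lookup(vocab: &HashMap<String, u32>, tok: &str) -> Option<u32> {
    vocab.get(tok).copied()
}

// Hypothetical wasteful wrapper, as a binding layer might accidentally
// do: allocates an owned String per call, only to borrow it again.
fn wrapper_lookup(vocab: &HashMap<String, u32>, tok: &str) -> Option<u32> {
    let owned: String = tok.to_string(); // unnecessary conversion
    core_lookup(vocab, &owned)
}

fn main() {
    let mut vocab = HashMap::new();
    vocab.insert("hello".to_string(), 42u32);
    // Both return the same answer; only the allocation behavior differs.
    assert_eq!(core_lookup(&vocab, "hello"), Some(42));
    assert_eq!(wrapper_lookup(&vocab, "hello"), Some(42));
}
```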

Benchmarks:

The benchmarks are at the following link: Tokenizer Benchmarks. All benchmarks were run on a MacBook Pro with an M2 Pro chip and 16 GB of memory.

Additionally, the BPE Train vocabulary (huge) benchmark is one that I added (not committed, as that would warrant its own PR). This benchmark uses the One Billion Word Challenge dataset, which clocks in at 4.15 GB.

@ArthurZucker
Collaborator

Sounds good! Node bindings are really not necessary!

@ArthurZucker
Collaborator

Do you want this to be reviewed now?

@MeetThePatel
Author

Not yet. I'm still working on a few things:

  • See whether there actually is a bottleneck in the Python-Rust interface (regarding extra type conversions), or whether the benchmarks simply aren't pushing the crate hard enough to show a meaningful difference. It should be ~10-15% faster (according to the base crate BPE encode benchmark).
  • Clean up code quality.

@MeetThePatel
Author

I think this is ready for review.

Also, these are the distributions for the benchmark runs (blue is this PR, red is HEAD). Besides being a bit faster, it also seems more consistent (at least on my machine).

HEAD vs PR base crate benchmarks.pdf

@MeetThePatel MeetThePatel marked this pull request as ready for review February 11, 2025 22:55