
feat: setup tokenizer on the provider model config #656

Merged
merged 8 commits into v1.0 on Jan 20, 2025

Conversation

salman1993
Collaborator

Branched off the branch for PR #650 (will merge after that PR).

Only one tokenizer gets used depending on the model, so we put the tokenizer name on the provider and then try our best to map it. This way we avoid loading tokenizers that aren't used.
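The mapping described above could look something like this. This is a minimal sketch: the function name, tokenizer names, and match rules here are illustrative assumptions, not the exact ones in the PR.

```rust
// Hypothetical sketch: map a provider's model name to a single tokenizer
// name, so only the tokenizer that is actually needed ever gets loaded.
fn tokenizer_for_model(model: &str) -> &'static str {
    // Illustrative prefix rules; the real mapping may differ.
    if model.starts_with("gpt-") || model.starts_with("o1") {
        "Xenova/gpt-4o"
    } else if model.starts_with("claude") {
        "Xenova/claude-tokenizer"
    } else {
        // Nothing matched: pick a default; callers may instead choose to
        // download a tokenizer by name (see the fallback discussion below).
        "Xenova/gpt-4o"
    }
}

fn main() {
    assert_eq!(tokenizer_for_model("gpt-4o-mini"), "Xenova/gpt-4o");
    assert_eq!(tokenizer_for_model("claude-3-5-sonnet"), "Xenova/claude-tokenizer");
}
```

Keeping the mapping on the provider (rather than eagerly loading every tokenizer at startup) is what lets the unused tokenizer files be dropped from the binary.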

baxen and others added 5 commits January 18, 2025 22:50
0. Pulls message conversion utils into format module
1. Overhauls the provider tests, which were flaky in CI
2. Removes moderation references for now, as we are investigating how to
   bring them in without the false positives
3. Removes cost tracking, as we don't want to keep up to date with the
   pricing details. We will track tokens instead
It was nice to have Ollama testing the format as an integration in CI, but those paths are already well covered by unit tests, and this increased test times significantly.
@salman1993 salman1993 requested review from wendytang and baxen January 20, 2025 17:46
"Xenova/llama3-tokenizer",
"Xenova/gemma-2-tokenizer",
"Qwen/Qwen2.5-Coder-32B-Instruct",
];
Collaborator Author

not sure if it makes sense to embed 5 tokenizer files in the binary. maybe we just need 2 or 3? @baxen @michaelneale

Collaborator

big +1 let's remove all we can!

Collaborator Author

I just kept the gpt-4o and claude tokenizers and removed the other 3 (the fallback is to download, so it's okay).
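The embed-two-and-download-the-rest strategy could be sketched roughly as below. Everything here is a hypothetical illustration: the function, the stubbed byte contents, and the `download` callback are assumptions, not the PR's actual implementation (which would embed real tokenizer files, e.g. via `include_bytes!`).

```rust
use std::collections::HashMap;

// Hypothetical sketch: only two tokenizer files are bundled into the
// binary; any other tokenizer name falls back to a download.
fn load_tokenizer_bytes(name: &str, download: impl Fn(&str) -> Vec<u8>) -> Vec<u8> {
    // Stand-ins for embedded file contents (the real code would use
    // include_bytes! on the actual tokenizer.json files).
    let embedded: HashMap<&str, &[u8]> = HashMap::from([
        ("Xenova/gpt-4o", b"gpt-4o tokenizer json".as_slice()),
        ("Xenova/claude-tokenizer", b"claude tokenizer json".as_slice()),
    ]);
    match embedded.get(name) {
        Some(bytes) => bytes.to_vec(),
        // Not embedded: fall back to downloading by name.
        None => download(name),
    }
}

fn main() {
    let fake_download = |name: &str| format!("downloaded {}", name).into_bytes();
    let hit = load_tokenizer_bytes("Xenova/gpt-4o", fake_download);
    assert_eq!(hit, b"gpt-4o tokenizer json".to_vec());
    let miss = load_tokenizer_bytes("Qwen/Qwen2.5-Coder-32B-Instruct", fake_download);
    assert!(miss.starts_with(b"downloaded"));
}
```

This keeps the common case (the two most-used tokenizers) fully offline while still supporting any other model at the cost of one fetch.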

Collaborator

@baxen baxen left a comment

LGTM


@salman1993 salman1993 merged commit 2a9aa97 into v1.0 Jan 20, 2025
4 checks passed
michaelneale added a commit that referenced this pull request Jan 20, 2025
* v1.0:
  refactor: use the reuseable bundle-desktop workflow in ci.yml (#658)
  feat: setup tokenizer on the provider model config (#656)
  fix: clean up providers (#650)
@yingjiehe-xyz yingjiehe-xyz deleted the sm/init-tokenizer branch February 5, 2025 21:06