
feat: setup tokenizer on the provider model config #656

Merged
merged 8 commits into v1.0 on Jan 20, 2025

Conversation

salman1993
Collaborator

Branched off the branch for PR #650 (will merge after that PR).

Only one tokenizer gets used depending on the model, so we put the tokenizer name on the provider and then try our best to map it. This way we avoid loading tokenizers that aren't used.
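The mapping described above could look something like this. This is a minimal sketch: the function name, tokenizer names, and match rules here are illustrative assumptions, not the exact ones in the PR.

```rust
// Hypothetical sketch: map a provider's model name to a single tokenizer
// name, so only the tokenizer that is actually needed ever gets loaded.
fn tokenizer_for_model(model: &str) -> &'static str {
    // Illustrative prefix rules; the real mapping may differ.
    if model.starts_with("gpt-") || model.starts_with("o1") {
        "Xenova/gpt-4o"
    } else if model.starts_with("claude") {
        "Xenova/claude-tokenizer"
    } else {
        // Nothing matched: pick a default; callers may instead choose to
        // download a tokenizer by name (see the fallback discussion below).
        "Xenova/gpt-4o"
    }
}

fn main() {
    assert_eq!(tokenizer_for_model("gpt-4o-mini"), "Xenova/gpt-4o");
    assert_eq!(tokenizer_for_model("claude-3-5-sonnet"), "Xenova/claude-tokenizer");
}
```

Keeping the mapping on the provider (rather than eagerly loading every tokenizer at startup) is what lets the unused tokenizer files be dropped from the binary.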

baxen and others added 5 commits January 18, 2025 22:50
0. Pulls message conversion utils into format module
1. Overhauls the provider tests, which were flaky in CI
2. Removes moderation references for now, as we are investigating how to
   bring them in without the false positives
3. Removes cost tracking, as we don't want to keep up to date with the
   pricing details. We will track tokens instead
It was nice to have Ollama testing the format as an integration in CI, but those paths are already well covered by unit tests, and this increased test times significantly.
@salman1993 salman1993 requested review from wendytang and baxen January 20, 2025 17:46
"Xenova/llama3-tokenizer",
"Xenova/gemma-2-tokenizer",
"Qwen/Qwen2.5-Coder-32B-Instruct",
];
Collaborator Author

not sure if it makes sense to embed 5 tokenizer files in the binary. maybe we just need 2 or 3? @baxen @michaelneale

Collaborator

big +1 let's remove all we can!

Collaborator Author

I just kept the gpt-4o and claude tokenizers and removed the other 3 (the fallback is to download, so it's okay).
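The embed-two-and-download-the-rest strategy could be sketched roughly as below. Everything here is a hypothetical illustration: the function, the stubbed byte contents, and the `download` callback are assumptions, not the PR's actual implementation (which would embed real tokenizer files, e.g. via `include_bytes!`).

```rust
use std::collections::HashMap;

// Hypothetical sketch: only two tokenizer files are bundled into the
// binary; any other tokenizer name falls back to a download.
fn load_tokenizer_bytes(name: &str, download: impl Fn(&str) -> Vec<u8>) -> Vec<u8> {
    // Stand-ins for embedded file contents (the real code would use
    // include_bytes! on the actual tokenizer.json files).
    let embedded: HashMap<&str, &[u8]> = HashMap::from([
        ("Xenova/gpt-4o", b"gpt-4o tokenizer json".as_slice()),
        ("Xenova/claude-tokenizer", b"claude tokenizer json".as_slice()),
    ]);
    match embedded.get(name) {
        Some(bytes) => bytes.to_vec(),
        // Not embedded: fall back to downloading by name.
        None => download(name),
    }
}

fn main() {
    let fake_download = |name: &str| format!("downloaded {}", name).into_bytes();
    let hit = load_tokenizer_bytes("Xenova/gpt-4o", fake_download);
    assert_eq!(hit, b"gpt-4o tokenizer json".to_vec());
    let miss = load_tokenizer_bytes("Qwen/Qwen2.5-Coder-32B-Instruct", fake_download);
    assert!(miss.starts_with(b"downloaded"));
}
```

This keeps the common case (the two most-used tokenizers) fully offline while still supporting any other model at the cost of one fetch.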

Collaborator

@baxen baxen left a comment

LGTM


@salman1993 salman1993 merged commit 2a9aa97 into v1.0 Jan 20, 2025
4 checks passed
michaelneale added a commit that referenced this pull request Jan 20, 2025
* v1.0:
  refactor: use the reuseable bundle-desktop workflow in ci.yml (#658)
  feat: setup tokenizer on the provider model config (#656)
  fix: clean up providers (#650)
@yingjiehe-xyz yingjiehe-xyz deleted the sm/init-tokenizer branch February 5, 2025 21:06