feat: add skeletoken #273

stephantul · 2025-09-01T08:56:35Z

This PR replaces a lot of tokenization code with simple calls and validation from skeletoken.

This has the following effects:

- The code is much simpler.
- From now on, every tokenizer is a greedy tokenizer, with optional lowercasing (set to True by default).
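To make "greedy tokenizer with optional lowercasing" concrete, here is a minimal sketch. It assumes "greedy" means longest-match-first over a fixed vocabulary. It is not skeletoken's API and not the code in this PR; `greedy_tokenize`, the `lowercase` parameter, the toy vocabulary, and the `[UNK]` fallback are all hypothetical names chosen for illustration.

```python
from typing import List, Set


def greedy_tokenize(
    text: str, vocab: Set[str], lowercase: bool = True, unk: str = "[UNK]"
) -> List[str]:
    """Split text on whitespace, then match the longest vocab entry at each position."""
    if lowercase:
        text = text.lower()
    tokens: List[str] = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, then shrink it.
            end = len(word)
            while end > start and word[start:end] not in vocab:
                end -= 1
            if end == start:
                # No vocab entry starts here: emit unk and skip one character.
                tokens.append(unk)
                start += 1
            else:
                tokens.append(word[start:end])
                start = end
    return tokens


# Hypothetical toy vocabulary containing only lowercase pieces.
toy_vocab = {"token", "tokenizer", "izer", "s"}
lowered = greedy_tokenize("Tokenizers", toy_vocab)
cased = greedy_tokenize("Tokenizers", toy_vocab, lowercase=False)
```

With this toy vocabulary, the lowercased call picks `tokenizer` over the shorter `token`, while the cased call cannot match the leading `T` and falls back to `[UNK]` for the first characters. How casing interacts with a real vocabulary depends on that vocabulary.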

This greatly improves scores on ModernBERT, which uses a cased BPE tokenizer.

This PR also drops Python 3.9 support: skeletoken does not support Python 3.9, because Pydantic's support for unions on 3.9 is limited.
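One plausible reading of the union limitation, stated here as an assumption rather than the author's confirmed reason, is the `X | Y` union syntax (PEP 604), which only evaluates at runtime on Python 3.10 and later. Libraries that resolve annotations at runtime run into this on 3.9. The sketch below uses only the standard library; `supports_pep604_unions`, `Config`, and `resolve_hints` are hypothetical names.

```python
import sys
from typing import Any, Dict, Optional, get_type_hints


def supports_pep604_unions() -> bool:
    """Return True if `int | str` can be evaluated on this interpreter."""
    try:
        eval("int | str")
    except TypeError:
        return False
    return True


class Config:
    # A string annotation is not evaluated until something resolves it.
    value: "int | None" = None


def resolve_hints(cls: type) -> Optional[Dict[str, Any]]:
    """Resolve annotations at runtime; return None if that fails."""
    try:
        return get_type_hints(cls)
    except TypeError:
        return None


is_310_plus = sys.version_info >= (3, 10)
hints = resolve_hints(Config)
```

On 3.10 and later both checks succeed. On 3.9 the annotation can be written as a string but fails once it is resolved, and writing it as `Optional[int]` or `Union[int, None]` avoids the problem.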

codecov · 2025-09-11T09:48:42Z

Codecov Report

❌ Patch coverage is 89.81481% with 11 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
model2vec/tokenizer/tokenizer.py	91.35%	7 Missing ⚠️
model2vec/distill/distillation.py	80.95%	4 Missing ⚠️

Files with missing lines	Coverage Δ
model2vec/distill/inference.py	`97.53% <100.00%> (+0.03%)`	⬆️
model2vec/tokenizer/__init__.py	`100.00% <100.00%> (ø)`
model2vec/tokenizer/datamodels.py	`100.00% <100.00%> (ø)`
model2vec/utils.py	`90.38% <100.00%> (-0.36%)`	⬇️
model2vec/distill/distillation.py	`87.50% <80.95%> (+1.29%)`	⬆️
model2vec/tokenizer/tokenizer.py	`94.24% <91.35%> (+3.71%)`	⬆️


stephantul added 5 commits September 1, 2025 10:54

feat: add skeletoken

829fad6

fix: new version of skeletoken

3f39da4

merge

dde95cd

fix uv lock

4b81fad

fix: add skeletoken

fa45e93

stephantul marked this pull request as ready for review September 11, 2025 09:45

merge

7a11a92

stephantul requested a review from Pringled September 11, 2025 09:49

remove unused import

95d3afd

stephantul closed this Sep 12, 2025
