Contributor

@stephantul stephantul commented Sep 1, 2025

This PR replaces much of the tokenization code with simple calls to skeletoken and its validation.

This has the following effects:

  1. The code is much simpler.
  2. From now on, every tokenizer is a greedy tokenizer with optional lowercasing (set to True by default).

This greatly improves scores on ModernBERT, which uses a cased BPE tokenizer.
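
For readers unfamiliar with the term, here is a minimal sketch of what "greedy tokenizer with optional lowercasing" could mean, assuming "greedy" refers to longest-match-first lookup against a fixed vocabulary. This is not skeletoken's or model2vec's API: the function `greedy_tokenize`, its parameters, and the toy vocabulary are all hypothetical and chosen only for illustration.

```python
def greedy_tokenize(
    text: str,
    vocab: set[str],
    lowercase: bool = True,
    unk: str = "[UNK]",
) -> list[str]:
    """Hypothetical longest-match-first tokenizer over whitespace-split words."""
    if lowercase:
        text = text.lower()
    tokens: list[str] = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate substring until it is found in the vocabulary.
            while end > start and word[start:end] not in vocab:
                end -= 1
            if end == start:
                # Nothing matched at this position: emit the unknown token
                # for one character and continue with the rest of the word.
                tokens.append(unk)
                start += 1
            else:
                tokens.append(word[start:end])
                start = end
    return tokens


# Toy, all-lowercase vocabulary (an assumption made for this example).
toy_vocab = {"modern", "bert", "token", "izer", "s"}

lowered = greedy_tokenize("ModernBERT tokenizers", toy_vocab)
cased = greedy_tokenize("ModernBERT tokenizers", toy_vocab, lowercase=False)
print(lowered)
print(cased)
```

With this toy vocabulary, the lowercased call splits the input into known pieces, while the cased call falls back to the unknown token for every character of "ModernBERT". The sketch only illustrates how a case mismatch between input and vocabulary affects greedy matching; it does not reproduce the scores mentioned above.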

This PR also drops Python 3.9 support: skeletoken does not support Python 3.9, because Pydantic's support for unions is limited on that version.

@stephantul stephantul marked this pull request as ready for review September 11, 2025 09:45
codecov bot commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 89.81481% with 11 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| model2vec/tokenizer/tokenizer.py | 91.35% | 7 Missing ⚠️ |
| model2vec/distill/distillation.py | 80.95% | 4 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| model2vec/distill/inference.py | 97.53% <100.00%> (+0.03%) ⬆️ |
| model2vec/tokenizer/\_\_init\_\_.py | 100.00% <100.00%> (ø) |
| model2vec/tokenizer/datamodels.py | 100.00% <100.00%> (ø) |
| model2vec/utils.py | 90.38% <100.00%> (-0.36%) ⬇️ |
| model2vec/distill/distillation.py | 87.50% <80.95%> (+1.29%) ⬆️ |
| model2vec/tokenizer/tokenizer.py | 94.24% <91.35%> (+3.71%) ⬆️ |

@stephantul stephantul requested a review from Pringled September 11, 2025 09:49
@stephantul stephantul closed this Sep 12, 2025

2 participants