Support gemma3 HF tokenizer.json #96
Open
This PR beefs up the HF tokenizer.
HF tokenizer changes
- Added the `Normalizer` base class and its derived classes (`ReplaceNormalizer` and `SequenceNormalizer`) to support customizable string normalization. A factory class, `NormalizerConfig`, was added to simplify normalizer creation and configuration.
- Added the `HFWord` structure and implemented HF-specific token merging logic in `_byte_pair_merge`. Overrode `byte_pair_encode_` to integrate normalization and pre-tokenization. [1] [2]
- Updated `HFTokenizer`, allowing it to load and use normalizers from JSON configuration during tokenizer initialization.

BPE Improvements:
- Added the `MergeMap` type and the `buildMergeRanksMap` utility function to handle BPE merge rules efficiently. This ensures proper handling of token merging based on ranks.
- Made the BPE methods (`_byte_pair_merge` and `byte_pair_encode_`) overridable to allow derived classes to customize BPE merging logic. Updated `BPETokenizerBase` to use pre-computed merge ranks for token merging. [1] [2]

Tested manually with the Gemma3 tokenizer.json.
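
For readers who have not opened the diff, the normalizer hierarchy described above could look roughly like the sketch below. It is illustrative only and is not the code in this PR: the `normalize` method name, the constructor parameters, and the restriction to literal (non-regex) patterns are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Assumed interface: a normalizer maps an input string to a normalized copy.
class Normalizer {
 public:
  virtual ~Normalizer() = default;
  virtual std::string normalize(const std::string& input) const = 0;
};

// Replaces every occurrence of a literal pattern with a fixed string.
// This sketch handles literal patterns only.
class ReplaceNormalizer : public Normalizer {
 public:
  ReplaceNormalizer(std::string pattern, std::string content)
      : pattern_(std::move(pattern)), content_(std::move(content)) {}

  std::string normalize(const std::string& input) const override {
    if (pattern_.empty()) {
      return input;  // nothing to match against
    }
    std::string out;
    std::size_t pos = 0;
    while (true) {
      const std::size_t hit = input.find(pattern_, pos);
      if (hit == std::string::npos) {
        out.append(input, pos, std::string::npos);  // copy the tail
        break;
      }
      out.append(input, pos, hit - pos);  // copy text before the match
      out += content_;                    // emit the replacement
      pos = hit + pattern_.size();
    }
    return out;
  }

 private:
  std::string pattern_;
  std::string content_;
};

// Applies a list of normalizers in order, feeding each output to the next.
class SequenceNormalizer : public Normalizer {
 public:
  explicit SequenceNormalizer(std::vector<std::unique_ptr<Normalizer>> steps)
      : steps_(std::move(steps)) {
    for (const auto& step : steps_) {
      assert(step != nullptr);  // every step must be a real normalizer
    }
  }

  std::string normalize(const std::string& input) const override {
    std::string current = input;
    for (const auto& step : steps_) {
      current = step->normalize(current);
    }
    return current;
  }

 private:
  std::vector<std::unique_ptr<Normalizer>> steps_;
};
```

With this shape, a tokenizer would hold a single `Normalizer` pointer and call `normalize` once before pre-tokenization, regardless of how many steps the JSON configuration lists.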
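
The rank-based merging mentioned under BPE Improvements can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the shape of `MergeMap`, the signature of `buildMergeRanksMap`, and the helper `applyMerges` are hypothetical, and the loop is a simple quadratic scan rather than an optimized merge.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

using TokenId = std::uint64_t;
using TokenPair = std::pair<TokenId, TokenId>;
using Vocab = std::map<std::string, TokenId>;

// Assumed shape: (left id, right id) -> rank of the rule and the merged id.
struct MergeInfo {
  std::uint32_t rank;
  TokenId merged;
};
using MergeMap = std::map<TokenPair, MergeInfo>;

// Hypothetical signature. Earlier entries in `merges` get lower ranks,
// which means higher priority. Rules whose tokens are missing from the
// vocabulary are skipped.
MergeMap buildMergeRanksMap(
    const std::vector<std::pair<std::string, std::string>>& merges,
    const Vocab& vocab) {
  MergeMap ranks;
  std::uint32_t rank = 0;
  for (const auto& [left, right] : merges) {
    const auto l = vocab.find(left);
    const auto r = vocab.find(right);
    const auto m = vocab.find(left + right);
    if (l != vocab.end() && r != vocab.end() && m != vocab.end()) {
      ranks.emplace(TokenPair{l->second, r->second},
                    MergeInfo{rank, m->second});
    }
    ++rank;
  }
  return ranks;
}

// Hypothetical helper: repeatedly merges the adjacent pair with the lowest
// rank until no adjacent pair has a merge rule.
std::vector<TokenId> applyMerges(std::vector<TokenId> ids,
                                 const MergeMap& ranks) {
  while (ids.size() > 1) {
    std::uint32_t best_rank = std::numeric_limits<std::uint32_t>::max();
    std::size_t best_pos = 0;
    const MergeInfo* best = nullptr;
    for (std::size_t i = 0; i + 1 < ids.size(); ++i) {
      const auto it = ranks.find(TokenPair{ids[i], ids[i + 1]});
      if (it != ranks.end() && it->second.rank < best_rank) {
        best_rank = it->second.rank;
        best_pos = i;
        best = &it->second;
      }
    }
    if (best == nullptr) {
      break;  // no applicable merge rule is left
    }
    assert(best_pos + 1 < ids.size());
    ids[best_pos] = best->merged;
    ids.erase(ids.begin() + static_cast<std::ptrdiff_t>(best_pos) + 1);
  }
  return ids;
}
```

Precomputing the map once at load time means each candidate pair costs one lookup during encoding, instead of a string concatenation plus vocabulary search.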