
Support gemma3 HF tokenizer.json #96

Open · wants to merge 5 commits into main
Conversation

@larryliu0820 (Contributor) commented on Jul 3, 2025

This PR beefs up the HF tokenizer.

HF tokenizer changes:

  • Added normalization module: introduced the `Normalizer` base class and its derived classes (`ReplaceNormalizer` and `SequenceNormalizer`) to support customizable string normalization. A factory class, `NormalizerConfig`, simplifies normalizer creation and configuration.
  • HF-specific BPE logic: added the `HFWord` structure and implemented HF-specific token-merging logic in `_byte_pair_merge`. Overrode `byte_pair_encode_` to integrate normalization and pre-tokenization.
  • Normalizer integration: integrated the normalization module into `HFTokenizer`, which now loads and applies normalizers from the JSON configuration during tokenizer initialization.
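The normalizer hierarchy described above could look roughly like the following sketch. The class names (`Normalizer`, `ReplaceNormalizer`, `SequenceNormalizer`) come from this PR; the method signatures and replacement logic here are assumptions for illustration, not the actual implementation.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the normalizer hierarchy; signatures are assumed.
class Normalizer {
 public:
  virtual ~Normalizer() = default;
  virtual std::string normalize(const std::string& input) const = 0;
};

// Replaces every occurrence of a pattern with a replacement string.
class ReplaceNormalizer : public Normalizer {
 public:
  ReplaceNormalizer(std::string pattern, std::string replacement)
      : pattern_(std::move(pattern)), replacement_(std::move(replacement)) {}

  std::string normalize(const std::string& input) const override {
    std::string out = input;
    size_t pos = 0;
    while ((pos = out.find(pattern_, pos)) != std::string::npos) {
      out.replace(pos, pattern_.size(), replacement_);
      pos += replacement_.size();
    }
    return out;
  }

 private:
  std::string pattern_;
  std::string replacement_;
};

// Applies a list of normalizers in order.
class SequenceNormalizer : public Normalizer {
 public:
  explicit SequenceNormalizer(std::vector<std::unique_ptr<Normalizer>> steps)
      : steps_(std::move(steps)) {}

  std::string normalize(const std::string& input) const override {
    std::string out = input;
    for (const auto& step : steps_) {
      out = step->normalize(out);
    }
    return out;
  }

 private:
  std::vector<std::unique_ptr<Normalizer>> steps_;
};
```

A `Replace` normalizer wrapped in a `Sequence` mirrors the shape of the `normalizer` entry in a HF `tokenizer.json`, which is presumably what `NormalizerConfig` parses.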

BPE Improvements:

  • Merge map and utility function: added a hash-based `MergeMap` type and the `buildMergeRanksMap` utility function to handle BPE merge rules efficiently, ensuring token merging respects merge ranks.
  • Virtual methods for BPE logic: added virtual methods (`_byte_pair_merge` and `byte_pair_encode_`) so derived classes can customize the BPE merging logic. Updated `BPETokenizerBase` to use pre-computed merge ranks for token merging.
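A minimal sketch of what a hash-based merge-ranks map could look like. `MergeMap` and `buildMergeRanksMap` are named in this PR, but the key layout here (a pair of 32-bit token ids packed into one 64-bit integer) and the function signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical key/value layout: a token-id pair packed into a 64-bit key
// mapped to the rank of the merge rule that produces it.
using MergeMap = std::unordered_map<uint64_t, uint32_t>;

inline uint64_t packPair(uint32_t left, uint32_t right) {
  return (static_cast<uint64_t>(left) << 32) | right;
}

// Builds the rank lookup from an ordered list of merge rules: earlier rules
// get lower ranks and are applied first during BPE merging.
inline MergeMap buildMergeRanksMap(
    const std::vector<std::pair<uint32_t, uint32_t>>& merges) {
  MergeMap ranks;
  ranks.reserve(merges.size());
  for (uint32_t rank = 0; rank < merges.size(); ++rank) {
    ranks.emplace(packPair(merges[rank].first, merges[rank].second), rank);
  }
  return ranks;
}
```

Pre-computing ranks this way turns each "can this adjacent pair be merged, and how early?" query into a single O(1) hash lookup instead of a scan over the merge list.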

Tested manually with the Gemma3 `tokenizer.json`:

```cpp
TEST(HFTokenizerTest, TestEncodePresidentQuestion) {
  HFTokenizer tokenizer;
  auto path = _get_resource_path("test_hf_tokenizer.json");
  auto error = tokenizer.load(path);
  EXPECT_EQ(error, Error::Ok);
  std::string text = "Who is the president of the US?";
  auto result = tokenizer.encode(text, /*bos*/ 1, /*eos*/ 0);
  EXPECT_TRUE(result.ok());
  std::vector<uint64_t> expected = {
      2, 15938, 563, 506, 6207, 529, 506, 2590, 236881};
  EXPECT_EQ(result.get(), expected);
}
```

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 3, 2025.
@facebook-github-bot (Contributor) commented:

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this in D77761574.
