
Fix StaccatoTokenizer Handling of Zero-Width Characters and BOM #3657

Merged: 3 commits into master, Mar 31, 2025

Conversation


@alanakbik (Collaborator) commented on Mar 30, 2025

This pull request addresses issues in the StaccatoTokenizer related to the handling of certain non-printing Unicode characters, specifically zero-width characters like variation selectors (e.g., U+FE0F) and the Byte Order Mark (BOM, U+FEFF).

Problem:

The previous implementation could incorrectly split tokens or generate spurious empty/single-character tokens when encountering:

  1. Zero-width variation selectors often used with emojis (e.g., ❤ + U+FE0F).
  2. The Byte Order Mark (U+FEFF), which is sometimes present at the beginning of files or incorrectly used as a zero-width space within text.

This resulted in unexpected tokenization output such as ['norm', '❤', '️', 'enjoyed'] instead of ['norm', '❤', 'enjoyed'] (the stray third token is the invisible variation selector), or in an empty token being appended to a sentence that contained a BOM.
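For illustration, a minimal reproduction of the corrected behavior (assuming StaccatoTokenizer is importable from flair.tokenization, which may vary by flair version):

```python
from flair.tokenization import StaccatoTokenizer

tokenizer = StaccatoTokenizer()

# "❤️" is two code points: U+2764 (heavy black heart) followed by
# U+FE0F (variation selector-16), a zero-width character.
text = "norm \u2764\uFE0F enjoyed"

# before this fix: ['norm', '❤', '️', 'enjoyed']  (stray variation-selector token)
# after this fix:  ['norm', '❤', 'enjoyed']
print(tokenizer.tokenize(text))
```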

Solution:

  1. Refactored Tokenization Logic: The tokenize method was changed from using re.split followed by whitespace splitting to using re.findall. A comprehensive regex pattern (self.token_pattern) was constructed to directly find and extract valid tokens based on defined categories: sequences of letters (across various scripts), sequences of digits, individual Kanji characters, and individual punctuation/symbol characters.
  2. Excluded Non-Printing Characters: The regex pattern used to identify punctuation/symbols (self.punctuation) was modified to explicitly exclude common zero-width characters (\uFE00-\uFE0F, \u200B-\u200D, \u2060-\u206F) and the BOM (\uFEFF) by adding them to the negated character set [^...] (see the sketch after this list).
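A condensed sketch of the findall-based approach (illustrative only; the names below mirror the description above, and the actual pattern in the PR covers more scripts and symbol categories):

```python
import re

# Zero-width characters and the BOM that the punctuation class must not match:
# variation selectors (U+FE00..U+FE0F), zero-width space/non-joiner/joiner
# (U+200B..U+200D), word joiner and other invisible formatting characters
# (U+2060..U+206F), and the byte order mark (U+FEFF).
EXCLUDED = "\uFE00-\uFE0F\u200B-\u200D\u2060-\u206F\uFEFF"

kanji = r"[\u4E00-\u9FFF]"            # individual CJK ideographs
letters = r"[^\W\d_\u4E00-\u9FFF]+"   # runs of letters in any script (minus the CJK block above)
digits = r"\d+"                       # runs of digits
punctuation = f"[^\\w\\s{EXCLUDED}]"  # single punctuation/symbol characters

# Alternation order matters: single CJK ideographs are tried before the
# general letter run so they are emitted one character at a time.
TOKEN_PATTERN = re.compile(f"{kanji}|{letters}|{digits}|{punctuation}")

def tokenize(text: str) -> list[str]:
    # re.findall extracts matching tokens directly; zero-width characters and
    # the BOM match none of the alternatives and are therefore silently
    # skipped instead of causing splits or spurious tokens.
    return TOKEN_PATTERN.findall(text)

print(tokenize("norm \u2764\uFE0F enjoyed"))  # ['norm', '❤', 'enjoyed']
```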

Impact:

  • The StaccatoTokenizer now correctly ignores these non-printing characters during tokenization, preventing spurious splits and unwanted tokens.
  • Tokenization results are more robust and align better with the intended behavior of preserving contiguous letter/number sequences while splitting off meaningful punctuation and symbols.
  • Fixes the issues observed with specific text examples containing emojis with variation selectors and trailing BOM characters (see the brief demo below).
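Similarly, a trailing BOM no longer yields a spurious token (same import assumption as in the demo above):

```python
from flair.tokenization import StaccatoTokenizer

tokenizer = StaccatoTokenizer()

# A BOM (U+FEFF) at the end of a sentence previously produced an
# empty extra token; it is now ignored entirely.
print(tokenizer.tokenize("a normal sentence\uFEFF"))
# expected: ['a', 'normal', 'sentence']
```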

Closes #3652 ([Bug]: StaccatoTokenizer splits zero-length characters)

@alanakbik merged commit 1e2a23a into master on Mar 31, 2025
2 checks passed