
Fix StaccatoTokenizer Handling of Zero-Width Characters and BOM #3657

Merged: 3 commits into master, Mar 31, 2025

Conversation


@alanakbik (Collaborator) commented on Mar 30, 2025

This pull request addresses issues in the StaccatoTokenizer related to the handling of certain non-printing Unicode characters, specifically zero-width characters like variation selectors (e.g., U+FE0F) and the Byte Order Mark (BOM, U+FEFF).

Problem:

The previous implementation could incorrectly split tokens or generate spurious empty/single-character tokens when encountering:

  1. Zero-width variation selectors often used with emojis (e.g., ❤ + U+FE0F).
  2. The Byte Order Mark (U+FEFF), which is sometimes present at the beginning of files or incorrectly used as a zero-width space within text.

This resulted in unexpected tokenization output such as ['norm', '❤', '️', 'enjoyed'] instead of ['norm', '❤', 'enjoyed'] (the stray third token is the invisible variation selector), or in an empty token being appended to a sentence that contained a BOM.
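For illustration, a minimal reproduction of the corrected behavior (assuming StaccatoTokenizer is importable from flair.tokenization, which may vary by flair version):

```python
from flair.tokenization import StaccatoTokenizer

tokenizer = StaccatoTokenizer()

# "❤️" is two code points: U+2764 (heavy black heart) followed by
# U+FE0F (variation selector-16), a zero-width character.
text = "norm \u2764\uFE0F enjoyed"

# before this fix: ['norm', '❤', '️', 'enjoyed']  (stray variation-selector token)
# after this fix:  ['norm', '❤', 'enjoyed']
print(tokenizer.tokenize(text))
```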

Solution:

  1. Refactored Tokenization Logic: The tokenize method was changed from using re.split followed by whitespace splitting to using re.findall. A comprehensive regex pattern (self.token_pattern) was constructed to directly find and extract valid tokens based on defined categories: sequences of letters (across various scripts), sequences of digits, individual Kanji characters, and individual punctuation/symbol characters.
  2. Excluded Non-Printing Characters: The regex pattern used to identify punctuation/symbols (self.punctuation) was modified to explicitly exclude common zero-width characters (\uFE00-\uFE0F, \u200B-\u200D, \u2060-\u206F) and the BOM (\uFEFF) by adding them to the negated character set [^...] (see the sketch after this list).
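A condensed sketch of the findall-based approach (illustrative only; the names below mirror the description above, and the actual pattern in the PR covers more scripts and symbol categories):

```python
import re

# Zero-width characters and the BOM that the punctuation class must not match:
# variation selectors (U+FE00..U+FE0F), zero-width space/non-joiner/joiner
# (U+200B..U+200D), word joiner and other invisible formatting characters
# (U+2060..U+206F), and the byte order mark (U+FEFF).
EXCLUDED = "\uFE00-\uFE0F\u200B-\u200D\u2060-\u206F\uFEFF"

kanji = r"[\u4E00-\u9FFF]"            # individual CJK ideographs
letters = r"[^\W\d_\u4E00-\u9FFF]+"   # runs of letters in any script (minus the CJK block above)
digits = r"\d+"                       # runs of digits
punctuation = f"[^\\w\\s{EXCLUDED}]"  # single punctuation/symbol characters

# Alternation order matters: single CJK ideographs are tried before the
# general letter run so they are emitted one character at a time.
TOKEN_PATTERN = re.compile(f"{kanji}|{letters}|{digits}|{punctuation}")

def tokenize(text: str) -> list[str]:
    # re.findall extracts matching tokens directly; zero-width characters and
    # the BOM match none of the alternatives and are therefore silently
    # skipped instead of causing splits or spurious tokens.
    return TOKEN_PATTERN.findall(text)

print(tokenize("norm \u2764\uFE0F enjoyed"))  # ['norm', '❤', 'enjoyed']
```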

Impact:

  • The StaccatoTokenizer now correctly ignores these non-printing characters during tokenization, preventing spurious splits and unwanted tokens.
  • Tokenization results are more robust and align better with the intended behavior of preserving contiguous letter/number sequences while splitting off meaningful punctuation and symbols.
  • Fixes the issues observed with specific text examples containing emojis with variation selectors and trailing BOM characters (see the brief demo below).
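Similarly, a trailing BOM no longer yields a spurious token (same import assumption as in the demo above):

```python
from flair.tokenization import StaccatoTokenizer

tokenizer = StaccatoTokenizer()

# A BOM (U+FEFF) at the end of a sentence previously produced an
# empty extra token; it is now ignored entirely.
print(tokenizer.tokenize("a normal sentence\uFEFF"))
# expected: ['a', 'normal', 'sentence']
```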

Closes #3652 ([Bug]: StaccatoTokenizer splits zero-length characters)

@alanakbik merged commit 1e2a23a into master on Mar 31, 2025
2 checks passed