Skip to content

Narrow URL-like protection so PostScript glyph names inside numeric ranges are cleaned #100

Description

@fffoivos

During the 2026-04-28 wave-3 tokenizer review, the C1 added vocab had one residual PostScript-name token: /hyphenminus.

Corpus spot checks showed contexts such as:

  • 4.600/hyphenminus5.600 KDa
  • 0.3/hyphenminus 100%
  • 50.000/hyphenminus250.000 κατοίκους

The cleaner strips standalone foo /hyphenminus bar, and correctly preserves URL/path contexts such as https://example.org/a/hyphenminus/b, but it currently preserves numeric slash ranges because URL_LIKE_TOKEN_REGEX treats tokens like 4.600/hyphenminus5.600 as URL-like host/path tokens.

Suggested later fix:

  • keep protecting real URLs, www. links, and host/path tokens with alphabetic host components
  • do not classify purely numeric dotted ranges followed by slash as URL-like
  • add tests for numeric /hyphenminus stripping and real URL preservation

This is a small cleanup follow-up, not a blocker for the completed tokenizer wave.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions