Regex matching and combining characters #471

Copying in https://github.com/rakudo/rakudo/issues/5799:

This actually was a behavior change (long ago) that caused a program of mine to stop working. I spoke to people on IRC about it back on 2023-03-17, but I realized it should be listed here anyway: if it isn't a bug, it at least needs to be documented as a difference from what might be expected of a regex matcher (specifically in comparison to Perl 5).

Consider Perl5 matching:
```perl
use feature "say";

say "o\N{COMBINING RING ABOVE}" =~ /o/; # -> 1 (matches)
say "o\N{COMBINING RING ABOVE}" =~ /.\N{COMBINING RING ABOVE}/; # -> 1 (matches)
```
We are matching "o̊" (which has no precomposed form in Unicode) against a bare "o" and against any character followed by the combining ring accent, and in both cases it matches. Look at the equivalent in Raku:
```raku
say "o\c[combining ring above]" ~~ /o/; # -> Nil (fails!)
say "o\c[combining ring above]" ~~ /.\c[combining ring above]/; # -> Nil (fails!)
# Demonstrate that \c[] works fine in the regexp
say "o\c[combining ring above]" ~~ /o\c[combining ring above]/; # -> ｢o̊｣ (matches)
```
The same strings and the same patterns, but this time both matches fail.

I know this is due to Raku's (unique/idiosyncratic) "NFG" matching, which matches by grapheme rather than by codepoint, and I know it could be argued that this is correct behavior. Even if it is, it needs to be documented as a difference from what other regex engines give (and maybe from what the Unicode standard prescribes?).
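
To make the grapheme-versus-codepoint distinction concrete, here is a minimal sketch (not taken from the original discussion) that inspects the same string both ways. It also tries the `:ignoremark` regex adverb, which as far as I understand compares base characters only; that would cover the bare-`o` case but not the "any character followed by a specific mark" case.

```raku
my $s = "o\c[COMBINING RING ABOVE]";

say $s.chars;   # 1  (one grapheme, which is what the regex engine steps over)
say $s.codes;   # 2  (two codepoints underneath)
say $s.ords;    # (111 778), i.e. U+006F followed by U+030A

# :ignoremark compares only the base character of each grapheme
say so $s ~~ m:ignoremark/o/;   # True
# Without it, the grapheme o+ring is not equal to the grapheme o
say so $s ~~ /o/;               # False
```

This only illustrates why the plain patterns fail; it does not give `.` codepoint semantics.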

This needs at least to be documented; it is decidedly something peculiar to Raku and not what people would expect coming from elsewhere.

(the rest of this post is pointing out problems with the current behavior; you don't have to read it. The main point is if we are NOT going to change this, the behavior should be documented.)

It may be objected that this is a bizarre case and people shouldn't want to search like this (which is no excuse for not doing it right), or that this is really what people should/would expect for this kind of searching (which IS an excuse, but still needs to be mentioned). That is, "o-with-ring" ought to be considered a distinct letter from "o", and you wouldn't be searching for a particular diacritical. But that isn't really true. It makes sense for Latin letters, etc., but this bit me in a program I had that analyzed cantillations (combining characters, Unicode points U+0591 - U+05AE) in the Hebrew Bible. Cantillations are combining characters, but semantically they are essentially punctuation marks, and wanting to search for `/<letter>*\c[hebrew accent tipeha]<letter>*/` is just as sensible as wanting to search for `/<letter>*,/` to find the last word in comma-delimited clauses in another language (it's actually more sensible than that, really, because the cantillations are more structured).

And this actually is already glossing over the problem of Hebrew vowels! Letters בֶ and ב really should be considered the same letter. So should א and אַ (in Hebrew, at least, not in Yiddish). אַ is a precomposed character, but it's a Unicode composition exclusion, so Raku normalizes it to two codepoints as it should, so at least that doesn't complicate things.
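
For what it's worth, a workaround that sidesteps the regex engine is to look at the codepoints directly. The following is only a sketch under my own assumptions: `has-mark` is a hypothetical helper name, the sample text is built from escapes purely for illustration, and it tests for the presence of a mark in a word rather than reproducing the full `<letter>*…<letter>*` pattern.

```raku
constant TIPEHA = 0x0596;   # U+0596 HEBREW ACCENT TIPEHA

# Hypothetical helper: True if $mark occurs among the word's NFD codepoints
sub has-mark(Str $word, Int $mark --> Bool) {
    so $word.NFD.list.grep($mark)
}

# Two illustrative "words": alef, bet+tipeha; then gimel, dalet (no accent)
my $text = "\x[05D0]\x[05D1]\x[0596] \x[05D2]\x[05D3]";

my @hits = $text.words.grep({ has-mark($_, TIPEHA) });
say @hits.elems;   # 1
```

This works on lists of integers rather than strings because turning the codepoints back into a `Str` would recombine them into graphemes.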

-------
It was suggested (by @timo) that an issue be created here to track this and related notions.  Apparently some thought was in fact given to the problem in speculations for Perl6.  It was proposed that there could be modifiers on the matching to control the level and type of Unicode support, specifying whether `.` meant a byte, a codepoint, a grapheme, or something language-dependent, etc.  See @timo's comment at https://github.com/rakudo/rakudo/issues/5799
