[lex.phases] Do not recognize UCNs in d-char sequences by AlisdairM · Pull Request #8691 · cplusplus/draft

AlisdairM · 2025-12-30T14:43:42Z

We do not recognize universal-character-names when lexing character sequences for literals, and that should include d-char-sequences for raw string literals.

Note that a d-char-sequence is limited to a subset of the basic character set that excludes the \ that would begin a universal-character-name. If we did recognize a UCN there, the \ would be consumed as part of the universal-character-name, and the resulting character would still be ill-formed: either it is not a member of the basic character set, or the UCN denotes an element of the basic character set. Rather than create such an obscure error condition, it is simpler not to recognize UCNs in d-char-sequences, just as we do not for any other character sequence, and to diagnose a more consistent error.
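As a rough illustration of the behavior under discussion, the sketch below contrasts an ordinary string literal, where a universal-character-name denotes a single character, with a raw string literal, where the same spelling stays as six separate characters. The names `kCooked` and `kRaw` are made up for this example, and the commented-out line only indicates the d-char-sequence case, which is expected to be ill-formed under either formulation of the rule.

```cpp
// Illustrative sketch only; kCooked and kRaw are hypothetical names.

// In an ordinary (non-raw) literal, the UCN denotes one character, U+00E9.
constexpr char32_t kCooked[] = U"\u00E9";
static_assert(sizeof(kCooked) / sizeof(char32_t) == 2,
              "one character plus the terminator");
static_assert(kCooked[0] == 0xE9, "the UCN denotes U+00E9");

// In the r-char-sequence of a raw literal nothing is interpreted:
// the six characters are kept exactly as written.
constexpr char kRaw[] = R"(\u00E9)";
static_assert(sizeof(kRaw) == 7, "six characters plus the terminator");
static_assert(kRaw[0] == '\\' && kRaw[1] == 'u',
              "the backslash is kept as written");

// A delimiter (d-char-sequence) cannot contain a backslash, so a line like
//   const char* bad = R"\u00E9(text)\u00E9";
// is expected to be rejected whichever rule is used to diagnose it.
```

Because every check is a `static_assert`, compiling the snippet as C++11 or later is enough to exercise it.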

jensmaurer · 2025-12-30T18:31:35Z

Since the lexing grammar of neither a d-char-sequence nor an r-char-sequence recognizes UCNs (same for h-char-sequence and q-char-sequence), I think the proper approach is to strike all those redundant mentions instead of adding more to the list. (Having a note that highlights lexing constructs that are oblivious to UCNs is fine, though.) I thought I saw a request in that direction fly by somewhere recently, but I can't find it right now.

AlisdairM · 2025-12-30T20:28:02Z

I believe that if we struck these mentions, then raw string literals would transform UCNs, as this is the wording that exempts the UCN transformation. The reason we have universal-character-name in the c-char and s-char grammar is to revert this rule to not do the transform.

In principle, we could remove c-char-sequence and s-char-sequence from this list, and also strike universal-character-name from their respective grammar, as the phase 3 rules would transform the UCNs to elements of the translation character set before the token grammar is addressed. I am not recommending that change, as we have a consistent treatment of character sequences here, with only d-char-sequence and n-char-sequence omitted. As noted in my commit message, there is no normative impact adding d-char-sequence here, as we are just simplifying the nature of the rule making such usage ill-formed.

In the n-char-sequence cases, I believe the current rules would allow for embedding universal-character-names inside an n-char-sequence and that could be valid if said UCN is one that is lexed as a c-char or s-char, e.g., "\N{LATIN SMALL LETTER \N{LATIN SMALL LETTER A}}". Hence, addressing the n-char-sequence case would strictly demand a Core issue.

jensmaurer · 2025-12-30T21:25:09Z

The reason we have universal-character-name separately in c-char and s-char is that we want to delay their interpretation until we initialize the string literal object in [lex.string] p10, even though p8 neuters the most obvious case where that could be exploited. Ah, it seems we can use a UCN to encode a new-line, but we can't have a literal new-line character in a string (according to the grammar for basic-s-char). If we replaced a suitable UCN with a new-line in phase 3 for an s-char, we would produce an ill-formed s-char.
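The new-line case mentioned above can be sketched as follows. This is only an illustration, assuming C++11 or later and a compiler that accepts a UCN designating a control character inside a string literal; `kViaUcn` and `kViaEscape` are hypothetical names.

```cpp
// Illustrative sketch only; kViaUcn and kViaEscape are hypothetical names.

// A UCN inside an s-char-sequence may designate a control character,
// so U+000A can be spelled this way in a non-raw string literal.
constexpr char32_t kViaUcn[] = U"a\u000Ab";

// The conventional spelling of the same content uses a simple escape.
constexpr char32_t kViaEscape[] = U"a\nb";

static_assert(sizeof(kViaUcn) == sizeof(kViaEscape), "same length");
static_assert(kViaUcn[1] == U'\n', "the UCN yields a new-line");
static_assert(kViaUcn[1] == kViaEscape[1], "both spellings agree");

// By contrast, a physical line break typed between the quotes of a
// non-raw literal does not match basic-s-char, so it is not a valid
// spelling of the same string.
```

The point of the comparison is that the UCN spelling only works because its interpretation is deferred until the literal is evaluated, which matches the motivation given in the comment.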

The rule in lex.phases p3 is needed, though, because nothing otherwise matches and replaces UCNs in plain source code (outside of literals). That said, maybe we should expressly admit UCNs in the (lexing) grammar for identifier and use a simpler "not s-char, not c-char" rule in lex.phases p3, akin to lex.universal.char p1.
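To make the "plain source code" case concrete, here is a small sketch of a UCN used in an identifier, which is the situation where replacement outside of literals matters. The identifier is a hypothetical name, and the second spelling assumes a UTF-8 source file and a compiler that accepts such characters directly in identifiers.

```cpp
// Illustrative sketch only; the identifier below is a hypothetical name.

// Outside of literals, a UCN is replaced by the character it designates,
// so this declares a variable whose name ends in U+00E9.
constexpr int caf\u00E9 = 42;

// The same identifier with the character typed directly
// (assumes a UTF-8 source file and a compiler that accepts it).
static_assert(café == 42, "both spellings name the same variable");
```

If the two spellings were not treated as the same identifier, the `static_assert` would fail to compile with an undeclared-name error rather than a false assertion.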

AlisdairM · 2025-12-31T15:34:02Z

It sounds like any wording change here will need careful Core review. Is this worth opening a Core issue over, or writing a paper to precisely describe the concerns and wording changes? Or would we prefer to treat any work here as something to evolve in this PR, and send to Core only when we have agreed and drafted a preferred direction?

I have turned this PR into a draft until we are in agreement that there is something to update.

AlisdairM marked this pull request as draft December 31, 2025 15:30

AlisdairM wants to merge 1 commit into cplusplus:main from AlisdairM:do_not_expand_universal_character_names_for_d_char
