[lex.phases] Do not recognize UCNs in d-char sequences#8691
AlisdairM wants to merge 1 commit into cplusplus:main
We do not recognize universal-character-names when lexing character sequences for literals, and that should include d-char-sequences for raw string literals. Note that d-char-sequences are limited to a subset of the basic character set that excludes the \ that would mark the start of a universal-character-name, but if we recognized the UCN then that \ would be consumed when recognizing the universal-character-name, and the transformed character would then be ill-formed as either not a member of the basic character set, or as a UCN denoting an element of the basic character set. Rather than creating such an obscure error condition, it is simpler to not recognize UCNs for d-char-sequences just as we do not for any other character sequence and diagnose a more consistent error.
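As a rough illustration of the behavior this description relies on (not part of the proposed wording), the sketch below contrasts a UCN in an ordinary string literal with the same spelling in the r-char-sequence of a raw string literal. A UCN spelled in the d-char-sequence itself is ill-formed either way, so it cannot appear in compiling code. The helper `code_units` and the constant names are hypothetical, introduced only for this example, and a compiler implementing C++11 or later is assumed.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a UTF-32 string literal,
// not counting the terminating null character.
template <std::size_t N>
constexpr std::size_t code_units(const char32_t (&)[N]) { return N - 1; }

// Ordinary literal: the UCN in the s-char-sequence denotes one code point.
constexpr std::size_t escaped_length = code_units(U"\u00E9");

// Raw literal: the r-char-sequence is taken verbatim, so the backslash,
// the letter u, and the four hexadecimal digits are stored as six characters.
constexpr std::size_t raw_length = code_units(UR"(\u00E9)");

static_assert(escaped_length == 1, "UCN is a single code point in an s-char-sequence");
static_assert(U"\u00E9"[0] == 0xE9, "the UCN denotes U+00E9");
static_assert(raw_length == 6, "no UCN is recognized in an r-char-sequence");
static_assert(UR"(\u00E9)"[0] == U'\\', "the raw literal starts with a backslash");
```

The checks are compile-time only, so the translation unit compiling at all is the observable result.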
Since the lexing grammar of neither a d-char-sequence nor an r-char-sequence recognizes UCNs (same for h-char-sequence and q-char-sequence), I think the proper approach is to strike all those redundant mentions instead of adding more to the list. (Having a note that highlights lexing constructs that are oblivious to UCNs is fine, though.) I thought I saw a request in that direction fly by somewhere recently, but I can't find it right now.
I believe that if we struck these mentions, then raw string literals would transform UCNs, as this is the wording that exempts them from the UCN transformation. The reason we have universal-character-name in the c-char and s-char grammar is to revert this rule so that the transform is not done. In principle, we could remove c-char-sequence and s-char-sequence from this list, and also strike universal-character-name from their respective grammars, as the phase 3 rules would transform the UCNs to elements of the translation character set before the token grammar is addressed. I am not recommending that change, as we have a consistent treatment of character sequences here, with only d-char-sequence and n-char-sequence omitted. As noted in my commit message, there is no normative impact from adding d-char-sequence here, as we are just simplifying the nature of the rule that makes such usage ill-formed. In the n-char-sequence case, I believe the current rules would allow embedding universal-character-names inside an n-char-sequence, and that could be valid if said UCN is one that is lexed as a c-char or s-char, e.g.,
The reason why we have universal-character-name separately in c-char and s-char is that we want to delay their interpretation until we initialize the string literal object in [lex.string] p10, even though p8 neuters the most obvious case where that could be exploited. Ah, it seems we can use a UCN to encode a new-line, but we can't have a literal new-line character in a string (according to the grammar for basic-s-char). If we replaced a suitable UCN with a new-line in phase 3 for an s-char, we would produce an ill-formed s-char. The rule in [lex.phases] p3 is needed, though, because nothing otherwise matches and replaces UCNs in plain source code (outside of literals). That said, maybe we should expressly admit UCNs in the (lexing) grammar for identifier and use a simpler "not s-char, not c-char" rule in [lex.phases] p3, akin to [lex.universal.char] p1.
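The new-line observation above can be sketched as follows. This is a minimal example that assumes a C++11 or later compiler accepting a UCN for U+000A inside a string literal; the constant names are hypothetical and exist only for this illustration. A literal line break between the quotes would not match basic-s-char, whereas the UCN spelling is kept as a universal-character-name until the string object is initialized.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a UTF-32 string literal,
// not counting the terminating null character.
template <std::size_t N>
constexpr std::size_t literal_length(const char32_t (&)[N]) { return N - 1; }

// The s-char-sequence holds a UCN naming U+000A (line feed). Replacing it
// with an actual new-line before lexing would break the literal, so the UCN
// has to survive until the literal's value is computed.
constexpr char32_t via_ucn = U"\u000A"[0];

// The same character spelled with a simple escape sequence.
constexpr char32_t via_escape = U"\n"[0];

static_assert(literal_length(U"\u000A") == 1, "the UCN denotes a single code point");
static_assert(via_ucn == 0x0A, "the UCN denotes U+000A");
static_assert(via_ucn == via_escape, "same value as the simple escape for new-line");
```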
It sounds like any wording change here will need careful Core review. Is this worth opening a Core issue over, or writing a paper to precisely describe the concerns and wording changes? Or would we prefer to treat any work here as something to evolve in this PR, and send to Core only when we have agreed and drafted a preferred direction? I have turned this PR into a draft until we are in agreement that there is something to update.