Allow non-ascii identifiers #4151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Why was the table generated with 5.2.0 rather than the latest version? Could you check whether the result differs for the purposes of this proposal?
Thanks for pointing that out. I misunderstood the versioning scheme. I updated my script to use data version 12.1.0, and the resulting table is identical. I guess that says something about the stability of this proposal.
I do think this is the best way to get Unicode identifiers in Zig, but I'm still afraid of bugs caused by differing normalisation. Perhaps we could have
@daurnimator That seems like a decent solution to me. It wouldn't be future-proof, due to the addition of new characters over time, but since this is just a warning and not an error, no compatibility issues are introduced by that.
U+00A0 is NO-BREAK SPACE:

```zig
const hello var = "hello"; // "hello var" is one identifier; the space is U+00A0
std.debug.warn("{}\n", .{hello var});
```

U+00B2 is SUPERSCRIPT TWO:

```zig
const ²name = "joe";
std.debug.warn("{}\n", .{²name});
```

If anyone is interested in a comparison, the Swift language spec lays out its identifier codepoints quite concisely.
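The confusion in the first example can be demonstrated outside Zig as well. This small Python sketch (illustrative only, not part of the proposal) shows that a name containing U+00A0 is visually indistinguishable from one containing an ordinary space:

```python
# U+00A0 (NO-BREAK SPACE) renders identically to an ordinary space, so
# "hello\u00a0var" and "hello var" look the same but are different strings.
nbsp_name = "hello\u00a0var"    # what the source file actually contains
ascii_name = "hello var"        # what a reader assumes they are seeing

print(nbsp_name == ascii_name)  # False: distinct identifiers
print(len(nbsp_name) == len(ascii_name))  # True: same length, same apparent glyphs
```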
I'm quite happy with this proposal. I think it's the simplest possible proposal that's still well-rounded. There are more complex solutions, and they're good too, but this is a very good compromise between features and utility.
The script looks for "Pattern_White_Space" but it should probably also exclude "White_Space" |
I don't claim to be an expert on Unicode, but unicode.org does, and they (indirectly) suggest allowing U+00A0 (NO-BREAK SPACE) in identifiers. To quote the experts:
If we wanted the compiler to try to prevent abuse in identifier naming, we would need something far more sophisticated than this proposal. See the discussion in #3947. This proposal requires that programmers try to be responsible with identifier naming. We would be opening the door to nasty code obfuscation for programmers who set out to write obfuscated code. Are we worried about that? I don't think we need to worry about U+00A0 (NO-BREAK SPACE). Maybe some language out there would really like to use it for its identifiers. It has no place in an English codebase, but this proposal is not just for English speakers.
This proposal deserves a more accurate title, like: Allow (only) UTF-8/UTF-16/...-encoded identifiers? -- If one doesn't care about U+00A0, then why not also ignore U+0085 (NEL), U+2028 (LS), and U+2029 (PS)? Still complicated either way. When would you type those invisible characters in source code, other than in raw strings?
I can't imagine the NBSP character being useful for identifiers in any language. It would be terribly confusing given that it's visually identical to a normal space except for its line breaking rules. Even in languages where it is used, it's mostly just to keep punctuation on the same line and stuff like that. An underscore should work just fine.
I think Zig already allows all sorts of horribly obfuscated code, but this can't be considered a bug unless it makes it easy to do so accidentally. You could use
@iology see #663 regarding UTF-16 and the characters you listed.
Thanks. Now I see
So if whitespace like U+00A0 (NBSP) and U+200B (ZERO WIDTH SPACE) is allowed in identifier names, then I question restriction (2). Going further, I think all valid code points outside 00-7F could be allowed, provided they never affect the functionality of compilers and editors. Of course, this opinion only holds if only minimal Unicode handling is required.
Another observation: U+1680 (OGHAM SPACE MARK) is rendered as a dash-like glyph in the font Noto Sans Ogham (does that matter?).
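For reference, Python's unicodedata module (used here only to inspect the character, not as part of the proposal) confirms that U+1680 is classified as a space separator despite its dash-like rendering in some fonts:

```python
import unicodedata

# U+1680 has general category "Zs" (space separator), even though fonts such
# as Noto Sans Ogham draw it as a dash-like mark.
print(unicodedata.name("\u1680"))      # OGHAM SPACE MARK
print(unicodedata.category("\u1680"))  # Zs
```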
@thejoshwolfe The problem statement - From a comment Andrew made on the Unicode identifier thread:
The proposed solution - From Unicode 12.0.0 standard, annex 31: Immutable Identifier Syntax:
You summed this up pretty well, but maybe directly quoting that part of the Unicode document you linked could be valuable too? Not sure how much more discussion needs to happen, though, so this could just be make-work.
I read more carefully... (highlighted by me) https://unicode.org/reports/tr31/#Immutable_Identifier_Syntax
and (as you can notice, the generated list includes almost all non-ASCII code points)
This means future versions might introduce more whitespace or useful letters through previously unassigned code points.
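The point about unassigned code points can be checked with Python's unicodedata module (shown purely as an illustration): unassigned code points carry the general category "Cn", and a future Unicode version may assign them as letters or whitespace.

```python
import unicodedata

# U+0378 is unassigned in current Unicode versions; category "Cn" marks it.
# A future version could assign it, changing how a version-tracking lexer
# would classify it.
print(unicodedata.category("\u0378"))  # Cn (unassigned)
print(unicodedata.category("A"))       # Lu (uppercase letter)
```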
I suggest a variant of the "naive" approach mentioned earlier, involving a whitelist roughly of this form:

```zig
// *** unicode_whitelist.zig ***
// *** unicode version xxxxxx ***

// This struct definition would be a builtin type.
const CodePointEntry = struct {
    codePointStr: []const u8, // could also be a raw number, or the UTF-8 encoding directly
    symbol: []const u8,
    asciiName: []const u8,
};

const greekSymbols = [_]CodePointEntry{
    .{ .codePointStr = "U+0370", .symbol = "Ͱ", .asciiName = "heta" },
    .{ .codePointStr = "U+03A9", .symbol = "Ω", .asciiName = "omega" },
    // ...and so on
};

// This table is consulted by the tokenizer when accepting or rejecting
// symbols used in identifiers. All entries within the table must be unique,
// with no duplication in any of the individual fields.
const whiteListCodepoints = greekSymbols ++ cyrillicSymbols ++ ...;
```
This file would be imported by zig build, and the compiler would consult the whitelist table for any non-ASCII UTF-8 symbol encountered in an identifier. If the symbol is included in the table, compilation continues; otherwise it is a compile error. This way, the tokenization of .zig files would be independent of Unicode, and it would be the responsibility of the user to use a sensible subset of Unicode for the best trade-off between usability, readability, and the pitfalls of Unicode identifiers. It would also be easy for the community to provide curated whitelists, or for larger projects to have project-specific whitelists catered to their use case.
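As a sketch of the lookup just described (in Python rather than Zig, with hypothetical names; this is not real compiler code), the per-character check in the tokenizer might look like:

```python
# Hypothetical whitelist built from a project's unicode_whitelist table.
WHITELIST = {"Ͱ", "Ω"}

def identifier_char_ok(ch: str) -> bool:
    """Accept ASCII identifier characters unconditionally; accept non-ASCII
    characters only if the project's whitelist includes them."""
    if ord(ch) < 0x80:
        return ch.isalnum() or ch == "_"  # ASCII rules are unchanged
    return ch in WHITELIST

print(identifier_char_ok("Ω"))  # True: whitelisted
print(identifier_char_ok("λ"))  # False: not in this project's whitelist
```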
I don't get why this proposal suggests a whitelist when, in the other discussion, most people seemed to agree that a simple blacklist is the way to go. Just as suggested here...
Seems simple enough, and it's the programmer's responsibility not to do anything weird like using a name made entirely of non-breaking spaces. Maintaining a whitelist seems like an unnecessary burden. Just disallow what you need for other uses and allow the rest. Simple enough, right?
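The blacklist idea can be sketched the same way (again a hypothetical Python illustration, not Zig's actual lexer): list only the characters the language needs for its own syntax and whitespace, and accept everything else in identifiers.

```python
# Characters reserved for syntax or whitespace are rejected; everything else,
# including all other non-ASCII code points, is allowed in identifiers.
BLACKLIST = set(" \t\r\n" + "+-*/%=<>!&|^~?(){}[];:,.'\"\\@#")

def blacklist_char_ok(ch: str) -> bool:
    return ch not in BLACKLIST

print(blacklist_char_ok("Ω"))  # True: not reserved by the syntax
print(blacklist_char_ok("{"))  # False: reserved for syntax
```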
This thread is talking about the core language specification and what sorts of constraints it imposes. The allowed character list sounds more like a feature request for the specific implementation of the language. Also, it sounds more like the job of a linter than something that really should be in the core language. I am not sure what problems it solves, or sure that it solves them well. |
Maybe it wasn't clear, my suggestion is essentially:
Defining the valid subset of UTF-8-encoded Unicode symbols in the core language specification is, of course, also a perfectly fine approach.
@BarabasGitHub Regarding whitelist vs blacklist: those are equivalent. The range of all possible unicode codepoints is
@user00e00 I don't like the idea of a project defining something so fundamental to the language as lexer rules. You'll inevitably get into weird situations where a source file builds in one context but not in another. Then part of the interface for dealing with a code file is a dependency on allowing certain characters. Do we want to make that API formal with some kind of comment syntax? This is a whole can of worms that waves big red flags that this is a bad idea.
I'm perfectly happy with a linter producing linter errors when identifiers fit or deviate from project-defined patterns. That's within the domain of a linter. Typically, a project does not run its linters on 3rd-party dependency code, because that code was written in a different project context. However, the compiler and its lexer must process projects and all their transitive dependencies. It is wrong for a project to impose subjective restrictions on 3rd-party code. (And I know what you're thinking now: have the lexer rules be project-specific and let each dependency project define its own lexer rules; sure, but then you're effectively just describing a linter, which I'm arguing is the proper solution to this situation.)
The difference is between listing the few characters that you need for Zig syntax and listing the rest of the Unicode space. I see it as either: "These characters are part of Zig syntax and keywords, so you can't use them as/in identifiers, but everything else is fine." or: "We've audited all characters of all languages and we've decided that these characters are not suitable in identifiers." Even if you're not literally doing that, it's like listing the whole dictionary minus five words instead of just those five words. Anyway, I agree with you that a linter is a better solution than having this be part of compilation.
I don't think whitelists or blacklists are a good idea. Unicode might be a mess, but it's the least-bad option that pretty much everybody has been able to accept as a standard. The subset of Unicode included in this issue doesn't have the same status. Maybe it will in the future, but that seems unlikely. I also don't love the idea of Unicode identifiers that aren't clearly marked as such, e.g. using the @"..." syntax. Is it really a good idea to allow identifiers that look exactly like other tokens? If so, then why have blacklists? And if not, then why allow code-points that haven't been assigned yet? |
Thank you for the discussion, all. I am closing this in favor of the status quo. The
Does this mean Zig is never going to add first-class support for developers whose first language is not English? I mean, would you be happy to have to write:
Rust and C++ both support non-English developers. Is the Zig core team fully US-based? I suspect English speakers don't feel the friction this type of issue creates for language adoption in contexts where people are using C, C++, Rust, etc. People are not going to want to pick up a new language that forces them to stop using their own language while they are coding. Can we get this issue reopened for reconsideration?
Here is a concrete proposal for #3947 (comment) .
Background
All Zig code is always encoded in UTF-8, and this proposal does not change that.
This proposal does not change the interpretation of ASCII codepoints anywhere in Zig code.
The only non-ascii codepoints with special handling in Zig before this proposal are: U+0085 (NEL), U+2028 (LS), U+2029 (PS). This proposal does not change the interpretation of these codepoints; they are not allowed in identifiers.
Proposal
Zig's current lexical rule for identifiers is:

[A-Za-z_] [A-Za-z0-9_]*

This proposal adds the codepoints listed in the table below to both the ranges [A-Za-z_] and [A-Za-z0-9_] in the above rule.
Explanation
This set of codepoints was determined by following the recommendation here: https://unicode.org/reports/tr31/#Immutable_Identifier_Syntax . Specifically, this is the set of all characters except characters meeting any of these criteria:
Unicode Character Data version 5.2.0 was used to generate this list, but this list can remain stable forever despite future versions of the Unicode Character Data, as per the recommendation and discussion in tr31 linked above. (EDIT: @daurnimator pointed out that this is many major versions behind, but even using the latest version, 12.1.0, the list of codepoints in this proposal is identical.)
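A minimal sketch of the generation approach (assuming a hard-coded excerpt of the Pattern_White_Space ranges from the UCD's PropList.txt; the real generator, tools/gen_id_char_table.py, parses the full Unicode Character Database and applies all of the criteria above):

```python
# Pattern_White_Space is an immutable property: the Unicode stability policy
# guarantees these ranges never change, which is what makes the generated
# table stable across Unicode versions.
PATTERN_WHITE_SPACE = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
]

def in_ranges(cp: int, ranges) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges)

# The real script also excludes Pattern_Syntax codepoints and the other
# criteria; this sketch shows only the Pattern_White_Space exclusion.
print(in_ranges(0x2028, PATTERN_WHITE_SPACE))  # True: LS is excluded
print(in_ranges(0x00A0, PATTERN_WHITE_SPACE))  # False: NBSP is not Pattern_White_Space
```

Note that U+00A0 (NBSP) carries the White_Space property but not Pattern_White_Space, which is exactly why it survives into the generated table, as discussed above.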
The code I used to generate the above set of codepoints can be found here: https://github.com/ziglang/zig/blob/6f8e2fad94fde6c9a8c4ca52d964d0616690ee4c/tools/gen_id_char_table.py