Validity of a char value that is a surrogate #513
Comments
This is to enable niches so, |
And while rustc currently doesn't take advantage of surrogate code point niches for |
That would be circular reasoning, since the question here is whether rustc should. |
I don't see why it shouldn't tbh (or any other compiler, for that matter). |
Note that there is plenty of space past the end of the Unicode range for many niches. I don't think there's much need for the compiler using the surrogate range? |
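To illustrate the point about spare bit patterns, here is a minimal sketch that checks type sizes. It assumes the layout chosen by current rustc, which is observed behavior rather than something the language guarantees; the helper name `sizes` is made up for the example.

```rust
use std::mem::size_of;

/// Hypothetical helper: sizes of `char`, `Option<char>` and `Option<u32>`.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<char>(),
        size_of::<Option<char>>(),
        size_of::<Option<u32>>(),
    )
}

fn main() {
    let (ch, opt_ch, opt_u32) = sizes();
    // `char` is 4 bytes wide, but not every 32-bit pattern is a valid `char`.
    assert_eq!(ch, 4);
    // On current rustc, `None` is stored in one of the unused bit patterns,
    // so the `Option` adds no extra space.
    assert_eq!(opt_ch, 4);
    // `u32` has no unused bit patterns, so `Option<u32>` needs a separate tag.
    assert_eq!(opt_u32, 8);
}
```

The size check alone does not show which invalid pattern the compiler picks for `None`; it only shows that at least one is being used.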
Note that there are two "ranges" of invalid |
FWIW this has been the case since forever. That doesn't answer the question though. At this point, changing the safety invariant of |
We'd also have to at least add "you must not use it as the scrutinee in a(n exhaustive(?))

    let c: char = transmute(0xD800_u32); // current UB
    match c {
        '\0'..='\u{d7ff}' => {},
        '\u{e000}'.. => {},
        _ => {}, // is this arm taken, or is this match UB?
    }
    // this match is definitely UB
    match c {
        '\0'..='\u{d7ff}' => {},
        '\u{e000}'.. => {},
    }

(Not to mention other places where exhaustive patterns are allowed, e.g. |
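As a safe counterpart to the snippet above, the following sketch shows that current rustc accepts a `match` on `char` with only the two non-surrogate ranges and no wildcard arm, which is why a surrogate value reaching such a match is a problem. The function name `classify` and the returned strings are made up for the example.

```rust
/// Hypothetical helper: reports which side of the surrogate gap a `char` is on.
fn classify(c: char) -> &'static str {
    // No `_` arm: exhaustiveness checking accepts these two ranges as
    // covering every valid `char`, because surrogates are not valid values.
    match c {
        '\0'..='\u{d7ff}' => "below the surrogate range",
        '\u{e000}'..='\u{10ffff}' => "above the surrogate range",
    }
}

fn main() {
    assert_eq!(classify('a'), "below the surrogate range");
    assert_eq!(classify('\u{d7ff}'), "below the surrogate range");
    assert_eq!(classify('\u{e000}'), "above the surrogate range");
    assert_eq!(classify('\u{10ffff}'), "above the surrogate range");
}
```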
All else being equal, simpler validity conditions are preferable. While the validity specification of

So the question imo should be to ask what the benefit of either choice is. And there I believe having the valid range exclude the surrogate range wins out, since code that wants access to the surrogate can just use

There's also a further simplicity benefit to primitives having equivalent safety and validity. References definitely break that equivalence (and this remains somewhat contentious) and pointers are likely to (is the vtable matching a safety or a validity thing), but |
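The type suggested in the comment above for code that needs surrogates is elided in this copy of the thread. As one possible reading, the sketch below assumes plain integers: it keeps UTF-16 code units as `u16` and converts to `char` only through the checked standard-library API `char::decode_utf16`. The helper name `decode_units` is made up for the example.

```rust
/// Hypothetical helper: decodes UTF-16 code units, reporting each unpaired
/// surrogate as its raw `u16` value instead of ever storing it in a `char`.
fn decode_units(units: &[u16]) -> Vec<Result<char, u16>> {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.map_err(|e| e.unpaired_surrogate()))
        .collect()
}

fn main() {
    // A valid surrogate pair decodes to a single scalar value (U+1F600).
    assert_eq!(decode_units(&[0xD83D, 0xDE00]), vec![Ok('\u{1F600}')]);
    // A lone surrogate stays an integer; no invalid `char` is produced.
    assert_eq!(decode_units(&[0x0041, 0xD800]), vec![Ok('A'), Err(0xD800)]);
}
```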
Thanks everyone for all the answers. First, I wanted to clarify my question, and I'll use @zachs18's example for that. My question was whether UB

    let c: char = unsafe { transmute(0xD800_u32) }; // UB due to producing invalid value
    match c { // UB due to using unsafe value
        '\0'..='\u{d7ff}' => {},
        '\u{e000}'.. => {},
        _ => {}, // This is unreachable no matter what.
    }

If I understood this correctly, in practice, there's no immediate UB triggered today by producing a char value that is a surrogate point. That said, I do agree that having equivalent safety and validity requirements for primitive types is much simpler and easier to reason about. So maybe that's enough to close this case? Thanks again! |
As far as I know, the current situation is that this is immediate UB at the construction of |
You can also easily use Miri to confirm this. |
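A minimal sketch of such a check, under some assumptions: the transmute is left commented out so the program itself stays free of UB, and the expectation that Miri flags that line comes from the comments in this thread rather than from output reproduced here. The checked constructor `char::from_u32` is the safe standard-library alternative; the helper name `checked` is made up for the example.

```rust
/// Hypothetical helper: the safe, checked way to build a `char` from a `u32`.
fn checked(raw: u32) -> Option<char> {
    char::from_u32(raw)
}

fn main() {
    // Surrogates and values above U+10FFFF are rejected.
    assert_eq!(checked(0xD800), None);
    assert_eq!(checked(0xDFFF), None);
    assert_eq!(checked(0x11_0000), None);
    // The neighbours of the surrogate range are accepted.
    assert_eq!(checked(0xD7FF), Some('\u{d7ff}'));
    assert_eq!(checked(0xE000), Some('\u{e000}'));

    // Uncommenting the next line and running under Miri is the experiment
    // described above; per this thread it is reported as UB at this point.
    // let _c: char = unsafe { std::mem::transmute(0xD800_u32) };
}
```

With the Miri component installed on a nightly toolchain, the program can be run with `cargo +nightly miri run`.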
I think I'm missing one "should" in my sentence above. I get that the current UB is very well defined. Instead of:
Please read:
I edited my comment above to make it more clear. |
I don't know what you mean by "should". You could be asking about at which point it is common to see execution go haywire in practice, or you could be asking about whether it is better to specify that the illegal operation is the transmute, or the illegal operation is the match. English is awful. I'll try to answer both ways.

The illegal operation should be the transmute. It is a fundamental property of

In practice you will tend to see the program explode at the match statement, because the match will be lowered to some kind of jump table structure which jumps into some unforeseen code offset if the value is in the niche. But of course in practice it is entirely possible to get a SIGILL or SIGSEGV or other errant behavior before the transmute; UB is a property of the entire program execution. The compiler can use that transmute to reason backwards about what range the value of the input must have been, and often the goal of such optimization is to propagate unreachability backwards along control flow to clean up dead code. |
I think @celinval is asking why we don't change the spec to make the |
I'm OK keeping things as they are right now. May I close this issue? |
Similar to #78, I was wondering why a value of char that is a surrogate is considered invalid as opposed to valid-but-unsafe.
Can a surrogate point actually trigger immediate UB?
Thanks!