Skip to content

Validity of a char value that is a surrogate #513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
celinval opened this issue May 24, 2024 · 16 comments
Closed

Validity of a char value that is a surrogate #513

celinval opened this issue May 24, 2024 · 16 comments

Comments

@celinval
Copy link

Similar to #78, I was wondering why a value of char that is a surrogate is considered invalid as opposed to valid-but-unsafe.

Can a surrogate point actually trigger immediate UB?

Thanks!

@RustyYato
Copy link

This is to enable niches so, Option<char> is the same size as char, if it was a safety condition (instead of a validity condition), this niche optimization wouldn't be allowed since the compiler can only take advantage of validation conditions.

@chorman0773
Copy link
Contributor

And while rustc currently doesn't take advantage of surrogate code point niches for char, nothing precludes it from doing so.

@saethlin
Copy link
Member

nothing precludes it from doing so

That would be circular reasoning, since the question here is whether rustc should.

@chorman0773
Copy link
Contributor

I don't see why it shouldn't tbh (or any other compiler, for that matter).

@ChrisDenton
Copy link
Member

Note that there is plenty of space past the end of the Unicode range for many niches. I don't think there's much need for the compiler using the surrogate range?

@zachs18
Copy link

zachs18 commented May 25, 2024

This is to enable niches so, Option<char> is the same size as char, if it was a safety condition (instead of a validity condition), this niche optimization wouldn't be allowed since the compiler can only take advantage of validation conditions.

Note that there are two "ranges" of invalid char values, 0x0000_D800..=0x0000_DFFF and 0x0011_0000... The first is the "surrogate range" in question, but currently only the second is used for niches by the compiler (as you can see by transmute::<Option<char>, u32>(None) == 0x0011_0000, though this is not a stable guarantee), so as ChrisDenton says it is not necessary that chars in 0x0000_D800..=0x0000_DFFF are "Insta-UB"/validity-invariant-violations to allow niches in char.

@RalfJung
Copy link
Member

RalfJung commented May 26, 2024

FWIW this has been the case since forever. That doesn't answer the question though.

At this point, changing the safety invariant of char to allow surrogates has a risk of subtly breaking char-consuming functions. So the only wiggle room we have is saying that a surrogate is valid but unsafe -- you can construct it without causing UB but you must not call any third-party function on such a value as public APIs may assume that a char is never a surrogate. I'm uncertain whether that's worth it.

@zachs18
Copy link

zachs18 commented May 26, 2024

So the only wiggle room we have is saying that a surrogate is valid but unsafe -- you can construct it without causing UB but you must not call any third-party function on such a value as public APIs may assume that a char is never a surrogate. I'm uncertain whether that's worth it.

We'd also have to at least add "you must not use it as the scrutinee in a(n exhaustive(?)) match", e.g.

let c: char = transmute(0xD800_u32); // current UB
match c {
   '\0'..='\u{d7ff}' => {},
   '\u{e000}'.. => {},
   _ => {}, // is this arm taken, or is this match UB?
}
// this match is definitely UB
match c {
   '\0'..='\u{d7ff}' => {},
   '\u{e000}'.. => {},
}

(Not to mention other places where exhaustive patterns are allowed, e.g. let ('\0'..='\u{d7ff}' | '\u{e000}'..) = c;)

@CAD97
Copy link

CAD97 commented May 26, 2024

All else being equal, simpler validity conditions are preferable. While the validity specification of u32 in 0..0x100000 is arguably simpler than u32 in (0..0xD800 | 0xDC00..0x100000), the simplification is not really a meaningful one.

So the question imo should be to ask what the benefit of either choice is. And there I believe having the valid range exclude the surrogate range wins out, since code that wants access to the surrogate can just use u32, and while rustc can't utilize the internal niche yet, having it makes future safe manipulation of that niche a possibility.

There's also a further simplicity benefit to primitives having equivalent safety and validity. References definitely break that equivalence (and this remains somewhat contentious) and pointers are likely to (is the vtable matching a safety or a validity thing), but char has no reason to violate this equivalence.

@celinval
Copy link
Author

celinval commented Jun 4, 2024

Thanks everyone for all the answers. First, I wanted to clarify my question, and I'll use @zachs18 example for that. My question was whether UB happens should happen when we produce the surrogate point (transmute call) or when we read it (match). I.e.: no matter what, the fallback branch should be treated as unreachable.

let c: char = unsafe { transmute(0xD800_u32) }; // UB due to producing invalid value
match c { // UB due to using unsafe value
   '\0'..='\u{d7ff}' => {},
   '\u{e000}'.. => {},
   _ => {}, // This is unreachable no mater what.
}

If I understood this correctly, in practice, there's no immediate UB triggered today by producing a char value that is a surrogate point.

That said, I do agree that having equivalent safety and validity requirements for primitive types is much simpler and easy to reason about. So maybe that's enough to close this case?

Thanks again!

@digama0
Copy link

digama0 commented Jun 4, 2024

As far as I know, the current situation is that this is immediate UB at the construction of c (i.e. the first line), and the match is not doing anything unsafe (the typed copy happened already on the previous line). The discussions above are about whether we might change this, but the status quo is that this is UB, even though rustc is not currently optimizing based on it.

@RalfJung
Copy link
Member

RalfJung commented Jun 5, 2024

You can also easily use Miri to confirm this.

@celinval
Copy link
Author

celinval commented Jun 5, 2024

I think I'm missing one "should" in my sentence above. I get that the current UB is very well defined. Instead of:

My question was whether UB happens when we produce the surrogate point (transmute call) or when we read it (match).

Please read:

My question was whether UB should happen when we produce the surrogate point (transmute call) or when we read it (match).

I edited my comment above to make it more clear.

@saethlin
Copy link
Member

saethlin commented Jun 5, 2024

I don't know what you mean by "should". You could be asking about at which point it is common to see execution go haywire in practice, or you could be asking about whether it is better to specify that the illegal operation is the transmute, or the illegal operation is the match. English is awful. I'll try to answer both ways.

The illegal operation should be the transmute. It is a fundamental property of char that its value is not in this range, so attempting to construct a char with a value in the forbidden range breaks a validity invariant. This is the "constructing an invalid value" that The Reference alludes to.

In practice you will tend to see the program explode at the match statement, because the match will be lowered to some kind of jump table structure which jumps into some unforseen code offset if the value is in the niche. But of course in practice it is entirely possible to get a SIGILL or SIGEGV or other errant behavior before the transmute; UB is a property of the entire program execution. The compiler can use that transmute to reason backwards about what range the value of the input must have been, and often the goal of such optimization is to propagate unreachability backwards along control flow to clean up dead code.

@RalfJung
Copy link
Member

RalfJung commented Jun 6, 2024

I think @celinval is asking why we don't change the spec to make the match be the invalid operation.

@celinval
Copy link
Author

I'm OK keeping things as they are right now. May I close this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants