Explain the relationship between windows-1252, Latin1, and ASCII #345
encoding.bs
Outdated
web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and
"<code>ascii</code>" are just labels for <a>windows-1252</a>, and any software following this
standard will, for example, decode 0x80 as U+20AC (€) when asked for the Latin1 or ASCII decoding
of that byte.
I think overall this is probably okay, but what gives me pause is that the Encoding standard doesn't define Latin1 or ASCII encodings (it only defines them as labels). So if software exposes those encodings, who knows what they might do. So perhaps we should make that distinction clearer, in that this will likely happen for software that takes a label and some bytes as input.
I tried to phrase this carefully to avoid giving the impression that latin1 or ASCII are encodings, and instead be clear that they are inputs to the common algorithm category that takes (byte sequence, encoding label) parameters.
On the web that algorithm category is well-formalized with the concepts of actual encodings vs. labels, but in larger software it's more vague, with e.g. functions named DecodeLatin1 or similar.
My attempt was "when asked for the Latin1 or ASCII decoding of that byte", but if you have a different suggestion I'd be interested. The main thing is that I don't want to only constrain us to describing the web API case where we have a clear label/encoding divide, but instead the more general category of "please decode some bytes" algorithms across all software.
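To make the "(byte sequence, encoding label)" category concrete, here is a minimal Python sketch, not an Encoding Standard implementation. The names `WINDOWS_1252_LABELS` and `decode_with_label` are hypothetical, the label set is only a partial excerpt of the standard's list, and Python's `cp1252` codec is used as a stand-in for windows-1252, with the five bytes it leaves undefined mapped to the matching C1 code points as the standard's index does.

```python
# Hypothetical, partial excerpt of the labels the Encoding Standard
# assigns to windows-1252; the real list is longer.
WINDOWS_1252_LABELS = {
    "ascii", "us-ascii", "latin1", "l1", "iso-8859-1", "cp1252", "windows-1252",
}


def _build_windows_1252_table():
    """Build a 256-entry byte -> character table from Python's cp1252 codec."""
    table = []
    for byte in range(256):
        try:
            table.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Python leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined;
            # the Encoding Standard maps them to U+0081, U+008D, and so on.
            table.append(chr(byte))
    return table


_TABLE = _build_windows_1252_table()


def decode_with_label(data: bytes, label: str) -> str:
    """Resolve a label to an encoding, then decode (windows-1252 only here)."""
    # Labels are matched after trimming ASCII whitespace, ignoring ASCII case.
    normalized = label.strip("\t\n\x0c\r ").lower()
    if normalized not in WINDOWS_1252_LABELS:
        raise LookupError(f"unsupported label in this sketch: {label!r}")
    return "".join(_TABLE[byte] for byte in data)


# Both "Latin1" and "ASCII" requests end up in windows-1252.
euro_via_latin1 = decode_with_label(b"\x80", "latin1")
euro_via_ascii = decode_with_label(b"\x80", " ASCII ")
```

Under these assumptions both calls return U+20AC (€), whereas a standalone routine in the DecodeLatin1 style would more likely return U+0080.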
The problem I have with this is that browsers typically have "Latin1" code paths that are very much aligned with the Unicode view of the world and not windows-1252. So for complicated software it very much depends on how or what you ask.
I also don't really have a good rephrasing that would account for that. Maybe put Latin1 and ASCII in quotes like below?
I don't think I fully understand what's making you uneasy, but I think adding the quotes is reasonable, so I'll do that.
Well, browsers typically have Latin1 or ASCII encoding implementations that don't do windows-1252. But obviously they also map the "Latin1" and "ascii" labels to windows-1252. So they're on both sides of the divide you're trying to draw.
I added the tracker label to remind myself to come back and make a longer comment/suggestion. While Encoding really doesn't need to present the "complete and accurate history of character encodings" to explain the relationship between 1252 and L1, I do think that the text proposed here might not be clear enough/direct enough. Users of "other software" still often make use of 8859-1 as an isomorphic encoding outside of the Web context and I have encountered people who are pedantic about the Official Meaning of the 8859-1 label. I think the note being added here could be tighter and clearer about this. In particular, I find it misleading (even if factual) to say that ISO/IEC 8859-1 didn't provide mappings for the C0/C1 range and DEL. Every encoder I've ever met has mapped these isomorphically to Unicode, which puts controls at those code points. This is the kind of thing which might merit a short external document in the I18N space so that Encoding can point (for people who want history) while focusing on "Latin1 is not an isomorphic encoding on the Web, eh?"
Thanks @aphillips. I hope we can move forward with something basically in this form though without major objections.
I don't understand this point. I think it's pretty important to be clear about how the original standards didn't give mappings here, and I don't think it's misleading.
Well, except for all the ones that implement the Encoding Standard, including web browsers, right? Your statement seems pretty wrong in light of that.
You're right. I did not mean those implementing the Encoding Standard. I had in mind non-Web converters that implement Latin-1 as an actual independent encoding.
I think that's the primary thing that caused my reaction. ISO 8859 didn't specify characters in the C0 and C1 ranges, but it reserved those code points with an expectation that they'd be filled in by ISO 6429. It wasn't like escapes and controls were unknown (particularly the C0 variety). In practice, coders incorporated the controls. Unicode itself is isomorphic with 8859-1 plus 6429. I don't think I need to repeat the history and why Encoding is based on windows-1252. What I was trying to say hurriedly before was basically: at least some developers remain familiar with non-Encoding-based coders (iconv, JDK, ICU, various databases, etc.) where the label for ISO 8859-1 is an isomorphic encoding. Those developers may have used the encoding to smuggle bytes into or out of strings... and Encoding breaks that assumption (which is the whole point of your PR). All that said... I re-read the text you have this morning, with an eye towards suggesting replacement text, and it seemed adequate this time around. So never mind. I will still raise this with I18N to decide if we might provide some exterior documentation about the history, but nothing Encoding would need to pay attention to or wait on.
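The "smuggling bytes" assumption can be illustrated with Python's built-in codecs, which are non-Encoding-based coders in the sense used above. This is only a sketch: `latin-1` is Python's isomorphic ISO 8859-1 codec, and `cp1252` stands in for windows-1252 (the two agree on 0x80, though Python's codec leaves a few other bytes undefined).

```python
# Every possible byte value, 0x00 through 0xFF.
data = bytes(range(256))

# An isomorphic "Latin-1" coder maps byte N to code point U+00NN,
# including the C0/C1 control ranges and DEL.
as_latin1 = data.decode("latin-1")
isomorphic = [ord(ch) for ch in as_latin1] == list(data)

# That property lets arbitrary bytes travel through a string and back.
round_trip_ok = as_latin1.encode("latin-1") == data

# A windows-1252 style decoder breaks the byte == code point assumption:
# 0x80 becomes U+20AC (the euro sign) rather than U+0080.
euro = b"\x80".decode("cp1252")
```

Code that relies on `isomorphic` holding would see different strings if the same "latin1" label were resolved according to the Encoding Standard.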
Looks good modulo formatting nits.
<td{rowspan}>
<a>{encoding["name"]}</a>
<p class=note>See <a href="#note-latin1-ascii">below</a> for the relationship to historical
"Latin1" and "ASCII" concepts."""
I suspect this is missing a newline at the end.
I added one, and indeed the Python script output is more consistent now, but the main text doesn't match the Python script output: all the blank lines are omitted. Not sure what to do about that, if anything.
I took a more detailed look and pushed a fixup. Should be okay to land now from my perspective.