
Explain the relationship between windows-1252, Latin1, and ASCII #345

Merged: 5 commits, Apr 23, 2025

Conversation

domenic
Member

@domenic domenic commented Apr 11, 2025

@domenic domenic added the clarification Standard could be clearer label Apr 11, 2025
encoding.bs Outdated
web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and
"<code>ascii</code>" are just labels for <a>windows-1252</a>, and any software following this
standard will, for example, decode 0x80 as U+20AC (€) when asked for the Latin1 or ASCII decoding
of that byte.
Member

I think overall this is probably okay, but what gives me pause is that the Encoding standard doesn't define Latin1 or ASCII encodings (it only defines them as labels). So if software exposes those encodings, who knows what they might do. So perhaps we should make that distinction clearer, in that this will likely happen for software that takes a label and some bytes as input.

Member Author

I tried to phrase this carefully to avoid giving the impression that latin1 or ASCII are encodings, and instead be clear that they are inputs to the common algorithm category that takes (byte sequence, encoding label) parameters.

On the web that algorithm category is well-formalized with the concepts of actual encodings vs. labels, but in larger software it's more vague with e.g. functions named DecodeLatin1 or similar.

My attempt was "when asked for the Latin1 or ASCII decoding of that byte", but if you have a different suggestion I'd be interested. The main thing is that I don't want to only constrain us to describing the web API case where we have a clear label/encoding divide, but instead the more general category of "please decode some bytes" algorithms across all software.
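
For readers following along, the "(byte sequence, encoding label)" algorithm category can be sketched roughly as below. This is a minimal illustration, not the standard's algorithm: the label table is a hypothetical, partial one covering only labels mentioned in this thread, `get_encoding` and `decode` are names made up for the sketch, and Python's `cp1252` codec is assumed as a stand-in for windows-1252 (it differs for the five bytes that Python leaves undefined, such as 0x81).

```python
# ASCII whitespace that gets trimmed from a label before lookup.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Hypothetical, partial label table: only labels discussed in this thread.
# The real table in the Encoding Standard lists more labels.
LABEL_TO_ENCODING = {
    "latin1": "windows-1252",
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "windows-1252": "windows-1252",
}

# Assumption: Python's "cp1252" codec approximates windows-1252 closely
# enough for this illustration (it raises on bytes such as 0x81).
PYTHON_CODECS = {"windows-1252": "cp1252"}


def get_encoding(label: str) -> str:
    """Resolve a label to an encoding name (simplified lookup)."""
    # str.lower() is a simplification of ASCII case-insensitive matching.
    key = label.strip(ASCII_WHITESPACE).lower()
    return LABEL_TO_ENCODING[key]  # KeyError for labels not in the table


def decode(data: bytes, label: str) -> str:
    """Decode bytes using whatever encoding the label resolves to."""
    return data.decode(PYTHON_CODECS[get_encoding(label)])


# "latin1" and "ascii" are only labels here; both resolve to windows-1252.
latin1_result = decode(b"\x80", "latin1")
ascii_result = decode(b"\x80", " ASCII\n")
```

Under these assumptions both calls yield U+20AC (€), which is the behavior the proposed note describes.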

Member

The problem I have with this is that browsers typically have "Latin1" code paths that are very much aligned with the Unicode view of the world and not windows-1252. So for complicated software it very much depends on how or what you ask.

I also don't really have a good rephrasing that would account for that. Maybe put Latin1 and ASCII in quotes like below?

Member Author

I don't think I fully understand what's making you uneasy, but I think adding the quotes is reasonable, so I'll do that.

Member

Well, browsers typically have Latin1 or ASCII encoding implementations that don't do windows-1252. But obviously they also map the "Latin1" and "ascii" labels to windows-1252. So they're on both sides of the divide you're trying to draw.
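
The "both sides of the divide" point can be illustrated outside a browser. The sketch below uses Python's built-in codecs purely as an analogy, on the assumption that they resemble the three behaviors under discussion: a strict ASCII decoder, an ISO 8859-1 decoder aligned with Unicode, and a windows-1252 decoder. The helper name `decode_three_ways` is made up for this example and says nothing about how any browser is structured internally.

```python
def decode_three_ways(data: bytes) -> dict[str, str]:
    """Decode the same bytes with three codecs and collect the results."""
    results = {}
    for codec in ("ascii", "latin-1", "cp1252"):
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            # Strict ASCII has no mapping for bytes above 0x7F.
            results[codec] = "<error>"
    return results


results = decode_three_ways(b"\x80")
# results["ascii"]   is "<error>"  (0x80 is outside ASCII)
# results["latin-1"] is "\x80"     (U+0080, a C1 control)
# results["cp1252"]  is "\u20ac"   (the euro sign)
```

So the answer for byte 0x80 depends on which code path is asked, even though on the web the "latin1" and "ascii" labels both lead to the third one.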

@aphillips aphillips added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Apr 11, 2025
@aphillips
Contributor

I added the tracker label to remind myself to come back and make a longer comment/suggestion. While Encoding really doesn't need to present the "complete and accurate history of character encodings" to explain the relationship between 1252 and L1, I do think that the text proposed here might not be clear enough/direct enough. Users of "other software" still often make use of 8859-1 as an isomorphic encoding outside of the Web context and I have encountered people who are pedantic about the Official Meaning of the 8859-1 label. I think the note being added here could be tighter and clearer about this. In particular, I find it misleading (even if factual) to say that ISO/IEC 8859-1 didn't provide mappings for the C0/C1 range and DEL. Every encoder I've ever met has mapped these isomorphically to Unicode, which puts controls at those code points.

This is the kind of thing which might merit a short external document in the I18N space so that Encoding can point (for people who want history) while focusing on "Latin1 is not an isomorphic encoding on the Web, eh?"

@domenic
Member Author

domenic commented Apr 12, 2025

Thanks @aphillips. I hope we can move forward with something basically in this form though without major objections.

> In particular, I find it misleading (even if factual) to say that ISO/IEC 8859-1 didn't provide mappings for the C0/C1 range and DEL.

I don't understand this point. I think it's pretty important to be clear about how the original standards didn't give mappings here, and I don't think it's misleading.

> Every encoder I've ever met has mapped these isomorphically to Unicode, which puts controls at those code points.

Well, except for all the ones that implement the Encoding Standard, including web browsers, right? Your statement seems pretty wrong in light of that.

@aphillips
Contributor

> Well, except for all the ones that implement the Encoding Standard, including web browsers, right? Your statement seems pretty wrong in light of that.

You're right. I did not mean those implementing the Encoding Standard. I had in mind non-Web converters that implement Latin-1 as an actual independent encoding.

> I think it's pretty important to be clear about how the original standards didn't give mappings here

I think that's the primary thing that caused my reaction. ISO 8859 didn't specify characters in the C0 and C1 ranges, but it reserved those code points with an expectation that they'd be filled in by ISO 6429. It wasn't like escapes and controls were unknown (particularly the C0 variety). In practice, coders incorporated the controls. Unicode itself is isomorphic with 8859-1 plus 6429.

I don't think I need to repeat the history and why Encoding is based on windows-1252. What I was trying to say hurriedly before was basically: at least some developers remain familiar with non-Encoding-based coders (iconv, JDK, ICU, various databases, etc. etc.) where the label for ISO 8859-1 is an isomorphic encoding. Those developers may have used the encoding to smuggle bytes into or out of strings... and Encoding breaks that assumption (which is the whole point of your PR).

All that said... I re-read the text you have this morning, with an eye towards suggesting replacement text, and it seemed adequate this time around. So never mind. I will still raise this with I18N to decide if we might provide some exterior documentation about the history, but nothing Encoding would need to pay attention to or wait on.
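
As an illustration of the byte-smuggling assumption described in this comment, here is a minimal sketch using Python's `latin-1` codec as one example of a non-Encoding-based coder where ISO 8859-1 is isomorphic. Python's `cp1252` codec is assumed as an approximation of windows-1252; the variable names are made up for the sketch.

```python
all_bytes = bytes(range(256))

# With ISO 8859-1 as an independent encoding, byte N decodes to U+00NN,
# so arbitrary bytes survive a trip through a string.
smuggled = all_bytes.decode("latin-1")
code_points_match = [ord(ch) for ch in smuggled] == list(all_bytes)
round_trip_ok = smuggled.encode("latin-1") == all_bytes

# With windows-1252, which is what the "latin1" label means under the
# Encoding Standard, byte 0x80 no longer decodes to U+0080.
euro = b"\x80".decode("cp1252")
assumption_holds_for_0x80 = ord(euro) == 0x80
```

Here `code_points_match` and `round_trip_ok` are true, while `assumption_holds_for_0x80` is false because 0x80 decodes to U+20AC. Code that relies on the first behavior breaks when "latin1" is resolved as a label for windows-1252.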

Member

@annevk annevk left a comment

Looks good modulo formatting nits.

<td{rowspan}>
<a>{encoding["name"]}</a>
<p class=note>See <a href="#note-latin1-ascii">below</a> for the relationship to historical
"Latin1" and "ASCII" concepts."""
Member

I suspect this is missing a newline at the end.

Member Author

I added one, and indeed the Python script output is more consistent now, but the main text doesn't match the Python script output: all the blank lines are omitted. Not sure what to do about that, if anything.

Member

I took a more detailed look and pushed a fixup. Should be okay to land now from my perspective.

@annevk annevk merged commit 36fb4e7 into main Apr 23, 2025
2 checks passed
@annevk annevk deleted the latin1-explanation branch April 23, 2025 07:46
Labels
clarification: Standard could be clearer
i18n-tracker: Group bringing to attention of Internationalization, or tracked by i18n but not needing response.
3 participants