Explain the relationship between windows-1252, Latin1, and ASCII #345

domenic · 2025-04-11T01:58:11Z

encoding.bs

annevk · 2025-04-11T06:57:29Z

encoding.bs

+ web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and
+ "<code>ascii</code>" are just labels for <a>windows-1252</a>, and any software following this
+ standard will, for example, decode 0x80 as U+20AC (€) when asked for the Latin1 or ASCII decoding
+ of that byte.


I think overall this is probably okay, but what gives me pause is that the Encoding Standard doesn't define Latin1 or ASCII encodings (it only defines them as labels). So if software exposes those encodings, who knows what they might do. So perhaps we should make that distinction clearer, in that this will likely happen for software that takes a label and some bytes as input.

I tried to phrase this carefully to avoid giving the impression that latin1 or ASCII are encodings, and instead be clear that they are inputs to the common algorithm category that takes (byte sequence, encoding label) parameters.

On the web that algorithm category is well-formalized with the concepts of actual encodings vs. labels, but in larger software it's more vague with e.g. functions named DecodeLatin1 or similar.

My attempt was "when asked for the Latin1 or ASCII decoding of that byte", but if you have a different suggestion I'd be interested. The main thing is that I don't want to only constrain us to describing the web API case where we have a clear label/encoding divide, but instead the more general category of "please decode some bytes" algorithms across all software.

The problem I have with this is that browsers typically have "Latin1" code paths that are very much aligned with the Unicode view of the world and not windows-1252. So for complicated software it very much depends on how or what you ask.

I also don't really have a good rephrasing that would account for that. Maybe put Latin1 and ASCII in quotes like below?

I don't think I fully understand what's making you uneasy, but I think adding the quotes is reasonable, so I'll do that.

Well, browsers typically have Latin1 or ASCII encoding implementations that don't do windows-1252. But obviously they also map "Latin1" and "ascii" labels to windows-1252. So they're on both sides of the divide you're trying to draw.
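
For illustration, here is a minimal Python sketch of the two sides of that divide: a label-based entry point where "latin1" and "ascii" resolve to windows-1252, next to an internal "Latin1" path that maps each byte to the code point with the same number. The names `decode_with_label`, `decode_latin1_internal`, and `WINDOWS_1252_LABELS` are hypothetical, and the label set is only a small subset of what the Encoding Standard lists. Python's `cp1252` codec rejects five bytes that the Encoding Standard's windows-1252 index maps to C1 controls, so the sketch patches those by hand. It is not how any browser actually implements this.

```python
# Hypothetical subset of the labels the Encoding Standard maps to windows-1252.
WINDOWS_1252_LABELS = {"ascii", "us-ascii", "latin1", "iso-8859-1", "windows-1252"}

# Bytes that Python's cp1252 codec leaves undefined, but that the Encoding
# Standard's windows-1252 index maps to the C1 control with the same number.
_C1_PASSTHROUGH = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def decode_windows_1252(data: bytes) -> str:
    """Approximate the Encoding Standard's windows-1252 decoder, byte by byte."""
    return "".join(
        chr(b) if b in _C1_PASSTHROUGH else bytes([b]).decode("cp1252")
        for b in data
    )


def decode_with_label(label: str, data: bytes) -> str:
    """(byte sequence, encoding label) entry point, as a web-facing API might expose."""
    # Strip ASCII whitespace and compare case-insensitively (simplified).
    normalized = label.strip("\t\n\x0c\r ").lower()
    if normalized not in WINDOWS_1252_LABELS:
        raise LookupError(f"label not covered by this sketch: {label!r}")
    return decode_windows_1252(data)


def decode_latin1_internal(data: bytes) -> str:
    """Internal 'Latin1' code path: byte N becomes code point U+00NN."""
    return data.decode("latin-1")


print(ascii(decode_with_label("latin1", b"\x80")))  # '\u20ac'
print(ascii(decode_latin1_internal(b"\x80")))       # '\x80'
```

The same byte yields U+20AC through the label path and U+0080 through the internal path, which is why "how or what you ask" matters.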

encoding.bs

aphillips · 2025-04-11T14:56:07Z

I added the tracker label to remind myself to come back and make a longer comment/suggestion. While Encoding really doesn't need to present the "complete and accurate history of character encodings" to explain the relationship between 1252 and L1, I do think that the text proposed here might not be clear enough/direct enough. Users of "other software" still often make use of 8859-1 as an isomorphic encoding outside of the Web context and I have encountered people who are pedantic about the Official Meaning of the 8859-1 label. I think the note being added here could be tighter and clearer about this. In particular, I find it misleading (even if factual) to say that ISO/IEC 8859-1 didn't provide mappings for the C0/C1 range and DEL. Every encoder I've ever met has mapped these isomorphically to Unicode, which puts controls at those code points.

This is the kind of thing which might merit a short external document in the I18N space so that Encoding can point (for people who want history) while focusing on "Latin1 is not an isomorphic encoding on the Web, eh?"

domenic · 2025-04-12T03:51:45Z

Thanks @aphillips. I hope we can move forward with something basically in this form though without major objections.

> In particular, I find it misleading (even if factual) to say that ISO/IEC 8859-1 didn't provide mappings for the C0/C1 range and DEL.

I don't understand this point. I think it's pretty important to be clear about how the original standards didn't give mappings here, and I don't think it's misleading.

> Every encoder I've ever met has mapped these isomorphically to Unicode, which puts controls at those code points.

Well, except for all the ones that implement the Encoding Standard, including web browsers, right? Your statement seems pretty wrong in light of that.

aphillips · 2025-04-13T17:46:50Z

> Well, except for all the ones that implement the Encoding Standard, including web browsers, right? Your statement seems pretty wrong in light of that.

You're right. I did not mean those implementing the Encoding Standard. I had in mind non-Web converters that implement Latin-1 as an actual independent encoding.

> I think it's pretty important to be clear about how the original standards didn't give mappings here

I think that's the primary thing that caused my reaction. ISO 8859 didn't specify characters in the C0 and C1 ranges, but it reserved those code points with an expectation that they'd be filled in by ISO 6429. It wasn't like escapes and controls were unknown (particularly the C0 variety). In practice, coders incorporated the controls. Unicode itself is isomorphic with 8859-1 plus 6429.

I don't think I need to repeat the history and why Encoding is based on windows-1252. What I was trying to say hurriedly before was basically: at least some developers remain familiar with non-Encoding-based coders (iconv, JDK, ICU, various databases, etc. etc.) where the label for ISO 8859-1 is an isomorphic encoding. Those developers may have used the encoding to smuggle bytes into or out of strings... and Encoding breaks that assumption (which is the whole point of your PR).
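
The byte-smuggling assumption can be illustrated with a short Python sketch. It relies on Python's built-in `latin-1` codec as an example of a non-Encoding-based isomorphic coder, and on Python's `cp1252` codec as a stand-in for windows-1252. That stand-in is an assumption: unlike the Encoding Standard, `cp1252` in Python leaves five bytes undefined, so those bytes are skipped here. The helper name `bytes_changed_by_cp1252` is made up for this example.

```python
raw = bytes(range(256))

# An isomorphic Latin-1 coder maps byte N to U+00NN, so arbitrary bytes
# survive a trip into a string and back out again.
smuggled = raw.decode("latin-1")
assert [ord(ch) for ch in smuggled] == list(raw)
assert smuggled.encode("latin-1") == raw


def bytes_changed_by_cp1252() -> list:
    """Return the byte values whose cp1252 decoding is not U+00NN."""
    changed = []
    for b in range(0x100):
        try:
            ch = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # Undefined in Python's cp1252 (the Encoding Standard instead
            # maps these to the matching C1 control); skip them here.
            continue
        if ord(ch) != b:
            changed.append(b)
    return changed


# Once 0x80 has been decoded as windows-1252, the result no longer fits in
# a Latin-1 byte, so the smuggling trick stops working.
euro = b"\x80".decode("cp1252")
try:
    euro.encode("latin-1")
    round_trips = True
except UnicodeEncodeError:
    round_trips = False

print([hex(b) for b in bytes_changed_by_cp1252()])
print(round_trips)  # False
```

Every byte the helper reports falls in the 0x80 to 0x9F range, which is exactly where the two interpretations of "Latin1" disagree.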

All that said... I re-read the text you have this morning, with an eye towards suggesting replacement text, and it seemed adequate this time around. So never mind. I will still raise this with I18N to decide if we might provide some exterior documentation about the history, but nothing Encoding would need to pay attention to or wait on.

encoding.bs

annevk · 2025-04-15T09:42:36Z

encoding.bs


encoding.bs

annevk

Looks good modulo formatting nits.

annevk · 2025-04-16T09:40:38Z

tools-label-table.py

+   <td{rowspan}>
+    <a>{encoding["name"]}</a>
+    <p class=note>See <a href="#note-latin1-ascii">below</a> for the relationship to historical
+    "Latin1" and "ASCII" concepts."""


I suspect this is missing a newline at the end.

I added one, and indeed the Python script output is more consistent now, but the main text doesn't match the Python script output: all the blank lines are omitted. Not sure what to do about that, if anything.

I took a more detailed look and pushed a fixup. Should be okay to land now from my perspective.

encoding.bs

Explain the relationship between windows-1252, Latin1, and ASCII

ef6c000

domenic added the "clarification" label (Standard could be clearer) Apr 11, 2025

annevk reviewed Apr 11, 2025


aphillips added the "i18n-tracker" label (Group bringing to attention of Internationalization, or tracked by i18n but not needing response) Apr 11, 2025

w3cbot mentioned this pull request Apr 11, 2025

Explain the relationship between windows-1252, Latin1, and ASCII w3c/i18n-activity#2000

Closed

Update the script too

d020321

annevk reviewed Apr 15, 2025


Respond to review comments

bf8f814

annevk approved these changes Apr 16, 2025


domenic and others added 2 commits April 17, 2025 10:01

Formatting issues

8508f52

fix formatting

cf74334

annevk merged commit 36fb4e7 into main Apr 23, 2025
2 checks passed

annevk deleted the latin1-explanation branch April 23, 2025 07:46
