Conversation

mbtaylor (Member) commented Aug 7, 2025

  • char primitives are now UTF-8-encoded bytes,
  • unicodeChar primitives are now UTF-16-encoded byte pairs for BMP characters (non-BMP is excluded)
  • unicodeChar is deprecated
  • a little bit of surrounding text is rephrased to match Unicode concepts
  • there is some non-normative text explaining the implications of UTF-8 being a variable-width encoding
  • references to UCS-2 are removed

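As a non-normative illustration of the first two bullets (a sketch, not part of the PR text; plain Python, nothing VOTable-specific assumed):

```python
# char: one primitive per UTF-8 byte; unicodeChar: one 2-byte big-endian UTF-16
# code unit per BMP character.  For non-ASCII text the two counts differ.
s = "héllo"                            # 5 characters, one of them non-ASCII

char_bytes = s.encode("utf-8")         # char serialization
print(len(s), len(char_bytes))         # 5 characters but 6 UTF-8 bytes

ucs_bytes = s.encode("utf-16-be")      # unicodeChar serialization (BMP only)
print(len(ucs_bytes) // 2)             # 5 code units, equal to the character count
```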

mbtaylor commented Aug 7, 2025

This is the PR that I threatened on the Apps mailing list on 17 July, and follows on from the discussion on that thread and from @msdemlei's presentation in College Park.

It tackles issues #55 and #69, covering both the removal of references to UCS-2 and the redefinition of the char datatype so that it can be used for UTF-8. It is incompatible with PR #68 (if this one is accepted, that one should be retired).

mbtaylor requested review from msdemlei and removed request for msdemlei, August 7, 2025 08:00

rra left a comment

In the world of Unicode RFCs, the standards are fairly careful to always use the term "octet" for the individual 8-bit storage unit of an encoding. This text uses "byte" throughout. I think that's a reasonable choice these days, given that all the computing architectures that used non-8-bit bytes are very obsolete, but it might be worth aligning the terminology on octet just to avoid any confusion for readers who are moving back and forth between the RFC world and the IVOA standard world and might wonder if there's some difference between byte and octet.

I'm not sure where to put this in the document, but it feels worthwhile to add a fairly explicit warning for char that if a value is truncated to fit a length restriction on the column, it may be shorter than the number of octets given in arraysize, and therefore implementations cannot use length == arraysize as a flag to detect possibly truncated values.

Comment on lines +414 to +418
For this type the primitive size of two bytes corresponds to a 2-byte
UTF-16 {\em code unit}.
Only characters in the Unicode Basic Multilingual Plane,
which all have 2-byte representations, are permitted for this datatype,
so that the primitive count matches the character count.

I know that you were trying to drop all the UCS-2 references, but I think you've essentially defined UCS-2 here without using that term explicitly. Maybe it would be a bit easier for implementers to understand if the text says that this is UCS-2? As far as I understand it, UCS-2 is exactly UTF-16 with all planes other than the Unicode Basic Multilingual Plane banned so that every character is exactly two octets.

Should there be a recommendation here about what implementations should do if given unicodeChar that is actually in UTF-16 and therefore contains surrogate pairs? I know that we would like to leave unicodeChar behind us, but we discovered that current PyVO generates unicodeChar fields when given a CSV table to upload, so we may be living with it for a while and I bet implementations will encounter people uploading higher plane characters in the wild.

mbtaylor (Member Author) replied:

I'm happy to revert the language to UCS-2 (which is indeed identical to BMP-only UTF-16) if people think that's more comprehensible.

It may be that there are unicodeChar fields containing higher-plane characters out there somewhere, but that would have been illegal for earlier versions of VOTable and would be illegal with this version too. Given that, I feel like software is within its rights to do whatever it likes... but in practice it's likely to be UTF-16 so using a UTF-16 decoder would probably do the right thing even in the presence of illegal data, so maybe it's worth recommending that.

> I'm happy to revert the language to UCS-2 (which is indeed identical to BMP-only UTF-16) if people think that's more comprehensible.

I didn't express that well -- I like the change and I think it's great to connect this to UTF-16 because UTF-16 encoders are readily available but UCS-2 encoders may be a bit rarer. I was just thinking that it might be good to also note that UTF-16 with this restriction is just UCS-2.

> It may be that there are unicodeChar fields containing higher-plane characters out there somewhere, but that would have been illegal for earlier versions of VOTable and would be illegal with this version too. Given that, I feel like software is within its rights to do whatever it likes... but in practice it's likely to be UTF-16 so using a UTF-16 decoder would probably do the right thing even in the presence of illegal data, so maybe it's worth recommending that.

I like the idea of saying explicitly that you can use a UTF-16 decoder and accept technically invalid VOTables that contain surrogate pairs if you want to be generous in what you accept and can handle UTF-16, but when creating a VOTable, you must not include surrogate pairs.
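
A sketch of the read/write policy being suggested here (hypothetical helper names in Python; this is not text from the PR or from any of its commits):

```python
def encode_unicodechar(s: str) -> bytes:
    """Writer side: BMP-only big-endian UTF-16 (i.e. UCS-2); never emit surrogate pairs."""
    if any(ord(c) > 0xFFFF for c in s):
        raise ValueError("non-BMP character cannot be represented as unicodeChar")
    return s.encode("utf-16-be")

def decode_unicodechar(b: bytes) -> str:
    """Reader side: a standard UTF-16 decoder, which also tolerates
    (technically invalid) surrogate pairs produced for non-BMP characters."""
    return b.decode("utf-16-be")
```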

VOTable.tex (outdated)
Comment on lines 1990 to 1991
the 2-byte big-endian UTF-16 encoding
of a Unicode character from the Basic Multilingual Plane.

Here too, I think this is just another way of saying UCS-2.

gpdf commented Aug 7, 2025

I've been trying to read a variety of sources on UCS-2, all non-authoritative, and have not been able to get a completely clear answer to the question of whether there are any valid UCS-2 code points that would be interpreted differently in UTF-16.

This is roughly, but not precisely, equivalent to asking whether U+D800 - U+DFFF had always been reserved historically, even though UTF-16 didn't come along until later. Virtually all sources I can find are backward-looking, writing from the post-UTF-16 perspective, and just don't address this.

Comment on lines +401 to +410
Note that the primitive size of one byte refers to a single
UTF-8-encoded byte, not to a single character.
Since UTF-8 is a variable-width encoding,
a character may require multiple bytes, and for arrays the
string length (length in characters) and primitive count (length in bytes)
will in general differ.
7-bit ASCII characters are however all encoded as a single byte in UTF-8,
so in the case of ASCII characters, which were required for this
datatype in earlier VOTable versions, the primitive and character count
are equal.

Perhaps I'm overlooking it, and it's already there, but I think it might be worth an explicit statement in the text that clarifies that a bare char without a length (a one-octet string) is limited to being able to store an ASCII character.
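
A tiny illustration of that point (not from the PR; plain Python): only code points below U+0080 fit in a single UTF-8 byte, so a bare char cell can hold nothing beyond 7-bit ASCII.

```python
for c in ("A", "é", "€"):
    print(repr(c), len(c.encode("utf-8")))   # 1, 2 and 3 bytes respectively
```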

mbtaylor (Member Author) replied:

I've added a sentence at 73bfd13 clarifying this. @fxpineau made a similar suggestion.

rra commented Aug 7, 2025

> I've been trying to read a variety of sources on UCS-2, all non-authoritative, and have not been able to get a completely clear answer to the question of whether there are any valid UCS-2 code points that would be interpreted differently in UTF-16.
>
> This is roughly, but not precisely, equivalent to asking whether U+D800 - U+DFFF had always been reserved historically, even though UTF-16 didn't come along until later. Virtually all sources I can find are backward-looking, writing from the post-UTF-16 perspective, and just don't address this.

It's hard to find a formal definition of UCS-2 now because the UCS stuff is basically deprecated, but my understanding is that it is a direct mapping of the Unicode code points to a two-octet number per code point (and comes in either a big-endian or a little-endian variant). UTF-16 is defined to be exactly that mapping (see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G31699) except for surrogate pairs, and the code space for surrogate codes used in surrogate pairs is reserved for that purpose exclusively in https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G24089.
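
A worked example of that mapping (my own sketch in Python, not from the PR): the only place UTF-16 departs from a straight two-octet-per-code-point encoding is for code points above U+FFFF, which are expressed as two code units drawn from the reserved D800-DFFF range.

```python
cp = 0x1F600                      # an arbitrary non-BMP code point
v = cp - 0x10000                  # 20-bit offset into the supplementary planes
high = 0xD800 + (v >> 10)         # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)        # low (trail) surrogate
print(hex(high), hex(low))        # 0xd83d 0xde00
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```

Since those code units are reserved exclusively for surrogate pairs, a BMP-only stream as defined for unicodeChar never contains them, and every other two-octet value means the same thing whether you call the encoding UCS-2 or UTF-16.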

Following comments from Russ Allberry and Gregory D-F, make a couple
of adjustments to the description of the (now deprecated) unicodeChar
type:

  - note that the required encoding is just a rebadged UCS-2
  - explicitly allow readers to treat it as standard UTF-16

mbtaylor commented Aug 8, 2025

I've added a commit 7a17c37 that I think incorporates @rra's suggestions about unicodeChar description and usage recommendations.

mbtaylor commented Aug 8, 2025

> In the world of Unicode RFCs, the standards are fairly careful to always use the term "octet" for the individual 8-bit storage unit of an encoding. This text uses "byte" throughout. I think that's a reasonable choice these days, given that all the computing architectures that used non-8-bit bytes are very obsolete, but it might be worth aligning the terminology on octet just to avoid any confusion for readers who are moving back and forth between the RFC world and the IVOA standard world and might wonder if there's some difference between byte and octet.

I have to confess to working from wikipedia rather than the RFCs for my Unicode information; the wikipedia pages on e.g. UTF-8 and UTF-16 use the term "byte" throughout with no mention of octets. I feel like if it's good enough for wikipedia it's probably good enough here, and given that the rest of the VOTable document uses the term byte throughout as well, I think confusion will be minimised by leaving the terminology as is. But if majority opinion is against me I'll change it.

mbtaylor commented Aug 8, 2025

> I'm not sure where to put this in the document, but it feels worthwhile to add a fairly explicit warning for char that if a value is truncated to fit a length restriction on the column, it may be shorter than the number of octets given in arraysize, and therefore implementations cannot use length == arraysize as a flag to detect possibly truncated values.

I've tried to address this in c9859ed.
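
For reference, a sketch of the kind of truncation behaviour that warning covers (hypothetical helper, not the wording of c9859ed): cutting a value back to an arraysize byte budget must not split a multi-byte UTF-8 sequence, so the stored value can end up shorter than arraysize even though it was truncated.

```python
def truncate_utf8(s: str, arraysize: int) -> bytes:
    """Fit a string into at most `arraysize` UTF-8 bytes without splitting a character."""
    b = s.encode("utf-8")
    if len(b) <= arraysize:
        return b
    # Step back over continuation bytes (10xxxxxx) so the cut falls on a character boundary.
    while arraysize > 0 and (b[arraysize] & 0xC0) == 0x80:
        arraysize -= 1
    return b[:arraysize]

print(len(truncate_utf8("naïve", 3)))   # 2 bytes: shorter than arraysize=3, yet truncated
```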

rra left a comment

This version looks great to me.
