Add support for UTF-8/UTF-16 strings through DOCS #18

emwl · 2025-06-24T08:09:21Z

The most common way of switching the encoding is through ESC sequences in front of the string. The specification has a lot of machinery for different character sets in G0 through G3, control sets C0/C1 and so on, but the majority of files I've seen so far (including the ones from the WebCGM Test Suite) simply use DOCS (DESIGNATE OTHER CODING SYSTEM) as ISO/IEC 2022 and ECMA-35 describe it. I haven't seen any files that use CHARACTER SET LIST and CHARACTER SET INDEX so far.
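For illustration, here is a minimal Python sketch of recognizing a DOCS prefix at the start of a string. It is not the code from this PR. The function name `split_docs_prefix` and both tables are hypothetical, and the final bytes (`G` for UTF-8; `J`, `K`, `L` for UTF-16 after `ESC % /`) are taken from my reading of the ISO-IR registry, so they should be checked against the registry before relying on them. Big-endian byte order for UTF-16 is also an assumption.

```python
# Assumed DOCS final bytes -> Python codec names (verify against ISO-IR).
# "ESC % F" is the form with standard return, "ESC % / F" the form without.
DOCS_WITH_RETURN = {b"G": "utf-8"}
DOCS_WITHOUT_RETURN = {
    b"G": "utf-8", b"H": "utf-8", b"I": "utf-8",              # UTF-8 levels 1-3
    b"J": "utf-16-be", b"K": "utf-16-be", b"L": "utf-16-be",  # UTF-16 levels 1-3
}


def split_docs_prefix(raw):
    """Return (codec_or_None, payload) for a raw CGM string.

    If the string does not start with a recognized DOCS sequence,
    the codec is None and the bytes are returned unchanged.
    """
    if not raw.startswith(b"\x1b%"):
        return None, raw
    if raw[2:3] == b"/":
        codec = DOCS_WITHOUT_RETURN.get(raw[3:4])
        return (codec, raw[4:]) if codec else (None, raw)
    codec = DOCS_WITH_RETURN.get(raw[2:3])
    return (codec, raw[3:]) if codec else (None, raw)
```

A caller would decode the returned payload with the returned codec, and treat `None` as "no switch requested".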

With the first commit, the CGM object keeps track of the currently active encoding, starting out with ISO-8859-1 (as ISO/IEC 8632-1 §6.3.4.5 indicates, and as the old code did). The fallback is still there, so even when decoding fails, we get the same result as before (which is likely garbage/mojibake that shows the individual bytes instead).
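The state tracking and fallback described above could be sketched like this in Python. This is an illustration under assumptions, not the actual implementation: the class `StringDecoder` and its `make_string` method are hypothetical names, only the `ESC % G` (UTF-8) and `ESC % @` (return to the ISO 2022 default) sequences are handled, and treating the switch as persistent across later strings is my reading of "keeps track of the currently active encoding".

```python
class StringDecoder:
    """Remembers the active encoding across strings (hypothetical helper)."""

    DEFAULT = "iso-8859-1"

    def __init__(self):
        self.encoding = self.DEFAULT

    def make_string(self, raw):
        # A DOCS prefix switches the active encoding for this and later strings.
        if raw.startswith(b"\x1b%G"):
            self.encoding = "utf-8"
            raw = raw[3:]
        elif raw.startswith(b"\x1b%@"):
            self.encoding = self.DEFAULT
            raw = raw[3:]
        try:
            return raw.decode(self.encoding)
        except UnicodeDecodeError:
            # Fallback: ISO-8859-1 maps every byte, so this cannot fail.
            # The result may be mojibake, which matches the old behavior.
            return raw.decode(self.DEFAULT)
```

Because ISO-8859-1 assigns a character to every byte value, the fallback never raises, which is what keeps the change safe for files that were already readable before.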

While trying to verify this with Analyzer, I found that the log output wasn't really useful, so the second commit makes those output files use UTF-8 instead. That way, any multi-byte values (or other encodings that aren't ASCII/ISO-8859-1) show up correctly there, while ASCII-only output stays byte-for-byte the same as before.
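As a rough illustration of why this is a low-risk change, the Python sketch below writes log lines with an explicit UTF-8 encoding. The helper name `write_log` is hypothetical and not part of Analyzer. ASCII-only lines produce exactly the same bytes as an ASCII or ISO-8859-1 writer would, and only characters outside ASCII become multi-byte sequences.

```python
import os
import tempfile


def write_log(path, lines):
    """Write one log line per entry, always encoded as UTF-8."""
    # Naming the encoding explicitly avoids depending on the platform default.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")


def read_bytes(path):
    """Return the raw file content, handy for comparing outputs."""
    with open(path, "rb") as handle:
        return handle.read()
```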

I'm not too sure about the third commit, but I noticed this in the WebCGM Test Suite files, especially the ones created with IsoDraw. FONT LIST has a list of String-Fixed values (see ISO/IEC 8632-1 §7.3.13) but those files use the same multi-byte encoding as a regular String would. https://github.com/BhaaLseN/CgmInfo/ also uses a regular string for it.
I don't think there are any downsides to this, because the fallback is still there; just let me know if you want to keep that last commit or not.
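To make the FONT LIST case concrete, here is a hedged Python sketch that reads the parameter bytes as a sequence of length-prefixed strings and decodes each one with the active encoding, falling back to ISO-8859-1. The function `read_font_list` is hypothetical. It assumes the binary encoding's string layout (one length octet, with 255 announcing a 16-bit length whose top bit flags a continuation), does not handle continued strings, and expects any padding byte to be stripped beforehand.

```python
def read_font_list(params, encoding="iso-8859-1"):
    """Decode FONT LIST parameter bytes into a list of font names."""
    names = []
    pos = 0
    while pos < len(params):
        length = params[pos]
        pos += 1
        if length == 255:
            # Long form: 16-bit word, top bit marks a continued string.
            word = int.from_bytes(params[pos:pos + 2], "big")
            pos += 2
            if word & 0x8000:
                raise NotImplementedError("continued strings are not handled")
            length = word
        raw = params[pos:pos + length]
        pos += length
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Same fallback as for regular strings: never fails, may be mojibake.
            names.append(raw.decode("iso-8859-1"))
    return names
```

With the default encoding this behaves like a plain String-Fixed reader, so files that follow the specification are unaffected.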

Also, a quick disclaimer: I mainly used Analyzer and the text output to test this; I haven't yet tried to render an image with it. I assume the drawing routines will do the right thing there.


emwl · 2025-06-26T05:51:00Z

Took me a bit longer than anticipated, but I finally got around to testing this visually:

Not sure what you think, but I'd argue that's an improvement :)

emwl added 3 commits June 22, 2025 14:43

add support for utf-8/utf-16 encoded strings

3e2726d

Analyzer: use UTF-8 for the output files to support multibyte characters

005a07f

allow FontList parsing to be more forgiving

db2dbff

Some metafile generators (such as IsoDraw) write this as a regular string, using the current encoding. makeString handles this well enough to just switch over to it.

BhaaLseN mentioned this pull request Oct 10, 2025

After converting CGM to PNG, the image display is chaotic #19

Closed
