Wording in 9.10.2 is misleading and strictly saying incorrect. #160

crlf0710 · 2022-02-23T05:40:46Z

Describe the bug

1. In 9.10.2, paragraph 1, there's this sentence: "A PDF processor can use these methods, in the priority given, to map a character code to a Unicode value." It should read "to map a character code to a Unicode character sequence" (following the wording in the 9.10.3 ToUnicode CMaps clause).

   This is because it is possible to specify multiple Unicode values, or UTF-16BE code units, in a row in ToUnicode CMaps and mapping-resources-pdf CMaps, which makes the result a Unicode character sequence. The new wording also covers the AGLFN cases well.
2. Also in paragraph 1, there's this sentence: "Tagged PDF documents, in particular, shall provide at least one of these methods." I believe this sentence is misleading and/or incorrect. First, providing methods is the duty of processors, not documents. Second, it's totally fine for a Tagged PDF document not to exercise any of the methods in that section if the document is fully annotated by the AltText mechanism.
3. In paragraph 9 there's this sentence: "Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value." This wording is correct at the moment but not future-proof. It is definitely possible, and would be useful, for mapping-resources-pdf CMaps to use Unicode character sequences in the future, for cases like UVSs. (Ken Lunde has expressed interest in doing so in the past in [Adobe-Japan1-UCS2] Suggested changes adobe-type-tools/mapping-resources-pdf#6 (comment), which references https://ccjktype.fonts.adobe.com/2019/05/to-uvs-or-not-to-uvs.html. We all know this work has stalled, but I think the direction is correct and it's a useful provision.)
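To illustrate the point about ToUnicode CMaps above, here is a minimal sketch of decoding the destination of a `bfchar`-style entry, assuming the destination is a hex string of UTF-16BE code units. The function name `decode_tounicode_target` and the sample entries are hypothetical and not taken from the specification or from any library.

```python
def decode_tounicode_target(hex_string: str) -> str:
    """Decode a ToUnicode destination (UTF-16BE hex digits) into a Python str.

    Hypothetical helper for illustration only; it accepts the hex digits
    with or without the surrounding angle brackets.
    """
    cleaned = hex_string.strip().strip("<>")
    return bytes.fromhex(cleaned).decode("utf-16-be")


# One character code mapped to two Unicode characters (an "ff" ligature glyph).
ligature = decode_tounicode_target("<00660066>")

# One character code mapped to a single character outside the BMP,
# written as a surrogate pair of two UTF-16BE code units.
astral = decode_tounicode_target("<D840DC3E>")

print([f"U+{ord(ch):04X}" for ch in ligature])  # two code points
print([f"U+{ord(ch):04X}" for ch in astral])    # one code point
```

In both cases the result is naturally a string rather than a single integer, which is what "Unicode character sequence" describes.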


petervwyatt · 2022-02-25T04:20:06Z

1. Good suggestion.
2. I agree - a PDF is not required to contain text or fonts. The "shall" statement here is also at odds with the softer "should" statement in the very first sentence of 14.8.2.6. Maybe we can switch it around to a more factually worded statement supporting the 14.8.2.6 wording: "Tagged PDF documents that contain text should provide Unicode mappings (see 14.8.2.6 ...)". Note also that 14.8.2.6 already has a back-reference to 9.10.2.
3. I'm not sure what concrete suggestion you're making... do you mean that the singular "... producing a Unicode value" should really support plural outputs - as in: "... producing Unicode values"?

MatthiasValvekens · 2022-02-25T08:26:02Z

> I'm not sure what concrete suggestion you're making... do you mean that the singular "... producing a Unicode value" should really support plural outputs - as in: "... producing Unicode values"?

Yes, that's how I read it as well. Right now, the "predefined" ToUnicode CMaps all happen to have mappings sending one CID to one Unicode codepoint, but that may well change in the future (for the reasons explained in Ken Lunde's post linked by @crlf0710). So I would say that replacing "producing a Unicode value" with "producing a Unicode character sequence" is more appropriate.
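As a rough illustration of why a single value is not enough for the UVS case, the sketch below builds a base character followed by a variation selector and shows its code points and UTF-16BE form. The specific pair (U+8FBB followed by U+E0100) is chosen only as an example and is an assumption, not something stated in this thread.

```python
# Illustrative variation sequence: a base ideograph plus a variation selector.
# The specific pair is an assumption made for this example.
base = "\u8fbb"
selector = "\U000E0100"  # VARIATION SELECTOR-17
sequence = base + selector

# The mapping target is two code points, so it cannot be one "Unicode value".
code_points = [f"U+{ord(ch):04X}" for ch in sequence]

# In UTF-16BE the selector needs a surrogate pair, giving three code units.
utf16_hex = sequence.encode("utf-16-be").hex().upper()

print(code_points)
print(utf16_hex)
```

A CMap that wanted to send one CID to this sequence would have to emit all of those code units for a single source code.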

I always forget whether "character" or "codepoint" is the term of choice here, so perhaps someone more intimately familiar with the internal conventions of Unicode can comment on that.

crlf0710 · 2022-02-25T13:24:08Z

> 3. I'm not sure what concrete suggestion you're making... do you mean that the singular "... producing a Unicode value" should really support plural outputs - as in: "... producing Unicode values"?

Yes, @MatthiasValvekens explained my thought well.

As for the term of choice above: for text extraction purposes I'd slightly prefer "character" over "codepoint", because it avoids the need to talk about codepoints that are not assigned to characters (Surrogate, Noncharacter, Reserved).
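The distinction can be sketched with Python's standard `unicodedata` module, which reports a general category for every code point. The helper name `is_assigned_character` is hypothetical, and treating categories `Cs` and `Cn` as "not a character" is a simplification made for this example; the result for reserved code points depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_assigned_character(code_point: int) -> bool:
    """Return False for surrogates (Cs) and unassigned code points (Cn).

    Noncharacters such as U+FFFF are also reported as Cn by unicodedata.
    Hypothetical helper; private-use code points (Co) are not excluded here.
    """
    return unicodedata.category(chr(code_point)) not in ("Cs", "Cn")


print(is_assigned_character(0x0041))  # LATIN CAPITAL LETTER A
print(is_assigned_character(0xD800))  # a surrogate code point
print(is_assigned_character(0xFFFF))  # a noncharacter
```

Every code point in the range U+0000 to U+10FFFF is a valid argument, but only some of them are characters in this sense.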

petervwyatt · 2022-02-28T09:19:16Z

To summarize the proposed solutions:

"A PDF processor can use these methods, in the priority given, to map a character code to a Unicode character sequence".
"Tagged PDF documents that contain text should provide Unicode mappings (see 14.8.2.6 ...)".
"Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode character sequence."
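The first proposed sentence can be sketched as a simple fallback chain in which each method either returns a Unicode character sequence or declines. All names here (`Mapper`, `map_code`, the two toy tables) are hypothetical; a real processor would derive its methods from the font dictionary rather than from hard-coded dictionaries.

```python
from typing import Callable, Optional, Sequence

# A mapping method takes a character code and returns a Unicode character
# sequence (a str), or None when it has no mapping for that code.
Mapper = Callable[[int], Optional[str]]


def map_code(code: int, methods: Sequence[Mapper]) -> Optional[str]:
    """Try each method in priority order and return the first result."""
    for method in methods:
        result = method(code)
        if result is not None:
            return result
    return None


# Toy stand-ins for two methods, highest priority first.
first_table = {0x41: "A", 0x66: "ff"}
second_table = {0x42: "B", 0x66: "f"}
methods = [first_table.get, second_table.get]

print(map_code(0x66, methods))  # found by the first method
print(map_code(0x42, methods))  # falls through to the second method
print(map_code(0x43, methods))  # no method has a mapping
```

Returning `str` rather than `int` is what lets a single code map to more than one character.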

petervwyatt · 2022-04-28T20:31:34Z

The PDF TWG identified multiple issues with this clause and would like to remove this section from ISO 32000 in favour of a new, more easily updatable document. We will revisit this issue after that discussion (if necessary).

MatthiasValvekens · 2022-05-19T21:04:10Z

Just as a quick update: upon reviewing this clause, WG8 determined that it was only about mapping individual character codes to Unicode values, and not so much about text extraction in general. Its only purpose is to list off possible methods to perform that mapping. As such, the committee concluded that no changes to ISO 32000 were necessary. WG8 also noted that a best practices guide for text extraction would be more at home in a vendor body like the PDF Association, as opposed to a WG8 project.

(I hope that that's a more or less correct summary; I'm going off my recollection of the meeting here, which may or may not be accurate.)

petervwyatt · 2022-05-20T06:41:22Z

As noted by @MatthiasValvekens, WG8 does not wish to change the current wording via an erratum, so this issue will be closed as "no Fix".

However, if someone wishes to volunteer to lead new work inside the PDF Association to create a best practices guide for text extraction then please re-open this issue.

crlf0710 added the "bug" label Feb 23, 2022

petervwyatt self-assigned this Feb 28, 2022

petervwyatt added the "proposed solution" label Feb 28, 2022

petervwyatt added this to the "Font and text related" milestone Mar 7, 2022

petervwyatt assigned mrbhardy Mar 17, 2022

petervwyatt closed this as completed May 20, 2022

petervwyatt added the "documentation" and "wontfix" labels and removed the "bug" and "proposed solution" labels May 20, 2022

petervwyatt removed the "documentation" label Jun 17, 2024
