Wording in 9.10.2 is misleading and, strictly speaking, incorrect. #160

Closed
crlf0710 opened this issue Feb 23, 2022 · 7 comments
Assignees
Labels
wontfix This issue did not result in any spec changes

Comments

@crlf0710

crlf0710 commented Feb 23, 2022

Describe the bug

  1. In 9.10.2, paragraph 1, there's this sentence: "A PDF processor can use these methods, in the priority given, to map a character code to a Unicode value." It should read "to map a character code to a Unicode character sequence" (following the wording in the 9.10.3 ToUnicode CMaps clause).

    This is because it is possible to specify multiple Unicode values or UTF-16BE code units in a row in ToUnicode CMaps and mapping-resources-pdf CMaps, yielding a Unicode character sequence. The new wording also covers the AGLFN cases well.

  2. Also in paragraph 1, there's this sentence: "Tagged PDF documents, in particular, shall provide at least one of these methods." I believe this sentence is misleading and/or incorrect. First, providing methods is the duty of processors, not documents. Second, it's totally fine for a Tagged PDF document not to exercise any of the methods in that section if the document is fully annotated using the AltText mechanism.

  3. In paragraph 9 there's this sentence: "Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value." This wording is correct at the moment but not future-proof. It is entirely possible, and would be useful, for mapping-resources-pdf CMaps to use Unicode character sequences in the future, for cases like UVSs. (Ken Lunde has expressed interest in doing so in the past in [Adobe-Japan1-UCS2] Suggested changes adobe-type-tools/mapping-resources-pdf#6 (comment), which references https://ccjktype.fonts.adobe.com/2019/05/to-uvs-or-not-to-uvs.html . We all know this work has stalled, but I think the direction is correct and it's a useful provision.)
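
To illustrate points 1 and 3, here is a minimal Python sketch of how one destination string from a ToUnicode CMap entry can decode to more than one Unicode character. The helper name `decode_tounicode_target` and the three sample hex strings are made up for this example and are not taken from the spec or from any real CMap; the only assumption is that the destination is a hex string of UTF-16BE code units.

```python
def decode_tounicode_target(hex_string: str) -> str:
    """Decode a destination such as '<00660066>' (hypothetical sample data).

    Assumes the destination is a hex string of UTF-16BE code units.
    """
    raw = bytes.fromhex(hex_string.strip("<>"))
    return raw.decode("utf-16-be")


# Two BMP code units: one character code maps to the two characters "ff".
ligature = decode_tounicode_target("<00660066>")

# One surrogate pair: two code units, but a single character (U+1D400).
supplementary = decode_tounicode_target("<D835DC00>")

# Base character U+8FBB followed by variation selector U+E0100 (a UVS).
variation = decode_tounicode_target("<8FBBDB40DD00>")

print([f"U+{ord(ch):04X}" for ch in variation])
```

In each case the result is naturally a string (a character sequence) rather than a single "Unicode value", which is the distinction the proposed wording tries to capture.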

@crlf0710 crlf0710 added the bug Something isn't correct label Feb 23, 2022
@petervwyatt
Member

  1. Good suggestion.

  2. I agree - a PDF is not required to contain text or fonts. The "shall" statement here is also at odds with the softer "should" statement in the very first sentence of 14.8.2.6. Maybe we can switch it around to a more factually worded statement supporting the 14.8.2.6 wording: "Tagged PDF documents that contain text should provide Unicode mappings (see 14.8.2.6 ...)". Note also that 14.8.2.6 already has a back-reference to 9.10.2.

  3. I'm not sure what concrete suggestion you're making... do you mean that the singular "... producing a Unicode value" should really support plural outputs - as in: "... producing Unicode values"?

@MatthiasValvekens
Member

> I'm not sure what concrete suggestion you're making... do you mean that the singular "... producing a Unicode value" should really support plural outputs - as in: "... producing Unicode values"?

Yes, that's how I read it as well. Right now, the "predefined" ToUnicode CMaps all happen to have mappings sending one CID to one Unicode codepoint, but that may well change in the future (for the reasons explained in Ken Lunde's post linked by @crlf0710). So I would say that replacing "producing a Unicode value" with "producing a Unicode character sequence" is more appropriate.

I always forget whether "character" or "codepoint" is the term of choice here, so perhaps someone more intimately familiar with the internal conventions of Unicode can comment on that.

@crlf0710
Author

> 3. I'm not sure what concrete suggestion you're making... do you mean that the singular "... producing a Unicode value" should really support plural outputs - as in: "... producing Unicode values"?

Yes, @MatthiasValvekens explained my thought well.

As for the term of choice above, for text extraction purposes I'd slightly prefer "character" over "codepoint", because this avoids the need to talk about codepoints that are not assigned to characters (Surrogate, Noncharacter, Reserved).
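
As a rough illustration of that distinction, the following Python sketch uses the standard `unicodedata` module to tell apart codepoints that are assigned to characters from surrogates and unassigned codepoints. The function name `is_assigned_character` is hypothetical, and treating exactly the general categories `Cs` and `Cn` as "not a character" is a simplifying assumption made for this example.

```python
import unicodedata


def is_assigned_character(code_point: int) -> bool:
    """Return False for surrogates (Cs) and unassigned codepoints (Cn).

    Cn covers both noncharacters and reserved codepoints.
    Simplifying assumption: every other general category counts as a character.
    """
    category = unicodedata.category(chr(code_point))
    return category not in ("Cs", "Cn")


samples = {
    "LATIN CAPITAL LETTER A": 0x0041,
    "surrogate": 0xD800,
    "noncharacter": 0xFFFF,
    "reserved": 0x0378,
}
results = {name: is_assigned_character(cp) for name, cp in samples.items()}
print(results)
```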

@petervwyatt
Member

To summarize the proposed solutions:

  1. "A PDF processor can use these methods, in the priority given, to map a character code to a Unicode character sequence".
  2. "Tagged PDF documents that contain text should provide Unicode mappings (see 14.8.2.6 ...)".
  3. "Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode character sequence."
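
For illustration only, proposed wording 1 could be sketched in Python as a chain of lookups tried in priority order, each returning a character sequence (a `str`) or nothing. The three lookup tables, their contents, and the function name `map_code_to_unicode` are invented for this example; they merely stand in for whatever methods a PDF processor supports and are not a rendering of the actual 9.10.2 procedure.

```python
from typing import Callable, Optional, Sequence

# A mapper takes a character code and returns a Unicode character
# sequence, or None if it has no mapping for that code.
Mapper = Callable[[int], Optional[str]]


def map_code_to_unicode(code: int, mappers: Sequence[Mapper]) -> Optional[str]:
    """Try each mapper in priority order; the first hit wins."""
    for mapper in mappers:
        text = mapper(code)
        if text is not None:
            return text
    return None


# Hypothetical tables, highest priority first.
to_unicode_cmap = {0x01: "ff"}.get                 # one code, two characters
glyph_name_lookup = {0x01: "?", 0x02: "A"}.get     # shadowed for code 0x01
predefined_cmap = {0x03: "\u8fbb\U000e0100"}.get   # base character + variation selector

mappers = [to_unicode_cmap, glyph_name_lookup, predefined_cmap]
print(map_code_to_unicode(0x01, mappers))
```

Returning a string rather than a single integer is what lets the same interface cover ligatures and variation sequences.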

@petervwyatt petervwyatt self-assigned this Feb 28, 2022
@petervwyatt petervwyatt added the proposed solution Proposed solution is ready for review label Feb 28, 2022
@petervwyatt petervwyatt added this to the Font and text related milestone Mar 7, 2022
@petervwyatt
Member

The PDF TWG identified multiple issues with this clause and would like to remove this section from ISO 32000 in favour of a new, more easily updatable document. We will revisit this issue after that discussion (if necessary).

@MatthiasValvekens
Member

Just as a quick update: upon reviewing this clause, WG8 determined that it was only about mapping individual character codes to Unicode values, and not so much about text extraction in general. Its only purpose is to list off possible methods to perform that mapping. As such, the committee concluded that no changes to ISO 32000 were necessary. WG8 also noted that a best practices guide for text extraction would be more at home in a vendor body like the PDF Association, as opposed to a WG8 project.

(I hope that that's a more or less correct summary; I'm going off my recollection of the meeting here, which may or may not be accurate.)

@petervwyatt
Member

As noted by @MatthiasValvekens, WG8 does not wish to change the current wording via an erratum, so this issue will be closed as "no Fix".

However, if someone wishes to volunteer to lead new work inside the PDF Association to create a best practices guide for text extraction, please re-open this issue.

@petervwyatt petervwyatt added documentation Improvements or additions to documentation wontfix This issue did not result in any spec changes and removed bug Something isn't correct proposed solution Proposed solution is ready for review labels May 20, 2022
@petervwyatt petervwyatt removed the documentation Improvements or additions to documentation label Jun 17, 2024