Wording in 9.10.2 is misleading and, strictly speaking, incorrect. #160
Comments
Yes, that's how I read it as well. Right now, the "predefined" ToUnicode CMaps all happen to have mappings sending one CID to one Unicode codepoint, but that may well change in the future (for the reasons explained in Ken Lunde's post linked by @crlf0710). So I would say that replacing "producing a Unicode value" with "producing a Unicode character sequence" is more appropriate. I always forget whether "character" or "codepoint" is the term of choice here, so perhaps someone more intimately familiar with the internal conventions of Unicode can comment on that.
Yes, @MatthiasValvekens explained my thinking well. As for the choice of term, for text extraction purposes I slightly prefer "character" over "codepoint", because it avoids the need to talk about codepoints that are not assigned to characters (Surrogate, Noncharacter, Reserved).
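To make the distinction concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `is_assigned_character` is hypothetical, and treating the general categories `Cs` (surrogate) and `Cn` (unassigned, which includes noncharacters) as "not a character" is a simplifying assumption for illustration, not wording taken from ISO 32000 or the Unicode Standard.

```python
import unicodedata

def is_assigned_character(codepoint: int) -> bool:
    """Hypothetical helper: True if the codepoint is neither a surrogate
    (category Cs) nor unassigned/noncharacter (category Cn)."""
    return unicodedata.category(chr(codepoint)) not in ("Cs", "Cn")

# A few illustrative codepoints.
for cp in (0x0041, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}: category {unicodedata.category(chr(cp))}, "
          f"assigned character: {is_assigned_character(cp)}")
```

Every value in the loop is a valid codepoint, but only U+0041 is an assigned character, which is the situation the "character" wording sidesteps.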
To summarize the proposed solutions:
The PDF TWG identified multiple issues with this clause and would like to remove this section from ISO 32000 in favour of a new, more easily updatable document. We will revisit this issue after that discussion (if necessary).
Just as a quick update: upon reviewing this clause, WG8 determined that it was only about mapping individual character codes to Unicode values, and not so much about text extraction in general. Its only purpose is to list off possible methods to perform that mapping. As such, the committee concluded that no changes to ISO 32000 were necessary. WG8 also noted that a best practices guide for text extraction would be more at home in a vendor body like the PDF Association, as opposed to a WG8 project. (I hope that that's a more or less correct summary; I'm going off my recollection of the meeting here, which may or may not be accurate.)
As noted by @MatthiasValvekens, WG8 does not wish to change the current wording via an erratum, so this issue will be closed as "no Fix". However, if someone wishes to volunteer to lead new work inside the PDF Association to create a best practices guide for text extraction, then please re-open this issue.
Describe the bug
In 9.10.2, paragraph 1, there is this sentence: "A PDF processor can use these methods, in the priority given, to map a character code to a Unicode value." It should read "to map a character code to a Unicode character sequence" (following the wording in 9.10.3, ToUnicode CMaps). This is because it is possible to specify multiple Unicode values, or UTF-16BE code units, in a row in ToUnicode CMaps and mapping-resources-pdf CMaps, which yields a Unicode character sequence. The new wording also covers the AGLFN cases well.

Also in paragraph 1, there is this sentence: "Tagged PDF documents, in particular, shall provide at least one of these methods." I believe this sentence is misleading and/or incorrect. First, providing methods is the duty of processors, not documents. Second, it is perfectly fine for a Tagged PDF document not to exercise any of the methods in that section if the document is fully annotated through the AltText mechanism.

In paragraph 9, there is this sentence: "Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value." This wording is correct at the moment but not future-proof. It is entirely possible, and would be useful, for mapping-resources-pdf CMaps to use Unicode character sequences in the future, for cases like UVSs. (Ken Lunde has expressed interest in doing so in the past in "[Adobe-Japan1-UCS2] Suggested changes", adobe-type-tools/mapping-resources-pdf#6 (comment), which references https://ccjktype.fonts.adobe.com/2019/05/to-uvs-or-not-to-uvs.html. We all know this work has stalled, but I think the direction is correct and it is a useful provision.)
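To illustrate why "Unicode character sequence" is the more accurate wording, here is a minimal sketch of how the destination string of a ToUnicode mapping entry can be read as UTF-16BE. The function name `decode_tounicode_target` and the sample entries are made up for illustration (they are not taken from any real CMap), and the sketch assumes the destination has already been extracted as a plain hex string.

```python
def decode_tounicode_target(hex_string: str) -> str:
    """Interpret a mapping destination (hex digits) as UTF-16BE text."""
    return bytes.fromhex(hex_string).decode("utf-16-be")

# Hypothetical sample entries: character code -> destination hex string.
sample_entries = {
    0x0041: "0041",          # one code unit, one character
    0x005F: "00660066",      # two code units, two characters ("ff")
    0x3A51: "D840DC3E",      # surrogate pair, one supplementary character
    0x1234: "845BDB40DD00",  # base character followed by a variation selector
}

for code, target in sample_entries.items():
    text = decode_tounicode_target(target)
    scalars = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"<{code:04X}> -> {scalars}")
```

Only the first entry yields what could loosely be called a single "Unicode value" in every sense. The second and fourth produce more than one character, and the third needs two UTF-16 code units for a single character, so a processor that assumes one 16-bit value per character code would mishandle all three.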