Impossible to mark text in some .PDF's, primarily affecting home-copies of books. #4477

gevvan · 2025-04-26T23:54:17Z

Description of the bug

Hey guys!

Thanks for a great engine.

I first reported this on the SumatraPDF discussion page earlier tonight, because I initially thought the problem was in their viewer. An advanced user there pointed out that the problem is actually in MuPDF, not in SumatraPDF, since it can be reproduced in any software that is based on MuPDF.

Here is the thread I started on the SumatraPDF discussion forum, for reference.
sumatrapdfreader/sumatrapdf#4905

Even if the .pdf is formatted weirdly, I still reckon this is an issue (hopefully an easily fixed one), because it does not happen with web browsers, Okular, Abby Reader, or Adobe Reader.

How to reproduce the bug

Open the following .pdf file in any MuPDF-based reader, for example SumatraPDF.

https://sharey.org/files/0X1cv1.pdf

Now try to mark (select) some text. You'll see that it tries to mark the entire .pdf at once instead of going word by word or letter by letter.

PyMuPDF version

1.25.5

Operating system

Windows

Python version

3.9

JorjMcKie · 2025-04-27T12:04:14Z

Maintainer

This is not a PyMuPDF bug, because PyMuPDF has no dealings with PDF viewers at all.
If anything, you should talk to the MuPDF team directly, e.g. via their Discord channel.
As you correctly mention, the problem is caused by the PDF itself: a quick look at the file shows that the fonts' metric values are ill-specified in almost every imaginable way, both in the PDF's override specifications and in the font binaries.
The ascender and descender values control the height of the characters above and below the baseline, respectively. The ascender in a font file should usually be a value between 0.8 and 1.3, and the descender is typically something like -0.2 to -0.3. When these values are overridden by the definition in the PDF, they are multiplied by 1000, so the corresponding ranges are 800 to 1300 and -200 to -300.
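
The rule of thumb above can be written down as a quick plausibility check. This is only a minimal sketch: the function names and the exact limits are assumptions taken from the ranges in this comment, not anything PyMuPDF or MuPDF provides.

```python
# Rule-of-thumb limits from the comment above, in font units (em = 1.0).
# The exact bounds are assumptions for illustration, not a standard.
ASCENDER_RANGE = (0.8, 1.3)
DESCENDER_RANGE = (-0.3, -0.2)


def plausible_font_metrics(ascender: float, descender: float) -> bool:
    """Check ascender / descender values as read from a font file."""
    return (
        ASCENDER_RANGE[0] <= ascender <= ASCENDER_RANGE[1]
        and DESCENDER_RANGE[0] <= descender <= DESCENDER_RANGE[1]
    )


def plausible_descriptor_metrics(ascent: float, descent: float) -> bool:
    """Check /Ascent and /Descent of a PDF FontDescriptor (scaled by 1000)."""
    return plausible_font_metrics(ascent / 1000, descent / 1000)
```

For example, `plausible_descriptor_metrics(800, -200)` passes, while any negative ascender fails.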

In this PDF, however, the font files contain negative (!) ascender values, for example font.name='*Minion Pro-Bold-6047 Regular', font.ascender=-32.768001556396484, font.descender=-0.23800000548362732, which is a complete no-go.
The PDF override values are just as crazy, for example:

33 0 obj
<<
  /Ascent 67306242   # nonsense!
  /AvgWidth 434
  /CapHeight 100992260   # nonsense!
  /Descent -237  # ok
  /Flags 262148
  /FontBBox [ -81 -238 869 758 ]
  /FontFile3 36 0 R
  /FontName /*Minion#20Pro-Bold-6047
  /ItalicAngle 0
  /Lang /SV
  /MaxWidth 846
  /StemV 100
  /Type /FontDescriptor
  /XHeight 84149251   # nonsense!
>>
endobj
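
One way to spot such descriptors automatically is to compare the vertical metrics against the /FontBBox of the same object. The sketch below works on object source text in the layout shown above. The heuristic itself (flag anything larger than twice the bounding box height) and the names `suspicious_metrics` and `slack` are my own assumptions, not something MuPDF does.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"  # integer or decimal number as written in the PDF
CHECKED_KEYS = ("Ascent", "CapHeight", "XHeight", "Descent")


def suspicious_metrics(descriptor_text: str, slack: float = 2.0) -> dict:
    """Return {key: value} for metrics far outside the /FontBBox.

    Heuristic (an assumption): a vertical metric whose magnitude exceeds
    `slack` times the bounding box height is treated as nonsense.
    """
    bbox = re.search(
        rf"/FontBBox\s*\[\s*({NUM})\s+({NUM})\s+({NUM})\s+({NUM})\s*\]",
        descriptor_text,
    )
    if bbox is None:
        return {}  # nothing to compare against
    _, lly, _, ury = (float(g) for g in bbox.groups())
    limit = slack * abs(ury - lly)

    bad = {}
    for key in CHECKED_KEYS:
        match = re.search(rf"/{key}\s+({NUM})", descriptor_text)
        if match and abs(float(match.group(1))) > limit:
            bad[key] = float(match.group(1))
    return bad
```

Applied to object 33 above, this flags Ascent, CapHeight and XHeight and leaves Descent alone.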

For meaningful text position extraction (the basis for all PDF viewers), one of the two value sets (font file or PDF override) must be used.
If both are nonsense, only guesswork can help, for example taking the font name and looking up reasonable values from a normal, well-formed version of that font.
We do not do this.
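
For anyone who wants to try the name-based guesswork themselves, a rough sketch could look like the following. Neither PyMuPDF nor MuPDF does this. The helper names are hypothetical, and stripping the leading "*" and the trailing "-<digits>" is a guess based on the font names in this file. The table of known metrics must be supplied by the caller from a well-formed copy of each font.

```python
import re


def normalize_font_name(pdf_name: str) -> str:
    """Reduce a FontDescriptor /FontName to a plain font name.

    Steps (the last one is a guess based on the names in this file):
    1. drop the leading "/" of the PDF name object,
    2. decode PDF "#xx" hex escapes (e.g. "#20" is a space),
    3. strip a leading "*" and a trailing "-<digits>" suffix.
    """
    name = pdf_name.lstrip("/")
    name = re.sub(r"#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), name)
    name = name.lstrip("*")
    name = re.sub(r"-\d+$", "", name)
    return name.strip()


def lookup_metrics(pdf_name: str, known_metrics: dict):
    """Return the caller-supplied metrics for this font, or None if unknown.

    `known_metrics` maps clean font names to (ascent, descent) pairs taken
    from a well-formed copy of the font.
    """
    return known_metrics.get(normalize_font_name(pdf_name))
```

With the descriptor above, `normalize_font_name("/*Minion#20Pro-Bold-6047")` yields `"Minion Pro-Bold"`, which can then serve as the lookup key.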


JorjMcKie · 2025-04-27T12:43:38Z

Maintainer

You can try a simplified form of guessing. The following script modifies the PDF by writing typical values into the applicable PDF objects.
The output PDF should behave much better in terms of bounding box values for words, text spans and characters:

import pymupdf
from pymupdf import mupdf

doc = pymupdf.open("input.pdf")
pdfdoc = pymupdf._as_pdf_document(doc)  # access underlying PDF document
for xref in range(1, doc.xref_length()):  # walk over all PDF objects
    try:  # skip problem xref numbers
        text = doc.xref_object(xref)
    except Exception:
        continue
    # Only modify FontDescriptors. Testing for "/FontDescriptor" alone would
    # also match font dictionaries that merely reference a descriptor.
    if "/Type /FontDescriptor" not in text:
        continue
    obj = mupdf.pdf_load_object(pdfdoc, xref)  # load PDF object
    # put typical values in Ascent, Descent, etc.
    obj.pdf_dict_put_int(pymupdf.PDF_NAME("Ascent"), 1000)  # experiment also with 800 etc.
    obj.pdf_dict_put_int(pymupdf.PDF_NAME("Descent"), -200)  # experiment also with -250 etc.
    obj.pdf_dict_del(pymupdf.PDF_NAME("XHeight"))  # make sure to remove nonsense
    obj.pdf_dict_del(pymupdf.PDF_NAME("CapHeight"))  # ditto

doc.save("x.pdf")

1 reply

JorjMcKie Apr 27, 2025
Maintainer

A text extraction by "words" for the modified file looks like this:

Visibly, 1000 is a bit too large for Ascent (900 may be better), and -200 for Descent should probably be closer to -230, etc.
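
To compare different Ascent / Descent settings, it can help to outline every extracted word box and look at the result. The following is a minimal sketch under my own assumptions: the helper names and the output file name are hypothetical, and the median box height is just one convenient number to compare between runs.

```python
from statistics import median


def word_heights(words):
    """Heights of word boxes from page.get_text("words") tuples.

    Each tuple starts with (x0, y0, x1, y1, ...); only these are used.
    """
    return [w[3] - w[1] for w in words]


def outline_words(src, dst):
    """Draw a rectangle around every extracted word and save a copy.

    Returns the median word-box height per page, for a quick comparison
    between different Ascent / Descent settings.
    """
    import pymupdf  # imported here so word_heights() works without it

    medians = []
    doc = pymupdf.open(src)
    for page in doc:
        words = page.get_text("words")  # extract before drawing anything
        for w in words:
            page.draw_rect(pymupdf.Rect(w[:4]), color=(1, 0, 0), width=0.3)
        heights = word_heights(words)
        medians.append(median(heights) if heights else 0.0)
    doc.save(dst)
    doc.close()
    return medians
```

Calling `outline_words("x.pdf", "x-words.pdf")` on the output of the script above writes a copy with red word outlines. Boxes that are clearly taller than the glyphs suggest that Ascent is too large.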
