Impossible to mark text in some .PDF's, primarily affecting home-copies of books. #4477
This is not a bug in PyMuPDF, because we have no dealings with PDF viewers at all. In this PDF, however, you will find negative (!) values for the ascender in the font files, and font descriptors like this:

    33 0 obj
    <<
    /Ascent 67306242 # nonsense!
    /AvgWidth 434
    /CapHeight 100992260 # nonsense!
    /Descent -237 # ok
    /Flags 262148
    /FontBBox [ -81 -238 869 758 ]
    /FontFile3 36 0 R
    /FontName /*Minion#20Pro-Bold-6047
    /ItalicAngle 0
    /Lang /SV
    /MaxWidth 846
    /StemV 100
    /Type /FontDescriptor
    /XHeight 84149251 # nonsense!
    >>
    endobj

For meaningful text position extraction (the basis for all PDF viewers), one of the two value sets must be used.
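To make the "nonsense" annotations above concrete, here is a minimal sketch that flags vertical metrics lying far outside the descriptor's own `/FontBBox`. The function name `implausible_metrics`, the `slack` factor and the integer-only parsing are assumptions made for illustration; this is not how MuPDF or any viewer validates fonts.

```python
import re

# Font descriptor text as quoted above, with the annotations dropped.
DESCRIPTOR = """<<
/Ascent 67306242
/AvgWidth 434
/CapHeight 100992260
/Descent -237
/FontBBox [ -81 -238 869 758 ]
/XHeight 84149251
>>"""


def implausible_metrics(text, slack=2.0):
    """Return vertical metrics whose magnitude far exceeds the FontBBox height range.

    Hypothetical heuristic: a value is suspicious if its absolute value is
    larger than `slack` times the largest vertical FontBBox coordinate.
    Only integer values are parsed.
    """
    bbox = re.search(
        r"/FontBBox\s*\[\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s*\]", text
    )
    if bbox is None:
        return {}  # nothing to compare against
    _llx, lly, _urx, ury = map(int, bbox.groups())
    limit = slack * max(abs(lly), abs(ury))
    bad = {}
    for key in ("Ascent", "Descent", "CapHeight", "XHeight"):
        match = re.search(rf"/{key}\s+(-?\d+)", text)
        if match and abs(int(match.group(1))) > limit:
            bad[key] = int(match.group(1))
    return bad


print(implausible_metrics(DESCRIPTOR))
# {'Ascent': 67306242, 'CapHeight': 100992260, 'XHeight': 84149251}
```

With the values quoted above, only `/Descent` passes the check, which matches the "ok" annotation in the dump.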
You can try some simplified guessing. The following script modifies the PDF by writing typical values into the applicable PDF objects:

    import pymupdf
    from pymupdf import mupdf

    doc = pymupdf.open("input.pdf")
    pdfdoc = pymupdf._as_pdf_document(doc)  # access underlying PDF document
    for xref in range(1, doc.xref_length()):  # walk over all PDF objects
        try:  # skip problem xref numbers
            text = doc.xref_object(xref)
        except Exception:
            continue
        if "/FontDescriptor" not in text:  # only modify FontDescriptors
            continue
        obj = mupdf.pdf_load_object(pdfdoc, xref)  # load PDF object
        # put typical values in Ascent, Descent, etc.
        obj.pdf_dict_put_int(pymupdf.PDF_NAME("Ascent"), 1000)  # experiment also with 800 etc.
        obj.pdf_dict_put_int(pymupdf.PDF_NAME("Descent"), -200)  # experiment also with -250 etc.
        obj.pdf_dict_del(pymupdf.PDF_NAME("XHeight"))  # make sure to remove nonsense
        obj.pdf_dict_del(pymupdf.PDF_NAME("CapHeight"))  # ditto
    doc.save("x.pdf")
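When experimenting with other values, one possible starting point is the descriptor's own `/FontBBox`, whose vertical extent is often close to the real ascent and descent (in the descriptor quoted earlier, `/Descent -237` sits next to a bounding-box bottom of `-238`). The sketch below is only an illustration of that idea: `guess_vertical_metrics` and its `max_abs` limit are hypothetical, and the fallbacks are the typical values used in the script above.

```python
def guess_vertical_metrics(font_bbox, max_abs=2000):
    """Suggest (ascent, descent) from a FontBBox given as [llx, lly, urx, ury].

    Assumption: the top and bottom of the bounding box are usable
    stand-ins for Ascent and Descent. If either looks unusable, fall
    back to the typical values 1000 and -200.
    """
    _llx, lly, _urx, ury = font_bbox
    ascent, descent = ury, lly
    if not 0 < ascent <= max_abs:  # ascent must be positive and sane
        ascent = 1000
    if not -max_abs <= descent < 0:  # descent must be negative and sane
        descent = -200
    return ascent, descent


# FontBBox taken from the descriptor quoted earlier in this thread.
print(guess_vertical_metrics([-81, -238, 869, 758]))  # (758, -238)
```

The resulting pair could then be tried in place of the fixed `1000` and `-200` in the script; whether it gives better text selection for a particular file still has to be checked by hand.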
Description of the bug
Hey guys!
Thanks for a great engine.
I filed this bug on the SumatraPDF discussion page earlier tonight, because at first I thought it was a problem with their viewer. An advanced user there pointed out that the problem is actually not in SumatraPDF but in MuPDF, since it can be reproduced in any software that is based on MuPDF.
Here is the thread I started on the SumatraPDF discussion forum, for reference.
sumatrapdfreader/sumatrapdf#4905
Even if the .pdf is formatted weirdly, I still reckon this is an issue (hopefully an easily fixed one), because it does not happen with web browsers, Okular, Abby Reader, or Adobe Reader.
How to reproduce the bug
Open the following .pdf file in any MuPDF-based reader, for example SumatraPDF.
https://sharey.org/files/0X1cv1.pdf
Now, try to mark some text. You'll see that it kind of wants to mark the entire .pdf at once, not word/letter by letter.
PyMuPDF version
1.25.5
Operating system
Windows
Python version
3.9