I am not sure if this is a bug. #3797
Replies: 3 comments
-
The attached PDF is different from the attached image! This clearly is not an error, and I also see no basis for any "enhancement".
-
I am talking about text extraction. If you look at the second red-underlined line of the reference image, you will find that 'A194/C194 Cu Alloy' and 'Sample Name' are not extracted on the same line.
-
That too is not a bug but a technical peculiarity of MuPDF. You need your own code to recover lines that roughly match the ones visible on the page. But there is example code that can be used for this:

import pymupdf
# import a helper method from the sister package pymupdf4llm
from pymupdf4llm.helpers.get_text_lines import get_text_lines
doc = pymupdf.open("test.pdf")
page = doc[0]
# extract the page text line by line
text = get_text_lines(page)
print(text)

This produces the following output:
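For a sense of what such "own code" might look like without the helper, here is a minimal sketch. It assumes word tuples in the layout returned by `page.get_text("words")`, that is `(x0, y0, x1, y1, text, ...)`, and joins words whose vertical centers are close together. The function names and the `y_tolerance` default are illustrative assumptions, not part of PyMuPDF, and this is not how `get_text_lines` is implemented.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Join words into visual lines (hypothetical helper, not a PyMuPDF API).

    `words` is an iterable of tuples shaped like the items returned by
    page.get_text("words"): (x0, y0, x1, y1, text, ...).
    Words whose vertical centers differ by at most `y_tolerance` points
    from the first word of a row are placed on that row.
    """
    rows = []  # each entry: [center_y of first word, [(x0, text), ...]]
    # Walk the words top to bottom, then left to right.
    for w in sorted(words, key=lambda w: ((w[1] + w[3]) / 2, w[0])):
        x0, y0, x1, y1, text = w[:5]
        center_y = (y0 + y1) / 2
        if rows and abs(center_y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([center_y, [(x0, text)]])
    # Within each row, order words by their left edge and join with spaces.
    return [" ".join(t for _, t in sorted(items)) for _, items in rows]


def lines_from_page(page, y_tolerance=3.0):
    """Apply the grouping to a PyMuPDF page object (hypothetical wrapper)."""
    return group_words_into_lines(page.get_text("words"), y_tolerance)
```

With a page opened as in the snippet above, `lines_from_page(page)` returns a list of strings, one per recovered line. The tolerance probably needs tuning per document: too small and a label and its value stay split, too large and adjacent rows merge.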
-
I have a sample PDF. I hope that these 5 lines of interest can be extracted and displayed correctly.

(Please refer to the lines underlined in red in the attached PNG file.)
The sample PDF file can be found here.
https://www.nxp.com/testreports/360000002263_CDA_194_ZHM_A_HLGN.pdf
(update sample PDF)