Performant calculation of image and text element page coverage #1584
Replies: 18 comments 1 reply
-
Glad to hear that!
Please look at this example, where I did a layout analysis of a fairly complicated PDF (CropBox != MediaBox, image and text areas not always inside page rectangle and overlapping each other, etc.). The gray rectangles are the visible pages ( You definitely do not need Numpy or whatever to do this efficiently:
The result of these steps is the text area of the page. Do the same thing with image blocks, except that each image rect area also gets reduced by any overlap it may have with any text area. The result of this algorithm is the image area of the page. Hope I have not missed any wrinkle ...
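A minimal sketch of the idea in this comment, not the original step list (which is not reproduced here): clip every block to the visible page, take the union area of the text rectangles, then count only the part of the image rectangles that is not already covered by text. Rectangles are assumed to be plain `(x0, y0, x1, y1)` tuples, and the helper names `clip_to_page`, `union_area` and `page_coverage` are made up for illustration. The union is computed on the grid formed by all rectangle edges, which is a swapped-in technique chosen for simplicity, not for speed.

```python
from typing import List, Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def clip_to_page(rect: Rect, page: Rect) -> Optional[Rect]:
    """Return the part of rect inside page, or None if nothing remains."""
    x0, y0 = max(rect[0], page[0]), max(rect[1], page[1])
    x1, y1 = min(rect[2], page[2]), min(rect[3], page[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def union_area(rects: Sequence[Rect]) -> float:
    """Area of the union of the rectangles; overlaps are counted once."""
    xs = sorted({x for r in rects for x in (r[0], r[2])})
    ys = sorted({y for r in rects for y in (r[1], r[3])})
    total = 0.0
    # All rectangle edges cut the plane into grid cells. Each cell lies
    # either completely inside some rectangle or outside all of them.
    for cx0, cx1 in zip(xs, xs[1:]):
        for cy0, cy1 in zip(ys, ys[1:]):
            if any(r[0] <= cx0 and cx1 <= r[2] and r[1] <= cy0 and cy1 <= r[3]
                   for r in rects):
                total += (cx1 - cx0) * (cy1 - cy0)
    return total


def page_coverage(page: Rect, text_rects: Sequence[Rect],
                  image_rects: Sequence[Rect]) -> Tuple[float, float]:
    """Return (text fraction, image fraction) of the page area."""
    text: List[Rect] = []
    for r in text_rects:
        c = clip_to_page(r, page)
        if c is not None:
            text.append(c)
    images: List[Rect] = []
    for r in image_rects:
        c = clip_to_page(r, page)
        if c is not None:
            images.append(c)
    page_area = (page[2] - page[0]) * (page[3] - page[1])
    text_area = union_area(text)
    # Image area excludes whatever is already covered by text.
    image_area = union_area(text + images) - text_area
    return text_area / page_area, image_area / page_area
```

For example, `page_coverage((0, 0, 100, 100), [(0, 0, 50, 50), (25, 25, 75, 75)], [(50, 50, 150, 150)])` clips the image block to the page and subtracts its overlap with the second text block before computing the fractions.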
-
Thanks, that is a good tip that those are implemented as a C function. We have been using I didn't realize until you pointed it out with your example how far the text box can differ from where the inner text is displayed. Regarding the approach you outline (assuming rects are sorted by left/right/top/down bounds): my concern here is that a portion of
-
Ok, sequence is A, B, C.
-
The above gave me
-
I am aware that my algorithm is a first sketch only and not yet optimized.
-
No, I am wrong:
-
Took me a while to think about the correct algorithm ... embarrassingly:
Finally, build the alternating sum to get the area of the rectangle union.
I am not acquainted with Shapely, but I have quite some experience with Numpy.
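The alternating sum mentioned here is the inclusion-exclusion principle: add the areas of all single rectangles, subtract the areas of all pairwise intersections, add the triple intersections, and so on. The sketch below is one way to write that down, assuming rectangles are `(x0, y0, x1, y1)` tuples; the function names are illustrative and not taken from the thread. It visits every subset of rectangles, so it is only practical for a small number of blocks.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Common part of two rectangles, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def rect_area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union_area_inclusion_exclusion(rects: Sequence[Rect]) -> float:
    """Exact union area via the alternating sum over all k-fold overlaps."""
    total = 0.0
    for k in range(1, len(rects) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0  # +singles, -pairs, +triples, ...
        for combo in combinations(rects, k):
            common: Optional[Rect] = combo[0]
            for r in combo[1:]:
                common = intersect(common, r)
                if common is None:
                    break  # empty overlap contributes nothing
            if common is not None:
                total += sign * rect_area(common)
    return total
```

With the three rectangles `(0, 0, 2, 2)`, `(1, 1, 3, 3)` and `(1, 0, 3, 2)` the sum is 12 (singles) minus 5 (pairs) plus 1 (triple), which gives a union area of 8.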
-
I have been experimenting a bit. My thoughts above are correct (I hope), but in real life we probably do not need to care about situations where 4 or more blocks overlap each other - taking up to 3 overlaps into account should be enough. The following script is exact in that case:
produces this output:
The investigated page has 19 text blocks and 1 image block.
So this at least shows that Shapely could be beaten by pure Python in this scenario.
-
And here is a version without using itertools. This delivers the same results but at a much, much higher speed:
With this method, run time grows as O(n**2) with the number of text blocks; the itertools version grows much faster. Protocol:
It should also be fairly easy to add 4-fold area overlaps whenever needed.
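The script from this comment is not reproduced here. As a rough illustration of the same idea, the sketch below uses plain nested loops instead of itertools and stops at 3-fold overlaps, so it is exact only when no point of the page is covered by 4 or more blocks. Rectangles are assumed to be `(x0, y0, x1, y1)` tuples and all names are hypothetical. The innermost loop runs only for pairs that actually overlap, so pages with few overlapping blocks stay close to quadratic cost, while the worst case is cubic.

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def overlap(a: Rect, b: Rect) -> Optional[Rect]:
    """Common part of two rectangles, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def area_of(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union_area_up_to_triples(rects: Sequence[Rect]) -> float:
    """Union area, exact as long as at most 3 rectangles share a point."""
    n = len(rects)
    total = 0.0
    for i in range(n):
        total += area_of(rects[i])            # + single areas
        for j in range(i + 1, n):
            pair = overlap(rects[i], rects[j])
            if pair is None:
                continue                      # no pair, hence no triple
            total -= area_of(pair)            # - pairwise overlaps
            for k in range(j + 1, n):
                triple = overlap(pair, rects[k])
                if triple is not None:
                    total += area_of(triple)  # + 3-fold overlaps
    return total
```

Adding a fourth nested loop that subtracts 4-fold overlaps would extend the same pattern.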
-
Replacing the
-
Out of curiosity, I have re-implemented function
When this C code is compiled as a Python extension (giving I am attaching a ZIP file containing 3 Python scripts implementing the same functionality in different ways:
Two additional files are contained in the ZIP: My tests so far are showing impressive results:
I was a little surprised. Obviously, the interface code between Python and C is adding too much overhead compared to the rather trivial functionality executed on the C level. Anyway, I am interested in your reaction!
-
That code is great, thanks! I have generally avoided extra Cython dependencies since pre-building these for the AWS Lambda environment is a pain, BUT the performance gain you have there makes this case worth it. Thanks.
-
But you could take the provided C code directly as is and compile it. As it stands, it has no more dependencies on Cython. Only Python includes & libs are required.
-
OK to close the issue now?
-
Adjust the code until it works.
-
Hi, I am using this script to measure the ratio of images and text in a PDF file. It usually works as expected. At the top of the file, "DEBITEUR(S) LEGAl(AUX)" does not seem to be text (with fonts) but vector graphics. So I was expecting these areas to be counted as images.
-
You are right: lots of apparent text parts are vector graphics.
-
I see. Do you have an idea about a possible way to estimate the area of vector graphics? Here is the part of your script that detects if a block is an image or a text.
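The quoted script part is not reproduced here. For orientation only, a sketch of how such a split is commonly done: it assumes each block is a tuple shaped like the ones returned by `Page.getTextBlocks(images=True)`, that is `(x0, y0, x1, y1, content, block_no, block_type)`, with `block_type` 0 for text and 1 for image. The sample data and the name `split_blocks` are made up. Vector graphics appear in neither group, which matches the observation above that vectorized "text" is not counted as an image.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def split_blocks(blocks: Sequence[tuple]) -> Tuple[List[Rect], List[Rect]]:
    """Separate block tuples into text rectangles and image rectangles.

    Assumes the last tuple item is the block type: 0 = text, 1 = image.
    """
    text_rects: List[Rect] = []
    image_rects: List[Rect] = []
    for block in blocks:
        rect = (block[0], block[1], block[2], block[3])
        if block[-1] == 1:
            image_rects.append(rect)
        else:
            text_rects.append(rect)
    return text_rects, image_rects


# Hypothetical sample in the assumed tuple layout.
sample_blocks = [
    (72.0, 72.0, 300.0, 90.0, "A heading\n", 0, 0),
    (72.0, 100.0, 500.0, 400.0, "<image: placeholder>", 1, 1),
]
text_rects, image_rects = split_blocks(sample_blocks)
```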
-
I very much like the new Page.getTextBlocks(images=True), thank you for adding that :). One of the most important uses for us is to calculate the % of the page area covered by image blocks and the % of the page area covered by text blocks. We need these derived values to decide (with a threshold) whether the page needs to be processed with OCR as part of our pipeline. We are calculating the total union area of the rectangle blocks using packages with C bindings such as Numpy or Shapely. We dislike having these requirements, although doing this in straight Python could be much slower.
This is a general feature request: it would be nice to have metrics like this (with a highly performant implementation) as part of the page model.
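To make the intended use concrete, here is a small sketch of a threshold decision based on the two coverage fractions. The rule itself and both default thresholds are assumptions for illustration; the post does not state which values or which combination of the two metrics the pipeline uses.

```python
def needs_ocr(text_fraction: float, image_fraction: float,
              max_text_fraction: float = 0.05,
              min_image_fraction: float = 0.5) -> bool:
    """Guess whether a page should go through OCR.

    Assumed rule: a page that is mostly image and carries almost no
    extractable text is probably a scan. Both thresholds are hypothetical
    and would need tuning on real documents.
    """
    return (image_fraction >= min_image_fraction
            and text_fraction <= max_text_fraction)


# A page that is 90% image with 1% text is flagged; a text-heavy page is not.
scan_like = needs_ocr(text_fraction=0.01, image_fraction=0.90)
text_heavy = needs_ocr(text_fraction=0.60, image_fraction=0.10)
```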