Performant calculation of image and text element page coverage #1584

juviwhale · 2018-01-17T00:34:59Z

juviwhale
Jan 17, 2018

I very much like the new Page.getTextBlocks(images=True), thank you for adding that :). One of the most important uses for us is to calculate

the % of the page area covered by image blocks
and the % of the page area covered by text blocks.

We need these derived values to decide (with a threshold) whether the page needs to be processed with OCR as part of our pipeline. We currently calculate the total union area of the block rectangles using packages with C bindings such as Numpy or Shapely. We dislike having these requirements, although doing this in straight Python could be much slower.
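
As a rough illustration of the thresholding described above, the decision could be reduced to a small predicate over the two percentages. The function name `needs_ocr` and both default thresholds are hypothetical placeholders, not values from this thread; they would need tuning against real documents.

```python
def needs_ocr(text_pct, image_pct, min_text_pct=5.0, min_image_pct=50.0):
    """Guess whether a page is a scan: little extractable text, mostly image.

    text_pct / image_pct are page coverage percentages (0..100).
    Both thresholds are assumed defaults for illustration only.
    """
    return text_pct < min_text_pct and image_pct >= min_image_pct


print(needs_ocr(1.0, 90.0))      # scanned-looking page
print(needs_ocr(45.84, 34.61))   # page with plenty of real text
```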

This is a general feature request: it would be nice to have metrics like this (with a highly performant implementation) as part of the page model.

JorjMcKie · 2018-01-17T01:52:13Z

JorjMcKie
Jan 17, 2018
Maintainer

Glad to hear that!
Your request is a little more complex than it looks at first sight, because all the block rectangles may overlap each other. So, what probably needs to be done is

setting a priority between areas covered by text and areas covered by images. All of them may overlap each other "chaotically".
Assuming text area priority precedes images areas: calculate the overall text area (an "amalgamated sum"), then calculate the overall image area (again an amalgamated sum) that does not intersect with any text area.

Please look at this example, where I did a layout analysis of a fairly complicated PDF (CropBox != MediaBox, image and text areas not always inside page rectangle and overlapping each other, etc.). The gray rectangles are the visible pages (CropBox), text blocks have a blue, image blocks have a red border.

You definitely do not need Numpy or whatever to do this efficiently:

1. Make a list of all non-empty text rectangle intersections with the page rectangle (abs(rect & page.rect) > 0; the intersection operator & is implemented as a C function).
2. Calculate their area by stepping through this list and subtracting any overlap area a rect has with predecessors.

The result of these steps is the text area of the page.

Do the same thing with image blocks, except that each image rect area also gets reduced by any overlap it may have with any text area. The result of this algorithm is the image area of the page.

Hope I have not missed any wrinkle ...


juviwhale · 2018-01-17T02:33:20Z

juviwhale
Jan 17, 2018
Author

Thanks, that is a good tip that those are implemented as a C function. We have been using Shapely's cascaded_union(list_of_rects), after normalizing the rects to in_page constraints, to get an "amalgamated sum". We have been doing it this way because we have read that Shapely uses an efficient sweep algorithm to reduce the complexity. However, we are seeing CPU times of ~50ms per page, which seems high.

I didn't realize till you pointed it out with your example how the text box can differ so far from where the inner text is displayed.

Regarding step 2 of the approach you outline ("Calculate their area by stepping through this list and subtracting any overlap area a rect has with predecessors."): wouldn't this fail for some edge cases? e.g.

(assuming rects are sorted by left/right/top/down bounds)

My concern here is that a portion of B would be subtracted but not of A.


JorjMcKie · 2018-01-17T02:46:05Z

JorjMcKie
Jan 17, 2018
Maintainer

Ok, sequence is A, B, C.

Take area A
Take area B minus intersection with A (which is 0), add remainder to total area, hence area A + area B so far.
Take area C minus area of intersection rect of C and A, minus intersection rect of C and B
Add the rest to total (text) area.
Should be good I suppose. Here is a script I just tried with that mentioned complex case:

import fitz

doc = fitz.open("demo1.pdf")
page = doc[0]
prect = page.rect
page_area = abs(prect)                      # page area
text_area = 0
img_area  = 0

blks = page.getTextBlocks(images=True)
ttab = []
itab = []
for b in blks:
    r = fitz.Rect(b[:4]) & prect            # intersection w/ page rect
    if abs(r) > 0:
        if b[-1] == 0:
            ttab.append(r)             # store relevant rect with its type
        else:
            itab.append(r)

text_area = 0
img_area  = 0
for i, r in enumerate(ttab):
    a = abs(r)
    for j in range(i-1, -1, -1):
        a -= abs(r & ttab[j])
    if a > 0:
        text_area += a

for i, r in enumerate(itab):
    a = abs(r)
    for j in range(i-1, -1, -1):
        a -= abs(r & itab[j])
    for s in ttab:
        a -= abs(r & s)
    if a > 0:
        img_area += a

print("text area %g" % (text_area*100/page_area))
print("image area %g" % (img_area*100/page_area))


JorjMcKie · 2018-01-17T02:46:57Z

JorjMcKie
Jan 17, 2018
Maintainer

The above gave me
text area 45.8427 %
image area 34.4839 %


JorjMcKie · 2018-01-17T02:53:04Z

JorjMcKie
Jan 17, 2018
Maintainer

I am aware that my algorithm is a first sketch only and not yet optimized.


JorjMcKie · 2018-01-17T03:09:27Z

JorjMcKie
Jan 17, 2018
Maintainer

No, I am wrong:
The delta area coming from C should be:
abs(C) - abs(C & B) - abs(C & A) + abs(C & B & A)
because A and B may overlap too, and this intersection with C would be subtracted twice with the above.
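
A small worked check of this correction, using plain (x0, y0, x1, y1) tuples rather than fitz rectangles. The helper names `area` and `intersect` and the three example rectangles are made up for illustration. C lies completely inside the union of A and B, so it should contribute nothing to the total.

```python
def area(r):
    """Area of an (x0, y0, x1, y1) rectangle, 0 if it is empty."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def intersect(r, s):
    """Intersection of two rectangles (an empty result has x1 <= x0 or y1 <= y0)."""
    return (max(r[0], s[0]), max(r[1], s[1]), min(r[2], s[2]), min(r[3], s[3]))


A = (0, 0, 4, 4)
B = (2, 0, 6, 4)   # overlaps A
C = (1, 1, 5, 3)   # lies entirely inside the union of A and B

# First sketch: subtract each overlap with a predecessor.
naive = area(C) - area(intersect(C, A)) - area(intersect(C, B))      # 8 - 6 - 6

# Corrected: add back the part of C that was subtracted twice.
corrected = naive + area(intersect(intersect(C, B), A))              # ... + 4

print(naive, corrected)
```

The first sketch yields a negative contribution for C, which the `if a > 0` guard in the earlier script would only mask, while the corrected formula gives 0.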


JorjMcKie · 2018-01-17T13:00:26Z

JorjMcKie
Jan 17, 2018
Maintainer

Took me a while to think about the correct algorithm ... embarrassingly:
To calculate the area of a union of, say, 10 rectangles, you need to create a 10-array arr = [0.0] * 10 of floats and then calculate:

the sum of all 10 rectangle areas and put it in arr[0]
the sum of all intersections of 2 rectangles => arr[1]
the sum of all triple intersections => arr[2]
...

Finally build the alternating sum to get the area of the rectangle union: area = arr[0] - arr[1] + arr[2] - ... - arr[9].
So for n rectangles, 2**n - n - 1 intersection areas must be calculated and be put in the right slot of an n-array of floats. The alternating sum of the items of this array is our area of rectangle unions.

I am not acquainted with Shapely, but I have quite some experience with Numpy.
I still believe that an algorithm like the above sketch should be efficient enough in Python though. After all, the number n of text blocks in a page should be small, a 1-digit or small 2-digit number. As a last resort, the whole thing could be refactored using Cython - as per my experience absolutely sufficient to achieve C-performance out of numeric-only Python code.
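
The alternating sum described in this post can be written down directly with itertools.combinations. This is only a sketch on plain (x0, y0, x1, y1) tuples, not code from the thread; the name `union_area_exact` is hypothetical, and because it visits every subset of rectangles its cost grows as 2**n, so it is only practical for small n.

```python
from itertools import combinations


def union_area_exact(rects):
    """Union area of (x0, y0, x1, y1) rectangles by full inclusion-exclusion."""
    total = 0.0
    for k in range(1, len(rects) + 1):
        # k-fold intersections are added for odd k and subtracted for even k.
        sign = 1.0 if k % 2 else -1.0
        for combo in combinations(rects, k):
            x0 = max(r[0] for r in combo)
            y0 = max(r[1] for r in combo)
            x1 = min(r[2] for r in combo)
            y1 = min(r[3] for r in combo)
            if x1 > x0 and y1 > y0:          # skip empty intersections
                total += sign * (x1 - x0) * (y1 - y0)
    return total


rects = [(0, 0, 4, 4), (2, 0, 6, 4), (1, 1, 5, 3)]
print(union_area_exact(rects))   # 40 - 20 + 4
```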


JorjMcKie · 2018-01-17T17:43:28Z

JorjMcKie
Jan 17, 2018
Maintainer

I have been experimenting a bit. My thoughts above are correct (I hope), but in real life we probably do not need to care about situations where 4 or more blocks overlap each other - taking up to 3 overlaps into account should be enough. The following script is exact in that case:

from __future__ import print_function
import time
import fitz
from itertools import combinations
def unionArea(rlist):
    nolaps = 3      # number of overlaps we are regarding
    arr = [0.0] * nolaps
    arr[0] = sum([abs(r) for r in rlist])
    for i in range(2, nolaps + 1):
        for lst in combinations(rlist, i):
            r0 = lst[0]
            for r in lst[1:]:
                r0 &= r
                if r0 == fitz.Rect():  # this rect is empty
                    break
            arr[i - 1] += abs(r0)
    area = 0
    f = 1.
    for a in arr:
        area += a * f
        f *= -1
    return area

doc = fitz.open("demo1.pdf")
page = doc[0]
parea = abs(page.rect)
tlist = []
ilist = []
for b in page.getTextBlocks(images = True):
    r = fitz.Rect(b[:4]) & page.rect
    if abs(r) > 0:
        if b[-1] == 0:
            tlist.append(r) # a text block
        else:
            ilist.append(r) # an image block
t0 = time.perf_counter()        # time.clock() was removed in Python 3.8
tarea = unionArea(tlist)
t1 = time.perf_counter()
area = unionArea(tlist + ilist) # I know, this is wasting some time. But see below
t2 = time.perf_counter()
iarea = area - tarea            # image area as the delta to text area
print("text area = %g, time = %g sec" % (tarea, (t1-t0)))
print("text percentage = %g%%" % (round(tarea*100/parea, 2)))
print("image percentage = %g%%" % (round(iarea*100/parea, 2)))
print("page area = %g" % parea)

produces this output:

text area = 188375, time = 0.073701 sec
text percentage = 45.84%
image percentage = 34.61%
page area = 410916

The investigated page has 19 text blocks and 1 image block.
I am aware that 73ms are beyond your tolerable threshold. On the other hand:

I'm only using pure Python
using itertools for sure is not the most performant way

so this at least shows that Shapely could be beaten by pure Python in this constellation.


JorjMcKie · 2018-01-17T20:02:28Z

JorjMcKie
Jan 17, 2018
Maintainer

And here is a version without using itertools. This delivers the same results but at a much, much higher speed:

import time
import fitz
def unionArea(rlist):
    arr = [0.0] * 3
    area = 0
    arr[0] = sum([abs(r) for r in rlist])
    for i, r in enumerate(rlist, start=1):
        arr[1] += sum([abs(r & s) for s in rlist[i:]])
    for i, r in enumerate(rlist, start=1):
        for j, s in enumerate(rlist[i:], start=1):
            t = r & s
            if abs(t) > 0: # <== this is the major performance gain!
                arr[2] += sum([abs(t & u) for u in rlist[i+j:]])

    f = 1.
    for a in arr:
        area += a * f
        f *= -1
    return area

doc = fitz.open("demo1.pdf")
page = doc[0]
parea = abs(page.rect)
tlist = []
ilist = []
for b in page.getTextBlocks(images = True):
    r = fitz.Rect(b[:4]) & page.rect
    if abs(r) > 0:
        if b[-1] == 0:
            tlist.append(r) # a text block
        else:
            ilist.append(r) # an image block
t0 = time.perf_counter()        # time.clock() was removed in Python 3.8
tarea = unionArea(tlist)
t1 = time.perf_counter()
area = unionArea(tlist + ilist)
t2 = time.perf_counter()
iarea = area - tarea
print("text area = %g, time = %g sec" % (tarea, (t1-t0)))
print("text percentage = %g%%" % (round(tarea*100/parea, 2)))
print("image percentage = %g%%" % (round(iarea*100/parea, 2)))
print("page area = %g" % parea)

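With this method, time grows roughly O(n**2) with the number of text blocks as long as few pairs overlap, because the innermost loop only runs for overlapping pairs (O(n**3) in the worst case, when most pairs overlap). The itertools version grows much faster. Output: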

text area = 188375, time = 0.0117736 sec
text percentage = 45.84%
image percentage = 34.61%
page area = 410916

It should also be fairly easy to add 4-fold area overlaps whenever needed.


JorjMcKie · 2018-01-18T11:50:57Z

JorjMcKie
Jan 18, 2018
Maintainer

Replacing the unionArea function above with the following again doubles speed - at least for the example I used:

def unionArea(rlist):
    arr_0 = arr_1 = arr_2 = 0.0
    for i, r in enumerate(rlist, start=1):
        arr_0 += abs(r)
        for j, s in enumerate(rlist[i:], start=1):
            t = r & s
            abs_t = abs(t)
            if abs_t > 0:
                arr_1 += abs_t
                arr_2 += sum([abs(t & u) for u in rlist[i+j:]])

    return arr_0 - arr_1 + arr_2
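
If pages with more than three mutually overlapping blocks matter, a different technique avoids the truncation altogether: coordinate compression. This is not the method used in this thread, only a sketch of an alternative on plain (x0, y0, x1, y1) tuples, and the name `union_area_grid` is hypothetical. It cuts the plane along every rectangle edge, so each resulting grid cell is either fully covered or not covered at all.

```python
def union_area_grid(rects):
    """Exact union area of (x0, y0, x1, y1) rectangles via coordinate compression."""
    xs = sorted({x for r in rects for x in (r[0], r[2])})
    ys = sorted({y for r in rects for y in (r[1], r[3])})
    total = 0.0
    for xa, xb in zip(xs, xs[1:]):           # consecutive distinct x cuts
        for ya, yb in zip(ys, ys[1:]):       # consecutive distinct y cuts
            # Count the cell once if any rectangle contains it.
            if any(r[0] <= xa and xb <= r[2] and r[1] <= ya and yb <= r[3]
                   for r in rects):
                total += (xb - xa) * (yb - ya)
    return total


print(union_area_grid([(0, 0, 1, 1)] * 4))   # four identical unit squares
```

For four identical unit squares the three-level alternating sum gives 4 - 6 + 4 = 2, while the true union area is 1, which is what the grid version returns. The price is O(n**3) work in pure Python, independent of how deeply the rectangles overlap.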


JorjMcKie · 2018-01-19T01:47:28Z

JorjMcKie
Jan 19, 2018
Maintainer

Out of curiosity, I have re-implemented function unionArea in the following way:

removed any PyMuPDF dependency by replacing the abs and & operators by pure Python code
respecting up to 4-fold area overlaps, so results become (slightly) incorrect if more than 4 overlaps are present.
put the function in a separate Cython source file (extension .pyx) and generated C code from it.

When this C code is compiled as a Python extension (giving unionArea.so, resp. unionArea.pyd in Windows), the function can be imported with from unionArea import unionArea.

I am attaching a ZIP file containing 3 Python scripts implementing the same functionality in different ways:

aArea.py - uses unionArea() like in my previous post (except 4 overlaps are supported)
aAreaX.py - uses a function with removed PyMuPDF dependency. Therefore a list of float quadruples must be passed to unionArea(), not fitz rectangles.
aArea-PYX.py - like aAreaX.py but unionArea is now implemented as a C function.

Two additional files are contained in the ZIP: unionArea.pyx, the Cython source, and unionArea.c, its generated C source.
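
For readers without access to the ZIP, a function that takes float quadruples instead of fitz rectangles might look roughly like the following. This is a guess at the general shape, not the attached code: the name `union_area_tuples` is hypothetical, and this sketch only considers up to 3-fold overlaps (like the function in the previous post), not the 4-fold overlaps the attached scripts support.

```python
def union_area_tuples(rlist):
    """Union area of (x0, y0, x1, y1) tuples, exact for up to 3-fold overlaps."""

    def inter(r, s):
        # Pure Python replacement for the fitz '&' operator.
        return (max(r[0], s[0]), max(r[1], s[1]), min(r[2], s[2]), min(r[3], s[3]))

    def area(r):
        # Pure Python replacement for abs(rect); empty rectangles count as 0.
        w, h = r[2] - r[0], r[3] - r[1]
        return w * h if w > 0 and h > 0 else 0.0

    a1 = a2 = a3 = 0.0
    for i, r in enumerate(rlist):
        a1 += area(r)
        for j in range(i + 1, len(rlist)):
            t = inter(r, rlist[j])
            at = area(t)
            if at > 0:                       # only overlapping pairs go deeper
                a2 += at
                for u in rlist[j + 1:]:
                    a3 += area(inter(t, u))
    return a1 - a2 + a3


# With fitz rectangles at hand, tuple(rect) would yield the quadruple.
print(union_area_tuples([(0, 0, 4, 4), (2, 0, 6, 4), (1, 1, 5, 3)]))
```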

My tests so far are showing impressive results:

aArea-PYX.py is 5 to 6 times faster than its pure Python version, and more than 80 (!) times faster than aArea.py
aAreaX.py is about 15 times faster when no use of PyMuPDF is made in unionArea.

I was a little surprised. Obviously, the interface code between Python and C is adding too much overhead compared to the rather trivial functionality executed on the C-level. Anyway - I am interested in your reaction!
unionArea.zip


juviwhale · 2018-01-19T21:46:06Z

juviwhale
Jan 19, 2018
Author

That code is great, thanks! I have generally avoided extra Cython dependencies, since pre-building these for the AWS Lambda environment is a pain, BUT the performance gain you have there makes this case worth it.

Thanks.


JorjMcKie · 2018-01-19T22:25:29Z

JorjMcKie
Jan 19, 2018
Maintainer

But you could take the provided C code directly as is and compile it. As it stands, it no longer depends on Cython. Only the Python includes & libs are required.
Forgot to mention: the binary resulting from this compilation can just be copied into the script directory - nothing needs to be installed or whatever.


JorjMcKie · 2018-01-19T22:25:56Z

JorjMcKie
Jan 19, 2018
Maintainer

ok, to close the issue now?


JorjMcKie · 2022-02-04T13:06:15Z

JorjMcKie
Feb 4, 2022
Maintainer

Adjust the code until it works.


fareshan · 2025-02-11T18:24:32Z

fareshan
Feb 11, 2025

Hi,

I am using this script to measure the ratio of images and text in a pdf file. It usually works as expected.
But I have this file, where I do not understand why it says images area = 0
2nd page.pdf

At the top of the file, "DEBITEUR(S) LEGAl(AUX)" does not seem to be a text (with fonts) but vectorized.
It is the same for "Propriétés bâties" on the left side

So I was expecting these areas to be counted as images.


JorjMcKie · 2025-02-11T20:42:28Z

JorjMcKie
Feb 11, 2025
Maintainer

You are right: lots of apparent text parts are vector graphics.
Vector graphics are not images, so they will not appear as embedded, extractable files, as you will see when executing page.get_images() etc.
The page (directly or indirectly) will contain "atomic" draw commands instead, which you can also create yourself with the draw_*() methods of the Shape class.


fareshan · 2025-02-11T20:50:34Z

fareshan
Feb 11, 2025

I see. Do you have an idea about a possible way to estimate the area of vector graphics?

Here is the part of your script that detects if a block is an image or a text.

parea = abs(page.rect)
tlist = []
ilist = []
for b in page.getTextBlocks(images = True):
    r = fitz.Rect(b[:4]) & page.rect
    if abs(r) > 0:
        if b[-1] == 0:
            tlist.append(tuple(r)) # a text block
        else:
            ilist.append(tuple(r)) # an image block

1 reply

JorjMcKie Feb 11, 2025
Maintainer

No, this code snippet only looks at text and image bboxes.
paths = page.get_drawings() extracts vector graphics; see the get_drawings documentation.
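
A minimal sketch of how the drawings could feed into the same kind of coverage estimate. It assumes each item returned by page.get_drawings() is a dict whose "rect" entry is the bounding box of that path; the function name `drawings_area` is hypothetical, and the union is computed by a simple grid method rather than the alternating sum used earlier in this thread. Bounding boxes overestimate the inked area, and zero-width or zero-height paths such as plain lines are dropped, so the result is only a rough upper estimate.

```python
def drawings_area(paths, page_rect):
    """Union area of the drawings' bounding boxes, clipped to page_rect."""
    px0, py0, px1, py1 = page_rect
    boxes = []
    for p in paths:
        x0, y0, x1, y1 = p["rect"]
        # Clip each bounding box to the page.
        x0, y0, x1, y1 = max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1)
        if x1 > x0 and y1 > y0:
            boxes.append((x0, y0, x1, y1))

    # Cut the plane along all box edges; each cell is covered or not.
    xs = sorted({x for b in boxes for x in (b[0], b[2])})
    ys = sorted({y for b in boxes for y in (b[1], b[3])})
    total = 0.0
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            if any(b[0] <= xa and xb <= b[2] and b[1] <= ya and yb <= b[3]
                   for b in boxes):
                total += (xb - xa) * (yb - ya)
    return total


# Stand-in data shaped like get_drawings() output; with PyMuPDF one would pass
# page.get_drawings() and tuple(page.rect) instead.
demo_paths = [{"rect": (0, 0, 10, 10)}, {"rect": (5, 5, 15, 15)}, {"rect": (-5, -5, 0, 0)}]
print(drawings_area(demo_paths, (0, 0, 12, 12)))
```

Dividing the result by abs(page.rect) gives a percentage comparable to the text and image figures above.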
