Is there any way that I can identify whether the PDF is edited/tampered and the exact location where the PDF is edited/tampered using Python? #892

AbhishekTanksali · 2021-02-09T04:02:32Z

Hi Folks,

I am working on identifying forgery/tampering in bank statement PDF documents. Info metadata and XMP metadata are not always present in the PDFs that I have, so I am not able to create a generalized rule to identify tampered PDFs. I am using Python libraries such as PyMuPDF, PDFMiner, PyPDF2 etc.

I have 2 questions:

1. Is there any concrete way to identify whether the PDF is tampered (using PyMuPDF/Python/any other open source technology)?
2. If the PDF is tampered, then which part of the PDF has been tampered (using PyMuPDF/Python/any other open source technology)?

Attaching 2 PDFs for reference -

original :- "sbi statment_out2.pdf" link - https://drive.google.com/file/d/1DoWAKYcCudRO-Cwjbgf7RjiJUsF3DD3s/view?usp=sharing

Tampered using Sejda online editor :- "sbi statment_out2_Sejda_edited.pdf" link - https://drive.google.com/file/d/1J4eRy9tO3jN8AqEWNrKXtn40G6vdH5G3/view?usp=sharing

In the tampered PDF, I have edited '2,412.00' under the 'Credit' column to '12.00'.

Kindly let me know if there is any open source solution, preferably in Python.

Thanks.

JorjMcKie · 2021-02-09T11:20:08Z

Maintainer

In your example cases, the answer to the question "Are the PDFs equal?" is simple.
You can use the file creation / modification timestamps to answer this, and / or the file sizes. Best use the Python built-in module os here.
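
A minimal sketch of such a file-level comparison, using only the standard library. The helper names `file_fingerprint` and `files_identical` are made up for illustration, and hashing the content goes slightly beyond the size / timestamp check mentioned above. Keep in mind that timestamps describe the file on disk and may change when a file is merely copied or downloaded, so they are only a weak hint.

```python
import hashlib
import os


def file_fingerprint(path):
    """Return size, last-modification time and SHA-256 digest of a file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return {"size": st.st_size, "mtime": st.st_mtime, "sha256": digest}


def files_identical(path_a, path_b):
    """True only if both files have the same size and the same content hash."""
    a = file_fingerprint(path_a)
    b = file_fingerprint(path_b)
    return a["size"] == b["size"] and a["sha256"] == b["sha256"]
```

For the two example files, one would call `files_identical("sbi statment_out2.pdf", "sbi statment_out2_Sejda_edited.pdf")`.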

On the PDF level, you can look at some core indicators:

>>> import fitz
>>> doc1 = fitz.open("sbi statment_out2.pdf")
>>> doc2 = fitz.open("sbi statment_out2_Sejda_edited.pdf")
>>> from pprint import pprint
>>> pprint(doc1.metadata)
{'author': '',
 'creationDate': "D:20200911140637+05'30'",
 'creator': '',
 'encryption': None,
 'format': 'PDF 1.4',
 'keywords': '',
 'modDate': "D:20200911140637+05'30'",
 'producer': 'iText 2.0.4 (by lowagie.com)',
 'subject': '',
 'title': '',
 'trapped': ''}
>>> pprint(doc2.metadata)
{'author': '',
 'creationDate': "D:20200911140637+05'30'",
 'creator': 'sejda.com (4.1.7)',
 'encryption': None,
 'format': 'PDF 1.5',
 'keywords': '',
 'modDate': "D:20210209042635+01'00'",
 'producer': 'SAMBox 2.2.12',
 'subject': '',
 'title': '',
 'trapped': ''}

As you can see, a lot of parameters were changed when you created the modified file.
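
The comparison of the two dictionaries can also be automated. The following is only a sketch: `diff_metadata` is a hypothetical helper, and the two dictionaries are copied from the output shown above so that the snippet runs without the PDF files.

```python
def diff_metadata(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Values copied from the doc1.metadata / doc2.metadata output above.
meta1 = {
    "author": "", "creationDate": "D:20200911140637+05'30'", "creator": "",
    "encryption": None, "format": "PDF 1.4", "keywords": "",
    "modDate": "D:20200911140637+05'30'",
    "producer": "iText 2.0.4 (by lowagie.com)",
    "subject": "", "title": "", "trapped": "",
}
meta2 = {
    "author": "", "creationDate": "D:20200911140637+05'30'",
    "creator": "sejda.com (4.1.7)", "encryption": None, "format": "PDF 1.5",
    "keywords": "", "modDate": "D:20210209042635+01'00'",
    "producer": "SAMBox 2.2.12", "subject": "", "title": "", "trapped": "",
}

for key, (before, after) in diff_metadata(meta1, meta2).items():
    print(f"{key}: {before!r} -> {after!r}")
```

With real files, `doc1.metadata` and `doc2.metadata` would be passed in instead of the literal dictionaries.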

The PDF trailer contains an ID key with two items. The first item is created together with the PDF itself and is not supposed to change afterwards. The second item is supposed to be updated with each file modification. So if the two are equal, this suggests that the file has not been modified since it was written, provided the producing software follows this convention.

>>> print(doc1.pdf_trailer())
<<
  /Size 23
  /Info 10 0 R
  /Root 9 0 R
  /ID [ <972B505462C2D50EFA775DC2D33B032D> <F5C4CAA260BAF2A9A812728CBBE31A03> ]
>>
>>> print(doc2.pdf_trailer())
<<
  /Size 25
  /Info 3 0 R
  /Root 2 0 R
  /ID [ <44C24EDE24FE0529FCEC5A78BD6CBCE9> <44C24EDE24FE0529FCEC5A78BD6CBCE9> ]
  /Type /XRef
  /Index [ 0 25 ]
  /W [ 1 2 2 ]
  /DL 125
  /Filter /FlateDecode
  /Length 85
>>
>>>

You see (again) that your modification has created a completely new file, and therefore has a different first item than the original.
Your new file itself has not yet been changed, because item 1 == item 2.
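
The ID check can be sketched as a small helper that works on the trailer text as printed by `doc.pdf_trailer()` above. The names `trailer_ids` and `ids_differ` are made up for illustration, and the regular expression is an assumption that only covers the hexadecimal `<...>` notation shown here, not IDs written as literal strings.

```python
import re

# Matches "/ID [ <hex1> <hex2> ]" as it appears in the trailer output above.
ID_PATTERN = re.compile(r"/ID\s*\[\s*<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>\s*\]")


def trailer_ids(trailer_text):
    """Return (first_id, second_id), or None if no hex /ID entry is found."""
    match = ID_PATTERN.search(trailer_text)
    if match is None:
        return None
    return match.group(1).upper(), match.group(2).upper()


def ids_differ(trailer_text):
    """True if both IDs exist and differ, False if equal, None if /ID is missing."""
    ids = trailer_ids(trailer_text)
    if ids is None:
        return None
    return ids[0] != ids[1]
```

Usage would be `ids_differ(doc.pdf_trailer())`. A result of `True` is a hint only, not proof of a modification.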

So much for the question "Have changes occurred at all?".
To answer the question "What are the changes?", things are more challenging - continue with next post.

8 replies

JorjMcKie Feb 9, 2021
Maintainer

If you do not have 2 PDFs to compare, then things are a lot more difficult - and sometimes impossible to answer. Let's see:

...where the forgery happened.

Did you mean, the fraud - if it is a fraud - would have happened there?

- You (or the client) can always check the last change date/time of the PDF file.
- You can look at the PDF trailer to see if the two ID fields are different, as I wrote above.
- You can print the page's text. If the forgery happened with the usual techniques, the manipulated data should be found at the end of the existing text. Let me demonstrate with your forged example:

>>> import fitz
>>> doc = fitz.open("sbi statment_out2_Sejda_edited.pdf")
>>> page = doc[0]
>>> blocks = page.get_text("dict", flags=0)["blocks"]
>>> spans = []
>>> for b in blocks:
...     for l in b["lines"]:
...         for s in l["spans"]:
...             spans.append((s["bbox"], s["text"]))
...
>>> import json
>>> out = open("spans.json", "w")
>>> out.write(json.dumps(spans))
13905
>>> out.close()
>>>

The produced list spans looks like this:
spans.zip

Your forgery has been stored at the end of the previously existing text of the page, and can therefore be found at the bottom of that list, together with the coordinates of the rectangle into which you have stored the fraudulent text "12.00":

The four preceding floats are the text's top-left and bottom-right rectangle corners. One can also see that the "12.00" was (inaccurately) put to the left of this amount:

AbhishekTanksali Feb 10, 2021
Author

Hello Sir,

Thanks for your reply.

I have seen that if the PDF is just compressed using an online website such as ilovepdf.com, this also modifies the ModDate in the Info metadata. Here I haven't modified any text, I just compressed the PDF to reduce its size.
So a changed ModDate in the metadata doesn't guarantee that the PDF has been edited. It may just have been compressed.
Also, I have seen that the trailer ID fields in an original PDF (of a bank statement) can be different. At least the original bank statement PDF which I have shows different trailer IDs. So different trailer ID fields don't guarantee that the PDF is tampered.
Also, I have seen that the Prev tag of the trailer is not consistently inserted in the edited PDF. When I tried editing the PDF using smallpdf.com, it added a Prev tag to the trailer. But when I tried editing using ilovepdf.com, it didn't add any Prev tag. So the Prev tag is also not a concrete indicator that the PDF is edited.

Is there by any way possible to identify - if the PDF is just compressed vs if the PDF is indeed edited?

If we can figure out that the PDF is indeed edited, then with the help of the 'span' logic you showed above, we can safely assume that at least the last entry in the span json is the edited entry.

Kindly guide. Your guidance will be of paramount importance.
Thanks.

AbhishekTanksali Feb 10, 2021
Author

Hello Sir,

Few things I observed as below:

[A]
The blocks(BBOX and text) in span json output for the original bank PDF are not in proper order.

    My original PDF is divided into 3 sections as below:
    1. header info such as customer name, customer address, branch address, account number
    2. Actual bank transaction table
    3. Footer info
    
    Span BBOX output starts with 2nd section first i.e. the transaction table. Then it starts with the header and then footer.
    
    Does that mean the PDF was originally created with first printing the transaction table(2nd section) then header (1st section) and finally footer (section 3)?

[B]
The original PDF had trailer and info metadata when I checked with VIM.
When I used the sejda.com editor to edit the amount column and generated the tampered PDF, its trailer and Info metadata have disappeared when checked with VIM.

  In this case it is not even possible to get the modification date from the Info metadata.
  Any guidance on this?

[C]
When I edited an original PDF using ilovepdf.com, the edited text should have been at the bottom of the span json, as you highlighted previously.
But in my case the edited text is not coming at the end. It's coming exactly at the place where it should be based on top_start.
So the understanding you provided regarding the span json is not generalizing to all edited PDFs.
Any thoughts on this?

Thanks.

JorjMcKie Feb 10, 2021
Maintainer

Does that mean the PDF was originally created with first printing the transaction table(2nd section) then header (1st section) and finally footer (section 3)?

This indeed reflects the sequence in which these items have been inserted in the PDF. Very normal procedure. Also a very normal difficulty when you want to see extracted text in "natural" reading sequence: you must first apply some sorting logic.
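
One possible sorting logic, given only as a rough sketch: order the `(bbox, text)` tuples from the span list shown earlier top-to-bottom, then left-to-right. The function name `sort_spans_for_reading` and the `y_tolerance` parameter are assumptions made for this example, and multi-column layouts would need more than this.

```python
def sort_spans_for_reading(spans, y_tolerance=2.0):
    """Sort (bbox, text) spans top-to-bottom, then left-to-right.

    bbox is (x0, y0, x1, y1). Spans whose bottom edge lies within
    y_tolerance points of the first span of a line are put on that line.
    """
    ordered = sorted(spans, key=lambda s: (s[0][3], s[0][0]))
    lines = []  # each entry: [bottom of first span, list of spans on the line]
    for span in ordered:
        bottom = span[0][3]
        if lines and abs(bottom - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(span)
        else:
            lines.append([bottom, [span]])
    result = []
    for _, line_spans in lines:
        # Within one line, order strictly by the left edge.
        result.extend(sorted(line_spans, key=lambda s: s[0][0]))
    return result
```

The tolerance is there because spans on the same visual line often have slightly different bottom coordinates.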

its trailer and Info metadata have disappeared

No they have not disappeared. They were just put in some compressed object by the sejda.com editor. Using PyMuPDF you can still see those data, because it is able to decompress those items.

in my case the edited text is not coming at the end

Well, that may happen, too.

Let me stress again:
Just by looking at one given file, there is no way on earth to confirm it has been tampered.
I have enumerated a handful of indicators that may raise suspicions - not more. As you wrote: all of them may also be caused by technical changes or optimizations.
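
To make the "indicators, not proof" idea concrete, here is a hedged sketch that collects the hints discussed in this thread into a list. `collect_indicators` is a hypothetical helper. It expects a metadata dictionary like `doc.metadata` and the trailer text like the output of `doc.pdf_trailer()`, and it deliberately returns hints rather than a verdict.

```python
import re


def collect_indicators(metadata, trailer_text):
    """Return a list of human-readable hints. An empty list proves nothing."""
    hints = []
    created = metadata.get("creationDate") or ""
    modified = metadata.get("modDate") or ""
    if created and modified and created != modified:
        hints.append("modDate differs from creationDate")
    # Only the hexadecimal <...> notation of /ID is handled here.
    ids = re.search(r"/ID\s*\[\s*<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>", trailer_text)
    if ids and ids.group(1).upper() != ids.group(2).upper():
        hints.append("trailer /ID entries differ")
    if re.search(r"/Prev\b", trailer_text):
        hints.append("trailer has /Prev (incremental update)")
    return hints
```

Note that for the original file from this thread the function would report differing trailer IDs, and for the edited file only the changed modDate. This illustrates why none of these hints can be taken as proof on its own.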

AbhishekTanksali Feb 10, 2021
Author

Thank you Sir for your guidance. Really helpful.
Grateful to you!

Regards,
Abhishek

JorjMcKie · 2021-02-09T11:59:39Z

Maintainer

Probably needless to say that things may be a lot more complicated than in your case.

You probably want to know where a change occurred, or the text may be identical but may have changed its position on the page.
- Approach: use a more sophisticated text extraction, one that also returns the coordinates with each text piece, like page.get_text("dict"). See next post for an example way.

Images may have changed, or may have changed their position.
- Approach: extract images and / or image rectangles. This is also possible with PyMuPDF; I am sparing the details here.

0 replies

JorjMcKie · 2021-02-09T12:37:39Z

Maintainer

Using a text extraction that also yields text positions, something like the following analysis is possible:
I extracted contiguous text pieces ("spans") from both files and sorted both lists by their position on the page. Then I compared the lists span by span and, where they differed, printed the span number, the old text and the new text, each with its starting position.
This was the output:

Item 38 is different:
2,412.00 ==> at: 442.7 / 425.8
11,33,466.65 ==> at: 504.5 / 425.8
--------------------
Item 39 is different:
11,33,466.65 ==> at: 504.5 / 425.8
12.00 ==> at: 443.0 / 426.3
--------------------

You see that you not only changed the amount from "2,412.00" to "12.00", but also slightly changed the bottom position, from 425.8 to 426.3.
This causes the new amount to appear after the span "11,33,466.65" in the sorted list.
With an exact positioning of the fake amount "12.00", only one different span would have been reported.
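
The script used for this output is not shown in the thread, so the following is only a sketch of how such a comparison might look. It assumes span lists of `(bbox, text)` tuples as produced by the extraction loop in the earlier reply, sorts by bottom edge and then left edge, and compares texts pairwise. The names `sort_key` and `compare_spans` are made up.

```python
def sort_key(span):
    """Sort by the bottom edge first, then by the left edge of the bbox."""
    bbox, _text = span
    return (round(bbox[3], 1), round(bbox[0], 1))


def compare_spans(old_spans, new_spans):
    """Return (index, old_span, new_span) for each pair whose text differs.

    Both lists are sorted by position first. Surplus spans in the longer
    list are ignored by zip() and would need separate handling.
    """
    old_sorted = sorted(old_spans, key=sort_key)
    new_sorted = sorted(new_spans, key=sort_key)
    diffs = []
    for i, (old, new) in enumerate(zip(old_sorted, new_sorted)):
        if old[1] != new[1]:
            diffs.append((i, old, new))
    return diffs


def report(diffs):
    """Print the differences in a layout similar to the output above."""
    for i, (old_box, old_text), (new_box, new_text) in diffs:
        print(f"Item {i} is different:")
        print(f"{old_text} ==> at: {old_box[0]:.1f} / {old_box[3]:.1f}")
        print(f"{new_text} ==> at: {new_box[0]:.1f} / {new_box[3]:.1f}")
        print("-" * 20)
```

Comparing by index is simple but fragile: a single shifted or inserted span makes every following pair look different, as the two reported items above show.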

0 replies
