Is there any way that I can identify whether the PDF is edited/tampered and the exact location where the PDF is edited/tampered using Python? #892
Hi folks, I am working on identifying forgery/tampering in bank statement PDF documents. Info metadata and XMP metadata are not always present in the PDFs that I have, so I am not able to create a generalized rule to identify tampered PDFs. I am using Python libraries such as PyMuPDF, PDFMiner, PyPDF2, etc.

I have 2 questions: Is there any concrete way to identify whether the PDF is tampered (using PyMuPDF/Python/any other open source technology)?

Original: "sbi statment_out2.pdf", link: https://drive.google.com/file/d/1DoWAKYcCudRO-Cwjbgf7RjiJUsF3DD3s/view?usp=sharing
Tampered using the Sejda online editor: "sbi statment_out2_Sejda_edited.pdf", link: https://drive.google.com/file/d/1J4eRy9tO3jN8AqEWNrKXtn40G6vdH5G3/view?usp=sharing

In the tampered PDF, I have edited '2,412.00' under the 'Credit' column to '12.00'. Kindly let me know if there is any open source solution, preferably in Python. Thanks.
Replies: 3 comments 8 replies
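In your example cases, the answer to the question "Are the PDFs equal?" is simple. You can use the file creation / modification timestamps to answer this, and / or the file sizes. It is best to use the Python built-in module `os` here.

On the PDF level, you can look at some core indicators:

>>> import fitz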
>>> doc1 = fitz.open("sbi statment_out2.pdf")
>>> doc2 = fitz.open("sbi statment_out2_Sejda_edited.pdf")
>>> from pprint import pprint
>>> pprint(doc1.metadata)
{'author': '',
'creationDate': "D:20200911140637+05'30'",
'creator': '',
'encryption': None,
'format': 'PDF 1.4',
'keywords': '',
'modDate': "D:20200911140637+05'30'",
'producer': 'iText 2.0.4 (by lowagie.com)',
'subject': '',
'title': '',
'trapped': ''}
>>> pprint(doc2.metadata)
{'author': '',
'creationDate': "D:20200911140637+05'30'",
'creator': 'sejda.com (4.1.7)',
'encryption': None,
'format': 'PDF 1.5',
'keywords': '',
'modDate': "D:20210209042635+01'00'",
'producer': 'SAMBox 2.2.12',
'subject': '',
'title': '',
'trapped': ''}

As you can see, several metadata fields (creator, PDF format version, modification date, producer) were changed when you created the modified file. The PDF trailer contains an `/ID` array of two items:

>>> print(doc1.pdf_trailer())
<<
/Size 23
/Info 10 0 R
/Root 9 0 R
/ID [ <972B505462C2D50EFA775DC2D33B032D> <F5C4CAA260BAF2A9A812728CBBE31A03> ]
>>
>>> print(doc2.pdf_trailer())
<<
/Size 25
/Info 3 0 R
/Root 2 0 R
/ID [ <44C24EDE24FE0529FCEC5A78BD6CBCE9> <44C24EDE24FE0529FCEC5A78BD6CBCE9> ]
/Type /XRef
/Index [ 0 25 ]
/W [ 1 2 2 ]
/DL 125
/Filter /FlateDecode
/Length 85
>>
>>>

You see (again) that your modification created a completely new file, which therefore has a different first `/ID` item than the original. So much for the question "Have changes occurred at all?".
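The checks from this reply can be collected into a small script. This is only a minimal sketch: `changed_keys`, `trailer_ids` and `compare_pdfs` are hypothetical helper names, not PyMuPDF functions, and the regular expression assumes the `/ID` items are written as hex strings, as in the trailers printed here. A difference only shows that the two files are not identical; it does not by itself prove tampering.

```python
import os
import re

# Matches a trailer /ID array written as two hex strings, e.g. /ID [ <AB12> <CD34> ].
# Assumption: literal-string IDs (in parentheses) are not handled.
ID_RE = re.compile(r"/ID\s*\[\s*<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>\s*\]")


def changed_keys(meta_a, meta_b):
    """Return the sorted metadata keys whose values differ between two dicts."""
    keys = set(meta_a) | set(meta_b)
    return sorted(k for k in keys if meta_a.get(k) != meta_b.get(k))


def trailer_ids(trailer):
    """Return the two /ID items from a trailer string, or None if absent."""
    match = ID_RE.search(trailer or "")
    return match.groups() if match else None


def compare_pdfs(path_a, path_b):
    """Collect file-level and PDF-level differences (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc_a, doc_b = fitz.open(path_a), fitz.open(path_b)
    return {
        "sizes": (os.path.getsize(path_a), os.path.getsize(path_b)),
        "changed_metadata": changed_keys(doc_a.metadata, doc_b.metadata),
        "ids": (trailer_ids(doc_a.pdf_trailer()), trailer_ids(doc_b.pdf_trailer())),
    }
```

Calling `compare_pdfs("sbi statment_out2.pdf", "sbi statment_out2_Sejda_edited.pdf")` would return the file sizes, the differing metadata keys and both `/ID` pairs in one dictionary. For the two metadata dictionaries printed in this reply, `changed_keys` reports `creator`, `format`, `modDate` and `producer`.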
It probably goes without saying that things can be a lot more complicated than in your case.
Using text extraction that also yields text positions, you can compare the two files word by word and see what changed where.
You can see that you not only changed the amount from "2,412.00" to "12.00", but also slightly shifted its bottom position, from 425.8 to 426.3.
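A position-aware comparison of this kind could look roughly as follows. This is a sketch, not the analysis used in the reply: `word_key`, `diff_words` and `diff_pdf_words` are hypothetical names, and it assumes both files have the same page order and that rounding coordinates to one decimal is a good enough tolerance. It relies on PyMuPDF's `page.get_text("words")`, which yields tuples starting with `(x0, y0, x1, y1, word, ...)`.

```python
def word_key(word, ndigits=1):
    """Reduce a PyMuPDF word tuple to (text, x0, y0, x1, y1), rounded."""
    x0, y0, x1, y1, text = word[:5]
    return (text, round(x0, ndigits), round(y0, ndigits),
            round(x1, ndigits), round(y1, ndigits))


def diff_words(words_a, words_b, ndigits=1):
    """Return (only_in_a, only_in_b): words whose text or position differs."""
    keys_a = {word_key(w, ndigits) for w in words_a}
    keys_b = {word_key(w, ndigits) for w in words_b}

    def by_position(key):
        # Sort top to bottom, then left to right.
        return (key[2], key[1])

    return (sorted(keys_a - keys_b, key=by_position),
            sorted(keys_b - keys_a, key=by_position))


def diff_pdf_words(path_a, path_b):
    """Compare two PDFs page by page; returns {page_number: (only_a, only_b)}."""
    import fitz  # PyMuPDF; imported here so diff_words works without it

    doc_a, doc_b = fitz.open(path_a), fitz.open(path_b)
    result = {}
    for pno in range(min(len(doc_a), len(doc_b))):
        only_a, only_b = diff_words(doc_a[pno].get_text("words"),
                                    doc_b[pno].get_text("words"))
        if only_a or only_b:
            result[pno] = (only_a, only_b)
    return result
```

Each reported entry carries the word and its rectangle, so an edited amount shows up once on each side together with the location where it sits on the page. Editors that rewrite the whole page layout may shift many words slightly, in which case a coarser `ndigits` or a text-only comparison may be more useful.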