Extract elements with metadata #46

Open
tcoca27 opened this issue Jan 22, 2025 · 3 comments

Comments

@tcoca27

tcoca27 commented Jan 22, 2025

First off, the benchmarks look amazing!

I didn't see it in the docs, but is it possible to extract a list of elements from a document, with metadata such as the page number and bounding box of each element?

@cmendoza-cs

I'm wondering about this as well

@cpursley

Page # would be huge

@nmammeri
Contributor

Hi all, yes, we do support extracting some structured content when using the extract-to-XML feature. I agree it's not very clear in the docs. We are working on that 🫡

Here is an example in Python:

#!/usr/bin/env python3

from extractous import Extractor

file_path = "dataset/sec10-filings/2022_Q3_AAPL.pdf"
# Enable XML output so the extracted text comes back as XHTML instead of plain text
result, metadata = Extractor().set_xml_output(True).extract_file_to_string(file_path)
print(result)

The output is XHTML. Most of the text is in <p> elements, all enclosed in per-page <div class="page"> elements, something like:

<body>
        <div class="page">
            <p/>
            <p>UNITED STATES
                SECURITIES AND EXCHANGE COMMISSION
            </p>
            <p>Washington, D.C. 20549
            </p>
            <p>FORM 10-Q
                (Mark One)
            </p>
....
        <div class="page">
            <p/>
            <p>If an emerging growth company, indicate by check mark if the Registrant has elected not to use the
                extended transition period for complying with any new or revised financial
                accounting standards provided pursuant to Section 13(a) of the Exchange Act. ☐
            </p>
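
To get from that XHTML to a list of elements with page numbers, one option is to walk the markup and count the page divs. The following is a minimal sketch, assuming the output always has the shape shown above (one `<div class="page">` per page, text in `<p>` elements). The names `PageParagraphParser` and `paragraphs_by_page` are hypothetical helpers, not part of extractous, and the sample string stands in for a real `result`. The excerpt above carries no coordinates, so this recovers page numbers only, not bounding boxes.

```python
from html.parser import HTMLParser


class PageParagraphParser(HTMLParser):
    """Collect (page_number, paragraph_text) pairs from page-wrapped XHTML."""

    def __init__(self):
        super().__init__()
        self.page = 0         # 1-based index of the current <div class="page">
        self.in_p = False     # True while inside a <p> element
        self.chunks = []      # text fragments of the current paragraph
        self.paragraphs = []  # collected (page, text) results

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "page" in classes:
            self.page += 1
        elif tag == "p":
            self.in_p = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            # Collapse the line breaks and indentation inside the paragraph
            text = " ".join("".join(self.chunks).split())
            if text:  # skip empty paragraphs such as <p/>
                self.paragraphs.append((self.page, text))
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def paragraphs_by_page(xhtml):
    """Hypothetical helper: return a list of (page_number, text) tuples."""
    parser = PageParagraphParser()
    parser.feed(xhtml)
    parser.close()
    return parser.paragraphs


# Stand-in for the `result` string returned by the extractor
sample = """
<body>
  <div class="page">
    <p/>
    <p>UNITED STATES
       SECURITIES AND EXCHANGE COMMISSION
    </p>
    <p>FORM 10-Q</p>
  </div>
  <div class="page">
    <p>Washington, D.C. 20549
    </p>
  </div>
</body>
"""

for page, text in paragraphs_by_page(sample):
    print(page, text)
```

The standard-library `html.parser` is used here because it tolerates fragments and self-closing tags; a strict XML parser would also work if the output is always well-formed.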

On top of that, the extracted metadata gives you more information about the document, such as page count, author details, etc. These depend on the file type and on the file being parsed. Examples of metadata are 2022_Q3_AAPL.pdf, science-exploration-1p.pptx, or the output from the above example:

{'access_permission:fill_in_form': ['true'], 'pdf:totalUnmappedUnicodeChars': ['0'], 'access_permission:extract_content': ['true'], 'pdf:hasMarkedContent': ['false'], 
  'access_permission:modify_annotations': ['true'], 'pdf:docinfo:creator_tool': ['EDGAR Filing HTML Converter'], 'pdf:annotationTypes': ['null'], 
  'dcterms:created': ['2022-07-29T10:03:21Z'], 'Content-Length': ['266240'], 'access_permission:can_print_degraded': ['true'], 
  'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 
  'pdf:docinfo:keywords': ['0000320193-22-000070; ; 10-Q'], 'xmp:CreatorTool': ['EDGAR Filing HTML Converter'], 
  'pdf:producer': ['EDGRpdf Service w/ EO.Pdf 22.0.40.0'], 'pdf:containsNonEmbeddedFont': ['false'], 
  'dc:subject': ['0000320193-22-000070; ; 10-Q', 'Form 10-Q filed on 2022-07-29 for the period ending 2022-06-25'], 
  'access_permission:can_print': ['true'], 'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 
  'pdf:annotationSubtypes': ['Link'], 'dcterms:modified': ['2022-07-29T10:03:28Z'], 'access_permission:extract_for_accessibility': ['true'], 
  'pdf:docinfo:creator': ['EDGAR Online, a division of Donnelley Financial Solutions'], 'pdf:encrypted': ['true'], 
  'dc:creator': ['EDGAR Online, a division of Donnelley Financial Solutions'], 'pdf:PDFVersion': ['1.4'], 
  'pdf:hasXFA': ['false'], 'dc:title': ['0000320193-22-000070'], 
  'pdf:charsPerPage': ['2588', '465', '547', '1386', '1169', '1589', '1392', '1987', '2581', '2380', '2974', '3220', '2881', '2534', '2509', '1335', '4495', '3473', '3862', '2766', '2932', '4855', '3842', '1473', '332', '3220', '3233', '1653'], 
  'pdf:overallPercentageUnmappedUnicodeChars': ['0.0'], 'access_permission:can_modify': ['true'], 'pdf:hasCollection': ['false'], 
  'pdf:unmappedUnicodeCharsPerPage': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], 
  'access_permission:assemble_document': ['true'], 'pdf:docinfo:title': ['0000320193-22-000070'], 'xmpTPg:NPages': ['28'], 
  'pdf:containsDamagedFont': ['false'], 'resourceName': ['2022_Q3_AAPL.pdf'], 
  'pdf:docinfo:created': ['2022-07-29T10:03:21Z'], 'pdf:hasXMP': ['false'], 'pdf:num3DAnnotations': ['0'], 
  'pdf:docinfo:modified': ['2022-07-29T10:03:28Z'], 'dc:format': ['application/pdf; version=1.4'], 
  'pdf:docinfo:producer': ['EDGRpdf Service w/ EO.Pdf 22.0.40.0'], 'Content-Type': ['application/pdf'], 
  'pdf:docinfo:subject': ['Form 10-Q filed on 2022-07-29 for the period ending 2022-06-25']}
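
Every value in that dictionary is a list of strings, so numeric fields need converting before use. Below is a minimal sketch that reads the page count and the per-page character counts, assuming the keys `xmpTPg:NPages` and `pdf:charsPerPage` are present as in the output above. Other file types may not provide them, so both helpers (`first_int` and `chars_per_page`, hypothetical names) tolerate missing keys. The `metadata` literal is copied from the output above rather than produced by a live call.

```python
def first_int(metadata, key):
    """Return the first value stored under key as an int, or None if absent."""
    values = metadata.get(key) or []
    return int(values[0]) if values else None


def chars_per_page(metadata):
    """Map 1-based page numbers to character counts (empty dict if absent)."""
    counts = metadata.get("pdf:charsPerPage", [])
    return {page: int(n) for page, n in enumerate(counts, start=1)}


# Subset of the metadata shown above; values are lists of strings
metadata = {
    "xmpTPg:NPages": ["28"],
    "pdf:charsPerPage": [
        "2588", "465", "547", "1386", "1169", "1589", "1392",
        "1987", "2581", "2380", "2974", "3220", "2881", "2534",
        "2509", "1335", "4495", "3473", "3862", "2766", "2932",
        "4855", "3842", "1473", "332", "3220", "3233", "1653",
    ],
}

page_count = first_int(metadata, "xmpTPg:NPages")
per_page = chars_per_page(metadata)
print(page_count, len(per_page), per_page[1])
```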

Hope this helps
