Extract elements with metadata #46

tcoca27 · 2025-01-22T13:56:32Z

First off, the benchmarks look amazing!

I didn't see it in the docs, but is it possible to extract a list of elements from a document, with metadata like the page number and bounding box of each element?

cmendoza-cs · 2025-01-30T14:23:37Z

I'm wondering about this as well

cpursley · 2025-01-31T15:44:13Z

Page # would be huge

nmammeri · 2025-01-31T16:32:15Z

Hi all, yes, we do support extracting some structured content through the extract-to-XML feature. I agree it's not very clear in the docs. We are working on that 🫡

Here is an example in Python:

#!/usr/bin/env python3

from extractous import Extractor

file_path = "dataset/sec10-filings/2022_Q3_AAPL.pdf"
result, metadata = Extractor().set_xml_output(True).extract_file_to_string(file_path)
print(result)

The output is XHTML. Most of the text will be in <p> elements, all enclosed in <div class="page"> page divs, something like:

<body>
        <div class="page">
            <p/>
            <p>UNITED STATES
                SECURITIES AND EXCHANGE COMMISSION
            </p>
            <p>Washington, D.C. 20549
            </p>
            <p>FORM 10-Q
                (Mark One)
            </p>
....
      <div class="page">
            <p/>
            <p>If an emerging growth company, indicate by check mark if the Registrant has elected not to use the
                extended transition period for complying with any new or revised financial
                accounting standards provided pursuant to Section 13(a) of the Exchange Act. ☐
            </p>
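
To get a page number per element from that output, one option is to walk the XHTML and count the page divs. The following is only a minimal sketch using Python's standard html.parser. It assumes the structure shown above (each page is a <div class="page"> and text sits in <p> elements). The class name PageParagraphParser and the inline sample string are hypothetical, made up for illustration, and are not part of extractous. Note that this output carries no bounding boxes, so the sketch covers page numbers only.

```python
from html.parser import HTMLParser


class PageParagraphParser(HTMLParser):
    """Collect paragraphs tagged with the page they appear on.

    Hypothetical helper: assumes pages are <div class="page"> and
    text lives in <p> elements, as in the sample output above.
    """

    def __init__(self):
        super().__init__()
        self.page = 0        # 1-based index of the current page div
        self.buffer = None   # text chunks of the <p> being read, else None
        self.elements = []   # list of {"page": int, "text": str}

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "page":
            self.page += 1
        elif tag == "p":
            self.buffer = []

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.buffer is not None:
            # Collapse the line breaks and indentation inside the paragraph.
            text = " ".join("".join(self.buffer).split())
            if text:  # skip empty <p/> elements
                self.elements.append({"page": self.page, "text": text})
            self.buffer = None


# Small stand-in for the `result` string returned by the extractor.
sample = """<body>
<div class="page"><p/><p>UNITED STATES
    SECURITIES AND EXCHANGE COMMISSION
</p><p>Washington, D.C. 20549
</p></div>
<div class="page"><p/><p>FORM 10-Q</p></div>
</body>"""

parser = PageParagraphParser()
parser.feed(sample)
parser.close()
elements = parser.elements

for element in elements:
    print(element["page"], element["text"])
```

In practice you would feed the `result` string from the example above instead of `sample`.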

On top of that, the extracted metadata gives you more information about the document, such as page count, author details, etc. The available fields depend on the file type and on the file being parsed. Examples of metadata are 2022_Q3_AAPL.pdf, science-exploration-1p.pptx, or the output from the above example:

{'access_permission:fill_in_form': ['true'], 'pdf:totalUnmappedUnicodeChars': ['0'], 'access_permission:extract_content': ['true'], 'pdf:hasMarkedContent': ['false'], 
  'access_permission:modify_annotations': ['true'], 'pdf:docinfo:creator_tool': ['EDGAR Filing HTML Converter'], 'pdf:annotationTypes': ['null'], 
  'dcterms:created': ['2022-07-29T10:03:21Z'], 'Content-Length': ['266240'], 'access_permission:can_print_degraded': ['true'], 
  'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 
  'pdf:docinfo:keywords': ['0000320193-22-000070; ; 10-Q'], 'xmp:CreatorTool': ['EDGAR Filing HTML Converter'], 
  'pdf:producer': ['EDGRpdf Service w/ EO.Pdf 22.0.40.0'], 'pdf:containsNonEmbeddedFont': ['false'], 
  'dc:subject': ['0000320193-22-000070; ; 10-Q', 'Form 10-Q filed on 2022-07-29 for the period ending 2022-06-25'], 
  'access_permission:can_print': ['true'], 'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 
  'pdf:annotationSubtypes': ['Link'], 'dcterms:modified': ['2022-07-29T10:03:28Z'], 'access_permission:extract_for_accessibility': ['true'], 
  'pdf:docinfo:creator': ['EDGAR Online, a division of Donnelley Financial Solutions'], 'pdf:encrypted': ['true'], 
  'dc:creator': ['EDGAR Online, a division of Donnelley Financial Solutions'], 'pdf:PDFVersion': ['1.4'], 
  'pdf:hasXFA': ['false'], 'dc:title': ['0000320193-22-000070'], 
  'pdf:charsPerPage': ['2588', '465', '547', '1386', '1169', '1589', '1392', '1987', '2581', '2380', '2974', '3220', '2881', '2534', '2509', '1335', '4495', '3473', '3862', '2766', '2932', '4855', '3842', '1473', '332', '3220', '3233', '1653'], 
  'pdf:overallPercentageUnmappedUnicodeChars': ['0.0'], 'access_permission:can_modify': ['true'], 'pdf:hasCollection': ['false'], 
  'pdf:unmappedUnicodeCharsPerPage': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], 
  'access_permission:assemble_document': ['true'], 'pdf:docinfo:title': ['0000320193-22-000070'], 'xmpTPg:NPages': ['28'], 
  'pdf:containsDamagedFont': ['false'], 'resourceName': ['2022_Q3_AAPL.pdf'], 
  'pdf:docinfo:created': ['2022-07-29T10:03:21Z'], 'pdf:hasXMP': ['false'], 'pdf:num3DAnnotations': ['0'], 
  'pdf:docinfo:modified': ['2022-07-29T10:03:28Z'], 'dc:format': ['application/pdf; version=1.4'], 
  'pdf:docinfo:producer': ['EDGRpdf Service w/ EO.Pdf 22.0.40.0'], 'Content-Type': ['application/pdf'], 
  'pdf:docinfo:subject': ['Form 10-Q filed on 2022-07-29 for the period ending 2022-06-25']}
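
As the output above shows, every metadata value is a list of strings, so numeric fields need converting before use. Here is a small sketch of reading the page count and per-page character counts. The helper name page_stats is hypothetical, and the keys used ('xmpTPg:NPages', 'pdf:charsPerPage') are simply the ones present in this PDF's metadata; other files or file types may not have them.

```python
def page_stats(metadata):
    """Return (page_count, chars_per_page) from a metadata dict.

    Hypothetical helper: missing keys fall back to 0 pages and an
    empty list, since the fields depend on the file being parsed.
    """
    page_count = int(metadata.get("xmpTPg:NPages", ["0"])[0])
    chars_per_page = [int(c) for c in metadata.get("pdf:charsPerPage", [])]
    return page_count, chars_per_page


# Subset of the metadata printed above, copied verbatim.
metadata = {
    "xmpTPg:NPages": ["28"],
    "pdf:charsPerPage": [
        "2588", "465", "547", "1386", "1169", "1589", "1392",
        "1987", "2581", "2380", "2974", "3220", "2881", "2534",
        "2509", "1335", "4495", "3473", "3862", "2766", "2932",
        "4855", "3842", "1473", "332", "3220", "3233", "1653",
    ],
}

page_count, chars_per_page = page_stats(metadata)
print(page_count, len(chars_per_page))
```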

Hope this helps
