Extract elements with metadata #46
First off, the benchmarks look amazing!

I didn't see it in the docs, but is it possible to extract a list of elements from a document, with metadata like the page number and bounding box of each element?

Comments
I'm wondering about this as well.

Page # would be huge!
Hi all, yes, we do support extracting some structured content when using the extract-to-XML feature. I agree it's not very clear in the docs. We are working on that 🫡. Here is an example in Python:

```python
#!/usr/bin/env python3
from extractous import Extractor

file_path = "dataset/sec10-filings/2022_Q3_AAPL.pdf"
result, metadata = Extractor().set_xml_output(True).extract_file_to_string(file_path)
print(result)
```

The output will be XHTML; most of the text will be in `<p>` tags, all enclosed in `<div class="page">` page divs, something like:

```xml
<body>
<div class="page">
<p/>
<p>UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
</p>
<p>Washington, D.C. 20549
</p>
<p>FORM 10-Q
(Mark One)
</p>
....
<div class="page">
<p/>
<p>If an emerging growth company, indicate by check mark if the Registrant has elected not to use the
extended transition period for complying with any new or revised financial
accounting standards provided pursuant to Section 13(a) of the Exchange Act. ☐
</p>
```
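Since every page of the document becomes its own `<div class="page">` in this output, a page number per element can be recovered by walking the XHTML. Below is a minimal sketch (not part of the extractous API) that counts page divs with Python's standard-library HTML parser and tags each `<p>` with the page it appears on; the file path and calls follow the example above, and note that the sample output shows page structure but no bounding boxes.

```python
#!/usr/bin/env python3
# Hypothetical sketch: derive a page number for each <p> element by counting
# <div class="page"> wrappers in the XHTML returned with set_xml_output(True).
from html.parser import HTMLParser

from extractous import Extractor


class PageTextCollector(HTMLParser):
    """Collects (page_number, text) pairs from the XHTML output shown above."""

    def __init__(self):
        super().__init__()
        self.page = 0           # current page, incremented on every page div
        self.in_paragraph = False
        self.elements = []      # list of [page_number, accumulated_text]

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "page") in attrs:
            self.page += 1
        elif tag == "p":
            self.in_paragraph = True
            self.elements.append([self.page, ""])

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.elements:
            self.elements[-1][1] += data


result, metadata = Extractor().set_xml_output(True).extract_file_to_string(
    "dataset/sec10-filings/2022_Q3_AAPL.pdf"
)
collector = PageTextCollector()
collector.feed(result)
for page, text in collector.elements:
    if text.strip():
        print(f"page {page}: {text.strip()[:80]}")
```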
On top of that, the extracted metadata gives you more information about the document, such as page count, author details, etc. These depend on the file type and the file being parsed. Examples of metadata are 2022_Q3_AAPL.pdf and science-exploration-1p.pptx, or the metadata returned by the example above.
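As a quick way to see what is available for a given file, you can print the metadata returned alongside the text. This is a small sketch assuming the metadata object behaves like a dict-style mapping, as the tuple-unpacking in the example above suggests.

```python
#!/usr/bin/env python3
# Small sketch: print whatever metadata extractous returns for a file.
# The available keys depend on the file type and the parser; we assume the
# metadata object is dict-like rather than listing specific fields.
from extractous import Extractor

file_path = "dataset/sec10-filings/2022_Q3_AAPL.pdf"
result, metadata = Extractor().set_xml_output(True).extract_file_to_string(file_path)

for key, value in sorted(metadata.items()):
    print(f"{key}: {value}")
```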
Hope this helps.