Supported Formats
Matthew Caruana Galizia edited this page Jan 7, 2017 · 2 revisions
Extract is built on top of Apache Tika and supports the same formats.
However, it bundles some additional components that allow it to support formats that Tika does not support when used alone:
- JBIG2 ImageIO and JAI Image I/O Tools Core for reading JBIG2 and JPEG 2000 (JPX) files embedded in PDF files, as required by PDFBox.
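One way to check whether such image plugins are actually visible on a given classpath is to ask the standard `javax.imageio.ImageIO` registry for readers by format name. The sketch below is not part of Extract. The class name `ImageReaderCheck` is hypothetical, and the format names `"JBIG2"` and `"JPEG2000"` are assumptions about what the plugins register, so confirm them against the plugin documentation.

```java
import javax.imageio.ImageIO;

public class ImageReaderCheck {

    // Returns true if at least one ImageIO reader is registered
    // under the given format name.
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Assumed format names; each plugin registers its own names,
        // so these may need adjusting.
        for (String name : new String[] {"JBIG2", "JPEG2000"}) {
            System.out.println(name + " reader available: " + hasReader(name));
        }
    }
}
```

If a reader is reported as unavailable, the corresponding plugin JAR is probably not on the classpath.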
Text is always extracted recursively, so that text from embedded files is concatenated into the stream of text from the parent document, even when spawning is chosen as the embed handling mode.
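The idea of recursive concatenation can be illustrated with a toy model. This is a minimal sketch, not Extract's or Tika's implementation: `Doc`, `embed`, and `extract` are hypothetical names, and placing embedded text directly after the parent's own text is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class RecursiveTextSketch {

    // Hypothetical stand-in for a parsed document with embedded files.
    static final class Doc {
        final String text;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }

        // Attaches an embedded document and returns this for chaining.
        Doc embed(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    // Produces one text stream: the document's own text, followed
    // depth-first by the text of every embedded document.
    static String extract(Doc doc) {
        StringBuilder out = new StringBuilder();
        append(doc, out);
        return out.toString();
    }

    private static void append(Doc doc, StringBuilder out) {
        out.append(doc.text).append('\n');
        for (Doc child : doc.embedded) {
            append(child, out);
        }
    }

    public static void main(String[] args) {
        Doc email = new Doc("email body")
                .embed(new Doc("attached PDF").embed(new Doc("image inside PDF")));
        System.out.print(extract(email));
    }
}
```

In this model, text from a file nested two levels deep still ends up in the top-level document's stream.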