Supported Formats

Extract is built on top of Apache Tika and supports the same formats.

However, it bundles some additional components that allow it to support formats that Tika does not support when used alone:

JBIG2 ImageIO and JAI Image I/O Tools Core for reading JBIG2 and JPEG 2000 (JPX) files embedded in PDF files, as required by PDFBox.

Recursive Extraction

Text is always extracted recursively, so that text from embedded files is concatenated into the stream of text from the parent document, even when spawning is chosen as the embed handling mode.
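
The behaviour described above can be pictured with a small, self-contained sketch. This is not Extract's or Tika's actual API: the `Doc` record, the `extract` methods and the depth-first ordering are all assumptions made purely for illustration (records need Java 16 or later). It only shows the idea of text from embedded files ending up in the same stream as the text of the parent document.

```java
import java.util.List;

public class RecursiveExtractionSketch {

    // Hypothetical model of a document: its own text plus any embedded files.
    record Doc(String text, List<Doc> embedded) {}

    // Append the document's own text, then descend into each embedded file,
    // so that everything lands in one stream (depth-first order is assumed).
    static void extract(Doc doc, StringBuilder out) {
        out.append(doc.text()).append('\n');
        for (Doc child : doc.embedded()) {
            extract(child, out);
        }
    }

    // Convenience wrapper that returns the concatenated text as a string.
    static String extract(Doc doc) {
        StringBuilder out = new StringBuilder();
        extract(doc, out);
        return out.toString();
    }

    public static void main(String[] args) {
        // An email with one attachment, which itself embeds an image caption.
        Doc caption = new Doc("caption of embedded image", List.of());
        Doc attachment = new Doc("text of attached PDF", List.of(caption));
        Doc email = new Doc("body of email", List.of(attachment));
        System.out.print(extract(email));
    }
}
```

In this sketch a reader of the output sees a single stream for the top-level document, with the text of each nested file following the text of its parent.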
