Supported Formats

Matthew Caruana Galizia edited this page Jan 7, 2017 · 2 revisions

Extract is built on top of Apache Tika and supports the same formats.

However, it bundles some additional components that allow it to support formats that Tika alone does not:

  • JBIG2 ImageIO and JAI Image I/O Tools Core for reading JBIG2 and JPEG 2000 (JPX) files embedded in PDF files, as required by PDFBox.

Recursive Extraction

Text is always extracted recursively, so that text from embedded files is concatenated into the stream of text from the parent document, even when spawning is chosen as the embed handling mode.
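
The idea of recursive concatenation can be illustrated with a minimal sketch. This is not Extract's or Tika's API: the `Document` class and `extract_text` function are hypothetical names invented for this example, and joining parts with a newline is an assumption about how the stream is assembled. It only shows how text from embedded files, at any depth, ends up in the parent's text stream.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Hypothetical stand-in for a parsed file; not part of Extract or Tika."""
    name: str
    text: str
    embedded: List["Document"] = field(default_factory=list)


def extract_text(doc: Document) -> str:
    """Return the document's own text followed by the text of each embedded file, depth-first."""
    # Skip empty text (for example, a container with no text of its own).
    parts = [doc.text] if doc.text else []
    for child in doc.embedded:
        child_text = extract_text(child)
        if child_text:
            parts.append(child_text)
    return "\n".join(parts)


# An email with a PDF attachment and an archive that itself contains a text file.
attachment = Document("scan.pdf", "text of the attached PDF")
archive = Document("files.zip", "", [Document("notes.txt", "text of the notes file")])
email = Document("message.eml", "body of the email", [attachment, archive])

print(extract_text(email))
```

In this sketch the text of `notes.txt` appears in the output for `message.eml` even though it sits two levels down, which mirrors the behaviour described above.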
