
Getting Started

Matthew Caruana Galizia edited this page Nov 14, 2016 · 2 revisions

First, build Extract. We recommend copying the extract.jar file to a system-wide location and installing a small wrapper script that lets you run it from anywhere.
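One way to set up such a wrapper script is sketched below. It is only an illustration, not the project's own installer: the install directory, the jar location, and the variable names BIN_DIR and EXTRACT_JAR are all assumptions, and it expects that you have already copied extract.jar to the path in EXTRACT_JAR.

```shell
# Sketch: install a wrapper named "extract" into a directory on your PATH.
# Both paths below are assumptions; adjust them for your system.
BIN_DIR="${BIN_DIR:-$HOME/.local/bin}"
EXTRACT_JAR="${EXTRACT_JAR:-$HOME/.local/lib/extract.jar}"

mkdir -p "$BIN_DIR"

# The jar path is filled in now; \$JAVA_OPTS and \$@ stay literal so they
# are evaluated each time the wrapper runs.
cat > "$BIN_DIR/extract" <<EOF
#!/bin/sh
# JAVA_OPTS is left unquoted on purpose so it splits into separate JVM options.
exec java \$JAVA_OPTS -jar "$EXTRACT_JAR" "\$@"
EOF

chmod +x "$BIN_DIR/extract"
```

Provided the chosen directory is on your PATH, running extract from any working directory then starts the jar with whatever arguments you pass.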

You'll need to set JAVA_OPTS before running Extract. Whatever is in this environment variable is passed to the JVM. At a minimum, set the amount of memory available to it. For example:

echo "export JAVA_OPTS=\"-Xms512m -Xmx1024m\"" >> ~/.bashrc
source ~/.bashrc

From then on, Extract will start with a 512MB heap (-Xms512m) and have up to 1GB of heap memory available to it (-Xmx1024m).

Run extract -h to view a list of available commands and extract -h [command] for help on a particular command.

OCR

Remember that text will not be extracted from images (including those embedded in PDFs) unless you have Tesseract installed.
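To check ahead of time whether OCR will be possible, you can test that the tesseract binary is on your PATH. The snippet below is a minimal sketch; the function name ocr_status is hypothetical and not part of Extract.

```shell
# Hypothetical helper: reports whether the tesseract binary is on the PATH.
ocr_status() {
  if command -v tesseract >/dev/null 2>&1; then
    echo "available"
  else
    echo "missing"
  fi
}

# Print the result so it is visible before a long extraction run.
echo "Tesseract is $(ocr_status)"
```

If this reports "missing", install Tesseract before processing images or scanned PDFs.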

Workflows

There are many ways to use Extract, from a single instance to a distributed, parallel processing setup. See our Workflows page.
