Workflows
The two commands at the core of Extract are `queue` and `spew`. The former will recursively scan a given path and queue all the files it finds (with pattern exceptions possible) in a distributed queue. The latter will either pull files from a distributed queue or scan a path in a separate thread, using its own internal queue, and spew out text and metadata.
If you're only processing a few thousand files, then running a single instance of Extract without a queue is sufficient:
extract spew -r redis -o file --outputDirectory /path/to/text /path/to/files
The `-r` parameter is used to tell Extract to save the result of each file processed to Redis. In this way, if you have to stop the process, you can resume where you left off, as successfully processed files will be skipped.
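If you're curious about what has been recorded, you can inspect the report in Redis directly with `redis-cli`. The key names depend on how Extract namespaces its reports, so treat this as a rough sketch rather than exact output:

# List keys in the local Redis instance; the report structure will be among them.
redis-cli --scan --pattern '*' | head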
You'll probably want to do something more useful with extracted text than save to disk. In that case, you can get Extract to write to a Solr endpoint:
extract spew -r redis -o solr -s http://solr-1:8983/solr/my_core /path/to/files
When spewing is done, trigger a commit so that results will show up in Solr:
extract commit -s http://solr-1:8983/solr/my_core
Or rollback the changes:
extract rollback -s http://solr-1:8983/solr/my_core
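Solr also accepts commit and rollback requests directly on its update handler, so if you ever need to trigger them without Extract, something like this should work:

# Commit pending documents so they become visible to searches.
curl 'http://solr-1:8983/solr/my_core/update' -H 'Content-Type: text/xml' -d '<commit/>'
# Or discard uncommitted changes.
curl 'http://solr-1:8983/solr/my_core/update' -H 'Content-Type: text/xml' -d '<rollback/>'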
This is the workflow we use at ICIJ for processing millions of files in parallel. The `--queueName` parameter is used to namespace the job and avoid conflicts with unrelated jobs using the same Redis server.
First, queue the files from your directory. For best performance, you should probably run this directly on the machine that the volume containing the files is connected to and not over the network:
cd /mnt/my_files
extract queue --queueName job-1 --redisAddress redis-1:6379 ./ 2> queue.log
You will be running Extract processes on many different machines, so you should export your file directory as an NFS or other kind of network share. After that, mount the share at the same path on each of your extraction cluster machines.
With NFS, this would be done in the following way (where `nfs-1` is the hostname of your file server):
sudo mkdir /mnt/my_files
sudo mount -t nfs4 -o ro,proto=tcp,port=2049 nfs-1:/my_files /mnt/my_files
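Before starting any workers, it's worth checking that the share is mounted and readable at the same path on each machine:

# Confirm the mount and take a quick look at the files.
mount | grep /mnt/my_files
ls /mnt/my_files | head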
You can then start processing the queue on each of your machines:
cd /mnt/my_files
extract spew --queueName job-1 -q redis -o solr -s http://solr-1:8983/solr/my_core -r redis --redisAddress redis-1:6379 2> extract.log
In the last step, we instruct Extract to use the queue from Redis, to output extracted text to Solr (`-o solr`) at the given address, and to report results to Redis (`-r redis`).
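To keep an eye on progress, you can tail the log on each machine and ask Solr for a document count. Bear in mind that documents only become visible to queries after a commit (or a soft or automatic commit, if your Solr is configured for one):

# Watch for errors as files are processed.
tail -f extract.log
# Count the documents visible in the index so far.
curl 'http://solr-1:8983/solr/my_core/select?q=*:*&rows=0'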
It's possible to dump a queue or report to a backup file in case you need to restore either later on.
extract dump-queue --queueName suspicious-files --redisAddress redis-1:6379 queue.json
extract dump-report --reportName suspicious-files --redisAddress redis-1:6379 report.json
Restoring is simple:
extract load-queue --queueName suspicious-files --redisAddress redis-1:6379 queue.json
extract load-report --reportName suspicious-files --redisAddress redis-1:6379 report.json
It's also possible to use I/O redirection if your command line environment supports it. For example, the above commands could be rewritten as:
extract dump-queue --queueName suspicious-files --redisAddress redis-1:6379 > queue.json
extract wipe-queue --queueName suspicious-files --redisAddress redis-1:6379
extract load-queue --queueName suspicious-files --redisAddress redis-1:6379 < queue.json
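Before wiping a queue, it's sensible to check that the dump file contains what you expect. Assuming the dump is JSON and you have `jq` installed, a quick check might look like this:

# Peek at the start of the dump, then count the entries if it's a JSON array.
head -c 500 queue.json; echo
jq 'length' queue.json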
You might have made a mistake in your original schema and now need to change the type of a field, or change the way it's tokenised. You can edit the schema and make as many changes as you like, but the original data will still be stored and indexed as specified in the old schema.
There are two ways you can work around this: reindex all your files again, or use the `solr-copy` command, which pulls the fields you specify from each document and adds them back to the same document, forcing reindexing.
A common example is when you change a string field to a Trie number field after indexing. Solr will then return an error message in place of these fields. To fix them automatically, run `solr-copy`, filtering on the bad field.
extract copy -f "my_numeric_field:* AND -my_numeric_field:[0 TO *]" -s ...
This will cause the copy command to run only on those documents that have a non-numeric value in the number-type field.
Internally, Extract will perform an atomic update of the specified fields only, without causing the other document fields to be lost.
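For reference, a Solr atomic update of a single field looks roughly like the following; the document id and value here are made up, and the `set` modifier replaces just that field while the rest of the stored document is preserved:

# Hypothetical atomic update: only my_numeric_field is replaced on this document.
curl -X POST -H 'Content-Type: application/json' 'http://solr-1:8983/solr/my_core/update' -d '[{"id": "some-document-id", "my_numeric_field": {"set": 42}}]'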