Interacting with ImageCat

Chris Mattmann edited this page Apr 22, 2018 · 1 revision

Observing what's going on

ImageCat runs two Solr deployments and a full-stack OODT deployment.

The recommended way to see what's going on is to check the OPSUI and then periodically examine $OODT_HOME/data/jobs/crawl/*/logs, where the ingest-into-SolrCell jobs execute. By default, ImageCat uses 8 ingest processes, so up to 8 parallel ingests into SolrCell can run at a time, with 24 jobs on deck in the Resource Manager waiting to get in.
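Alongside the OPSUI, a small script can make periodic log inspection easier. The sketch below (my own helper, not part of ImageCat) lists the job log files under $OODT_HOME, newest first, assuming the directory layout described above:

```python
import glob
import os


def list_job_logs(oodt_home):
    """Return log file paths under <oodt_home>/data/jobs/crawl/*/logs,
    newest first, so the most recently active ingest jobs come first."""
    pattern = os.path.join(oodt_home, "data", "jobs", "crawl", "*", "logs", "*")
    logs = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    return sorted(logs, key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    # Assumes OODT_HOME is exported in the environment, as elsewhere on this page.
    for path in list_job_logs(os.environ.get("OODT_HOME", ".")):
        print(path)
```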

Each directory in $OODT_HOME/data/jobs/crawl/ is a fully detached job that can be executed independently of OODT; each one ingests 50K image files into SolrCell and performs Tesseract OCR and EXIF metadata extraction.

Note that sometimes images will fail to ingest; the Solr Tomcat logs will show a message such as:

```
INFO: on.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.jpeg.JpegParser@5c0bae4a
OUTPUT:         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
OUTPUT:         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
OUTPUT:         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
OUTPUT:         at org.apache.solr.core.RequestHandler
Apr 15, 2015 9:18:29 PM org.apache.oodt.commons.io.LoggerOutputStream flush
```

This is normal: the JpegParser will sometimes fail to parse an image.
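To get a rough count of how many images failed this way, a small sketch like the following can tally TikaException lines from a saved Tomcat log. The function name and regular expression are my own, inferred from the log format above; they are not part of ImageCat:

```python
import re


def count_parser_failures(log_text):
    """Tally TikaException occurrences per Tika parser class in a log dump,
    e.g. {"org.apache.tika.parser.jpeg.JpegParser": 3}."""
    failures = {}
    # Matches lines like:
    #   TikaException: TIKA-198: Illegal IOException from o.a.t.parser.jpeg.JpegParser@5c0bae4a
    for match in re.finditer(r"TikaException: TIKA-\d+: .*?from ([\w.$]+)@", log_text):
        parser = match.group(1)
        failures[parser] = failures.get(parser, 0) + 1
    return failures
```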

Chunk Files

The overall workflow is as follows:

  1. OODT starts with the original large file that contains full file paths. It then chunks this file into roughly sizeof(file) / ChunkSize files, where ChunkSize is the urn:id:memex:Chunker/ChunkSize property in $OODT_HOME/workflow/policy/tasks.xml.

  2. Each resultant ChunkFile is then ingested into OODT by the OODT crawler, which triggers the OODT workflow manager to process a job called IngestInPlace.

  3. Each IngestInPlace job grabs its ingested ChunkFile (stored in $OODT_HOME/data/archive/chunks/) and then runs it through $OODT_HOME/bin/solrcell_ingest which sends the 50k full file paths to http://localhost:8081/solr/imagecatdev/extract (the ExtractingRequestHandler).

  4. 8 IngestInPlace jobs can run at a time.

  5. You can watch http://localhost:8081/solr/imagecatdev build up while ingestion is running. Documents appear in bursts because $OODT_HOME/bin/solrcell_ingest ingests all 50K files in memory and then sends a single commit at the end for efficiency (so roughly 50K × 8 files show up every ~30-40 minutes).
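The chunk-then-ingest steps above can be sketched in Python. Here `chunk_paths` mirrors the Chunker step and `ingest_chunk` approximates what $OODT_HOME/bin/solrcell_ingest does against the ExtractingRequestHandler; the function names and request parameters are assumptions for illustration, not ImageCat's actual code:

```python
CHUNK_SIZE = 50000  # stands in for urn:id:memex:Chunker/ChunkSize in tasks.xml


def chunk_paths(paths, chunk_size=CHUNK_SIZE):
    """Split the master list of full file paths into ChunkFile-sized pieces,
    as the Chunker task does with the original large path file."""
    return [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]


def ingest_chunk(paths, extract_url="http://localhost:8081/solr/imagecatdev/extract"):
    """Rough analogue of solrcell_ingest: POST each file to the
    ExtractingRequestHandler, then issue a single commit at the end.
    Needs the third-party `requests` package and a running Solr."""
    import requests

    for path in paths:
        with open(path, "rb") as f:
            # literal.id is an assumed choice of unique key field here.
            requests.post(extract_url, params={"literal.id": path}, files={"file": f})
    # One commit after the whole chunk, matching the batching described above.
    requests.get(extract_url.rsplit("/", 1)[0] + "/update", params={"commit": "true"})
```

Because the commit happens once per 50K-path chunk, the index visibly grows in large jumps rather than document by document.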
