stormcrawler-solr

A set of Apache Solr resources for StormCrawler that lets you build topologies that read URLs from a Solr collection and store metrics, status, or parsed content in Solr.

Getting started

The easiest way to get started is currently to generate a project from the Solr archetype:

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.2.0-SNAPSHOT

You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artifactId (e.g. stormcrawler), a version, a package name and details about the user agent to use.

This creates a fully formed project containing a POM with the required dependencies, along with a set of resources, configuration files and sample topology classes. Enter the directory you just created (it should have the same name as the artifactId you specified earlier) and follow the instructions in its README file.
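
Some of the prompts described above can be answered up front. The sketch below passes the standard Maven Archetype Plugin properties (groupId, artifactId, version, package) on the command line. The coordinate values and the RUN switch are illustrative assumptions, not part of the archetype. Maven should still prompt for anything that is not supplied, such as the user agent details.

```shell
#!/bin/sh
# Hypothetical project coordinates; replace them with your own.
GROUP_ID="com.mycompany.crawler"
ARTIFACT_ID="stormcrawler"
PROJECT_VERSION="1.0-SNAPSHOT"

# Assemble the command as positional parameters so that it can be
# printed for review before it is executed.
set -- mvn archetype:generate \
  -DarchetypeGroupId=org.apache.stormcrawler \
  -DarchetypeArtifactId=stormcrawler-solr-archetype \
  -DarchetypeVersion=3.2.0-SNAPSHOT \
  -DgroupId="$GROUP_ID" \
  -DartifactId="$ARTIFACT_ID" \
  -Dversion="$PROJECT_VERSION" \
  -Dpackage="$GROUP_ID"

# Show the full command line.
CMD="$*"
printf '%s\n' "$CMD"

# RUN is an assumption of this sketch: set RUN=1 to actually invoke Maven.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Run without RUN set, the script only prints the command, which is a convenient way to check the coordinates before generating the project.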

You will need both Apache Storm (2.8.0) and Apache Solr (9.8.0) installed.

Official references:

Apache Storm: Setting Up a Development Environment
Apache Solr: Installation & Deployment

Available resources

IndexerBolt: Implementation of AbstractIndexerBolt that indexes the parsed data and metadata into a specified Solr collection.
MetricsConsumer: Class that stores Storm metrics in Solr.
SolrSpout: Spout that reads URLs from a specified Solr collection.
StatusUpdaterBolt: Implementation of AbstractStatusUpdaterBolt that stores the status of each URL, along with its serialized metadata, in Solr.
