A set of Apache Solr resources for StormCrawler that let you create topologies which consume URLs from a Solr collection and store metrics, status or parsed content in Solr.
The easiest way to get started is currently to use the Maven archetype for Solr:
mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.2.0-SNAPSHOT
You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artifactId (e.g. stormcrawler), a version, a package name and details about the user agent to use.
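The project coordinates can also be passed on the command line instead of being typed at the prompts. The following is only a sketch: the function name `generate_crawler_project` and the default version `1.0-SNAPSHOT` are illustrative assumptions, and it relies on the standard Maven archetype properties `groupId`, `artifactId`, `version` and `package`. Maven should still prompt for anything not supplied here, such as the user agent details.

```shell
# Hypothetical helper: wraps the archetype command shown above and
# pre-fills the standard Maven project coordinates.
#   $1 = groupId (also used as the package name here)
#   $2 = artifactId
#   $3 = version (optional, defaults to 1.0-SNAPSHOT)
generate_crawler_project() {
  mvn archetype:generate \
    -DarchetypeGroupId=org.apache.stormcrawler \
    -DarchetypeArtifactId=stormcrawler-solr-archetype \
    -DarchetypeVersion=3.2.0-SNAPSHOT \
    -DgroupId="$1" \
    -DartifactId="$2" \
    -Dversion="${3:-1.0-SNAPSHOT}" \
    -Dpackage="$1"
}

# Example call (left commented out so that sourcing this file has no side effects):
# generate_crawler_project com.mycompany.crawler stormcrawler
```

Reusing the groupId as the package name is just a convenient default; pass a different `-Dpackage` value if your code should live elsewhere.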
This will create a fully formed project containing not only a POM with the required dependencies but also a set of resources, configuration files and sample topology classes. Enter the directory you just created (it should have the same name as the artifactId you specified earlier) and follow the instructions in the README file.
You will of course need to have both Apache Storm (2.8.0) and Apache Solr (9.8.0) installed.
Official references:
- IndexerBolt: Implementation of AbstractIndexerBolt that indexes the parsed data and metadata into a specified Solr collection.
- MetricsConsumer: Class that stores Storm metrics in Solr.
- SolrSpout: Spout that reads URLs from a specified Solr collection.
- StatusUpdaterBolt: Implementation of AbstractStatusUpdaterBolt that stores the status of each URL, along with its serialized metadata, in Solr.