Crawl with URLFrontier

This project was carried out in the context of the Fed4Fire and NLNet funding of URL Frontier.

First, set the AWS credentials and the FQDN of the master node in a test.properties file, then build the project:

mvn clean package

Inject the seeds into URL Frontier:

java -cp ./target/crawlurlfrontier-1.0-SNAPSHOT.jar crawlercommons.urlfrontier.client.Client PutURLs -f top1M.hosts.commoncrawl

Then submit the topology using the storm command:

storm jar target/crawlurlfrontier-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux --filter test.properties
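
The build, seed injection and topology submission can be chained in a small wrapper script. The following is a minimal sketch and not part of the repository: the `run` helper, the `crawl_pipeline` function and the `DRY_RUN` variable are hypothetical names introduced for illustration. It assumes `mvn`, `java` and `storm` are on the PATH and that test.properties and the seed file sit in the current directory.

```shell
#!/usr/bin/env bash
# Sketch: build, inject the seeds, then submit the topology.
# run(), crawl_pipeline() and DRY_RUN are illustrative, not part of the project.
set -euo pipefail

JAR=target/crawlurlfrontier-1.0-SNAPSHOT.jar
SEEDS=top1M.hosts.commoncrawl

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Run the three steps in order; set -e stops at the first failure.
crawl_pipeline() {
  run mvn clean package
  run java -cp "./$JAR" crawlercommons.urlfrontier.client.Client PutURLs -f "$SEEDS"
  run storm jar "$JAR" org.apache.storm.flux.Flux crawler.flux --filter test.properties
}
```

Add a call to `crawl_pipeline` at the end of the script to execute the steps, or run `DRY_RUN=1 crawl_pipeline` first to print the commands without executing them. Because of `set -e`, the topology is only submitted if the build and the seed injection both succeed.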

If the cluster is running on Docker, open a shell in the nimbus container and submit the topology from there:

docker exec -it nimbus bash
cd crawler
storm jar target/crawlurlfrontier-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux
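
For scripted use, the same Docker steps can be collapsed into a single non-interactive call. This is a sketch under assumptions: the `submit_in_docker` function name is hypothetical, and it assumes the container is named `nimbus` and contains a `crawler` directory with the built jar, as in the commands above. The `-it` flags are dropped because no terminal is attached when running from a script.

```shell
# Sketch: submit the topology inside the nimbus container in one step.
# submit_in_docker is an illustrative name, not part of the project.
submit_in_docker() {
  docker exec nimbus bash -c \
    'cd crawler && storm jar target/crawlurlfrontier-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux'
}
```

Calling `submit_in_docker` returns the exit status of the `storm` command, so it can be used in a larger deployment script.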