This repository was archived by the owner on Jul 10, 2019. It is now read-only.

DigitalPebble/behemoth-commoncrawl

behemoth-commoncrawl

CommonCrawl module for Behemoth. This module converts the old ARC-based (pre-2013) CommonCrawl dataset. For the newer WARC-based CommonCrawl format, use the IO module instead [https://github.com/DigitalPebble/behemoth/tree/master/io].

NOTE: YOU NEED AN AWS ACCOUNT AND MUST SET THE ENVIRONMENT VARIABLES AWS_ACCESS_KEY AND AWS_SECRET_ACCESS_KEY.

INSTRUCTIONS

  • git clone git@github.com:DigitalPebble/behemoth-commoncrawl.git
  • mvn install:install-file -DgroupId=org.commoncrawl -DartifactId=commoncrawl -Dversion=1.0 -Dpackaging=jar -Dfile=lib/commoncrawl-1.0.jar
  • mvn clean install
  • hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107105/* cc-test
  • check the output with:
    hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i cc-test
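
Since the conversion step reads both credentials from the environment, it can help to fail early when one of them is missing. The following is a minimal sketch, not part of this repository; the function name check_aws_env is hypothetical.

```shell
# Hypothetical helper (not shipped with behemoth-commoncrawl):
# returns non-zero if either credential variable from the NOTE is unset or empty.
check_aws_env() {
  missing=0
  if [ -z "${AWS_ACCESS_KEY:-}" ]; then
    echo "missing environment variable: AWS_ACCESS_KEY" >&2
    missing=1
  fi
  if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "missing environment variable: AWS_SECRET_ACCESS_KEY" >&2
    missing=1
  fi
  return "$missing"
}
```

Calling check_aws_env before the hadoop command and stopping on a non-zero exit status avoids launching a job that would only fail once it tries to read from S3.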

The converter can also take the text version of the CC dataset as input when the -text parameter is added, e.g.

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107105/* cc-test -text

which adds the extracted text to the text field of the BehemothDocuments. Note that the contentType and content fields are not set, because they pertain to the original content and not to the extracted text.
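
The two invocations differ only in the trailing -text flag, so a small wrapper can assemble either form. This is a sketch under assumptions: the function name converter_cmd and its third argument are hypothetical, and it only prints the command (with the credential variables left unexpanded) so that it can be reviewed before being run.

```shell
# Jar path and class name are taken from the instructions above.
JAR=./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar
CLASS=com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012

# Hypothetical helper: $1 = input path, $2 = output directory,
# $3 = "text" to append the -text parameter (optional).
converter_cmd() {
  # \$ keeps the credential variable names literal so no secret is printed.
  cmd="hadoop jar $JAR $CLASS -D fs.s3n.awsAccessKeyId=\$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=\$AWS_SECRET_ACCESS_KEY $1 $2"
  if [ "${3:-}" = "text" ]; then
    cmd="$cmd -text"
  fi
  printf '%s\n' "$cmd"
}
```

For example, converter_cmd 's3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107105/*' cc-test text prints the second command shown above; omitting the third argument prints the first.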
