Getting Started
The following must be installed before use:
Groovy 1.8.8 or newer (there is a known bug in JSON parsing in Groovy 1.8.6)
ruby-yajl
Titan Distributed Graph Database
The Python script AutomatedParallelParser.py automates downloading and parsing the data in parallel. Review the settings at the top of this script, paying particular attention to:
#system specific settings
sortMem = '14G' #memory for sort. maximize.
threads = 8
#start and end hours to fetch from GitHubArchive
startHour = '2012-03-12-01' #set to 'beginning' for earliest possible
endHour = '2012-11-09-23' #set to 'now' for last possible
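The startHour/endHour pair above denotes an inclusive range of hourly GitHub Archive files. As a rough sketch of what the script does with these settings, the range can be expanded into individual hour stamps like this (hour_range is a hypothetical helper for illustration; the script's own expansion logic and the archive's exact filename scheme may differ):

```python
from datetime import datetime, timedelta

def hour_range(start_hour, end_hour):
    """Expand a start/end pair like '2012-03-12-01' into every hourly
    timestamp in between, inclusive on both ends."""
    fmt = '%Y-%m-%d-%H'
    current = datetime.strptime(start_hour, fmt)
    end = datetime.strptime(end_hour, fmt)
    hours = []
    while current <= end:
        hours.append(current.strftime(fmt))
        current += timedelta(hours=1)
    return hours

print(hour_range('2012-03-12-01', '2012-03-12-05'))
# ['2012-03-12-01', '2012-03-12-02', '2012-03-12-03',
#  '2012-03-12-04', '2012-03-12-05']
```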
It will take a few hours to parse the entire archive. Start the script with
$ export LC_ALL="C"
$ export JAVA_OPTIONS="-Xmx1G"
$ python AutomatedParallelParser.py batch
This script downloads the GitHub Archive files in the specified time range, uncompresses them, and preformats the records. The downloaded and preformatted files are placed in the scratchDir specified in the script. If parsing is interrupted for some reason and restarted, the locally cached files are reused, saving time. To delete these files once they are no longer needed, use
$ python AutomatedParallelParser.py clean
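The caching behavior described above can be sketched as follows. This is an illustration only, not the script's actual code: SCRATCH_DIR, fetch_hour, and clean are hypothetical names standing in for the script's scratchDir setting and its download/clean logic, and the download URL is an assumption:

```python
import os
import urllib.request

SCRATCH_DIR = '/tmp/gharchive-scratch'  # stands in for scratchDir in the script

def fetch_hour(hour):
    """Download one hourly archive unless a cached copy already exists,
    so an interrupted run can resume without re-downloading."""
    path = os.path.join(SCRATCH_DIR, hour + '.json.gz')
    if os.path.exists(path):
        return path  # reuse the locally cached file
    os.makedirs(SCRATCH_DIR, exist_ok=True)
    url = 'http://data.githubarchive.org/%s.json.gz' % hour  # assumed URL scheme
    urllib.request.urlretrieve(url, path)
    return path

def clean():
    """Delete the cached files once they are no longer needed
    (the equivalent of running the script with the 'clean' argument)."""
    for name in os.listdir(SCRATCH_DIR):
        os.remove(os.path.join(SCRATCH_DIR, name))
```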
To start loading, first review the options at the beginning of ImportGitHubArchive.groovy, then run:
$ export JAVA_OPTIONS="-Xmx12G"
$ gremlin -e ImportGitHubArchive.groovy <path to vertex file> <path to edge file>
Loading the data on an m1.xlarge Titan/HBase instance on Amazon EC2 takes about 11 hours. When it is finished you should see a summary like this:
Done. Statistics:
28590508 vertices
79374999 edges
40425.722 seconds elapsed
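As a sanity check on the "about 11 hours" figure, the summary numbers above imply a sustained load rate of roughly 2,700 graph elements per second:

```python
# Figures from the load summary above
vertices = 28590508
edges = 79374999
seconds = 40425.722

rate = (vertices + edges) / seconds
print(round(seconds / 3600, 1))  # 11.2 hours elapsed
print(round(rate))               # 2671 elements loaded per second
```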