vadasg edited this page Dec 9, 2012 · 8 revisions

The following must be installed before use:

- Groovy 1.8.8 or newer (there is a known bug in JSON parsing in Groovy 1.8.6)
- ruby-yajl
- Titan Distributed Graph Database

Parsing into graph representation

The Python script AutomatedParallelParser.py automates downloading and parsing the data in parallel. Review the settings at the top of the script, paying particular attention to:

#system specific settings
sortMem = '14G'   #memory for sort. maximize.
threads = 8 

#start and end hours to fetch from GitHubArchive
startHour = '2012-03-12-01'  #set to 'beginning' for earliest possible
endHour = '2012-11-09-23'    #set to 'now' for last possible
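As a rough sketch of how the startHour/endHour settings translate into hourly archive files, the snippet below enumerates every hour stamp between the two bounds in the 'YYYY-MM-DD-HH' format shown above. The function name hour_range is hypothetical, not part of AutomatedParallelParser.py.

```python
# Hypothetical helper: list every hourly stamp between startHour and
# endHour (inclusive), in the 'YYYY-MM-DD-HH' format used in the settings.
from datetime import datetime, timedelta

def hour_range(start_hour, end_hour):
    fmt = '%Y-%m-%d-%H'
    t = datetime.strptime(start_hour, fmt)
    end = datetime.strptime(end_hour, fmt)
    hours = []
    while t <= end:
        hours.append(t.strftime(fmt))  # one entry per archive hour
        t += timedelta(hours=1)
    return hours

print(hour_range('2012-03-12-01', '2012-03-12-04'))
# ['2012-03-12-01', '2012-03-12-02', '2012-03-12-03', '2012-03-12-04']
```

Eight months of data at one file per hour is several thousand files, which is why the script parallelizes the download and parse steps.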

It will take a few hours to parse the entire archive. Start the script with:

$ export LC_ALL="C"
$ export JAVA_OPTIONS="-Xmx1G"
$ python AutomatedParallelParser.py batch

This script downloads the GitHub Archive files in the specified time range, uncompresses them, and preformats the records. The downloaded and preformatted files are placed in the scratchDir specified in the script. If parsing is interrupted and restarted, the locally cached files are reused, saving time. Once they are no longer needed, delete them with:
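The cache-then-download behaviour described above can be sketched as follows. This is an illustrative outline, not the script's actual code: fetch_hour and scratch_dir are hypothetical names (scratch_dir mirrors the script's scratchDir setting), and the URL follows the public GitHub Archive layout.

```python
# Hypothetical sketch of the caching download step: fetch an hourly
# archive file only if it is not already present in the scratch directory,
# so an interrupted run can resume without re-downloading.
import os
import urllib.request

def fetch_hour(hour, scratch_dir):
    name = hour + '.json.gz'
    path = os.path.join(scratch_dir, name)
    if os.path.exists(path):  # reuse the locally cached copy
        return path
    url = 'http://data.githubarchive.org/' + name  # GitHub Archive layout (assumption)
    urllib.request.urlretrieve(url, path)
    return path
```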

$ python AutomatedParallelParser.py clean

Importing graph into Titan

To start loading, first review the options at the beginning of ImportGitHubArchive.groovy, then run:

$ export JAVA_OPTIONS="-Xmx12G"
$ gremlin -e ImportGitHubArchive.groovy <path to vertex file> <path to edge file>

Loading the data on an m1.xlarge Titan/HBase instance on Amazon EC2 takes about 11 hours. When it is finished you should see a summary like this:

Done.  Statistics:
28590508 vertices
79374999 edges
40425.722 seconds elapsed
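From the figures in the summary above, the average load throughput works out as follows (simple arithmetic on the reported numbers, nothing more):

```python
# Sanity arithmetic on the import summary: average loading throughput.
vertices = 28590508
edges = 79374999
seconds = 40425.722

rate = (vertices + edges) / seconds  # graph elements loaded per second
print(round(rate))  # about 2671 elements per second on average
```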