
Maven Indexing

Pierce Darragh edited this page Oct 30, 2018 · 8 revisions

To obtain a representative test base for Jade, we decided to index the Maven Central repository.

The main repository is hosted by Sonatype. It is difficult to access programmatically.

Starting in 2015, Google began hosting a mirror of the repository on Google Cloud Storage. The root of this repository is located at

Although the robots.txt file says all user-agents are prohibited, we emailed Les Vogel through the [email protected] address and he said we were fine to do whatever we wanted (within reason).

Accessing files on Google Cloud Storage is relatively straightforward. A (very simple) example is given here under the "Cloud Storage" heading, but a more thorough walkthrough will be given below.

(All example code given is in Python 3.7.0.)

Accessing the repository

To access files on Google Cloud Storage (GCS), it appears necessary to have an account with Google Cloud Platform (GCP). Accounts can be created for free. Once an account is created, follow these instructions (under the heading "Obtaining and providing service account credentials manually" in the "GCP CONSOLE" box) to obtain an authentication .json file on your local machine. Almost nothing matters about the configuration except that the file correctly corresponds to your account.

Once you have your file (which I will refer to as auth.json) on your local machine, install the GCS Python library via pip:

$ pip install google-cloud-storage

(Note that if you use both Python 2 and 3, you may need to specify pip3 or else a full path to the appropriate pip executable for your Python interpreter of choice.)

To interact with the repository, we need to obtain a Bucket:

from google.cloud import storage

MAVEN_BUCKET = 'maven-central'
AUTH_FILE = 'auth.json'

client = storage.Client.from_service_account_json(AUTH_FILE)
bucket = client.get_bucket(MAVEN_BUCKET)

We now have access to the Bucket. There are many methods on the Bucket object, but we only care about bucket.list_blobs(), which provides an iterator over all the blobs (objects) in the bucket (repository). (This method takes an optional parameter, max_results, which sets the maximum number of blobs to iterate through.)
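Before walking the whole bucket, it can be useful to peek at the first few blobs using max_results. The following is a minimal sketch: preview_blobs is a hypothetical helper name (not part of the library), and it assumes bucket is the Bucket object obtained above.

```python
def preview_blobs(bucket, limit=5):
    """Return (name, size) pairs for the first `limit` blobs in `bucket`.

    `bucket` is assumed to be the google.cloud.storage Bucket obtained above;
    `preview_blobs` is a hypothetical helper, not a library function.
    """
    # max_results caps how many blobs the iterator will yield.
    return [(blob.name, blob.size)
            for blob in bucket.list_blobs(max_results=limit)]
```

Calling preview_blobs(bucket) and printing each pair is a cheap way to confirm that the credentials and bucket name work before starting a multi-hour listing.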

Building a local index of files in the repository

A tab-separated index file (index.tsv) can be generated:


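i = 0
with open('index.tsv', 'w') as f:
    for blob in bucket.list_blobs():
        # Columns: index number, blob name, size in bytes.
        f.write(f'{i}\t{blob.name}\t{blob.size}\n')
        i += 1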

The file will have three columns: the number of the blob in the index, the name of the blob (which is the full file name in the repository), and the size of that blob in bytes.

Note that building the index took just shy of 9 hours.

Downloading individual blobs

To download a blob blob-name to a file file-name, simply do:


blob = bucket.get_blob('blob-name')
blob.download_to_filename('file-name')
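A slightly more defensive version of the download step might look like the sketch below. It relies on the fact that get_blob returns None when no blob with the given name exists; download_blob is a hypothetical helper name, and bucket is assumed to be the Bucket obtained earlier.

```python
def download_blob(bucket, blob_name, file_name):
    """Download `blob_name` from `bucket` into the local file `file_name`.

    Returns True on success and False if the blob does not exist.
    `download_blob` is a hypothetical helper name, not a library function.
    """
    blob = bucket.get_blob(blob_name)  # None when the blob is missing
    if blob is None:
        return False
    blob.download_to_filename(file_name)
    return True
```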

Processing the index

Now we have an index index.tsv that tells us all of the files in Maven as well as their sizes. As of this writing, a little processing provided the following statistics from the index:

  • Index file is 8.3 GiB
  • ~71M blobs (71,364,531)
  • ~7.8M .jar files (7,752,139)
  • ~270k artifacts (269,285)
  • Total size of all files in repo: ~9TiB (10,103,426,642,816 bytes)
  • Total size of just .jar files: ~4TiB (4,586,501,379,706 bytes)
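The blob, .jar, and size totals above can be reproduced with a single pass over the index. This is only a sketch of one way to do it, not necessarily the processing that was actually used; summarize_index is a hypothetical name, and it assumes each line has exactly the three tab-separated columns described earlier. It does not attempt the artifact count.

```python
def summarize_index(lines):
    """Compute simple totals from lines of the form 'number<TAB>name<TAB>size'."""
    stats = {'blobs': 0, 'jars': 0, 'total_bytes': 0, 'jar_bytes': 0}
    for line in lines:
        # Assumes blob names contain no tab characters.
        _number, name, size = line.rstrip('\n').split('\t')
        size = int(size)
        stats['blobs'] += 1
        stats['total_bytes'] += size
        if name.endswith('.jar'):
            stats['jars'] += 1
            stats['jar_bytes'] += size
    return stats
```

Because a file object iterates line by line, passing open('index.tsv') directly keeps memory use flat even for an index of several GiB. Dividing a byte total by 2**40 gives the size in TiB.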

Hash and signature files

A quick browse through the list of files will reveal that the predominant file types by extension are:

  1. .md5 (18,072,497)
  2. .sha1 (18,050,130)
  3. .asc (10,896,218)
  4. .jar (7,752,139)
  5. .json (6,307,140)
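A tally like the one above can be produced by counting the final extension of every blob name. The sketch below is one possible approach (count_extensions is a hypothetical name); it uses posixpath because blob names use forward slashes regardless of the local platform.

```python
from collections import Counter
import posixpath

def count_extensions(names):
    """Count blob names by their final extension (e.g. '.md5', '.jar')."""
    counts = Counter()
    for name in names:
        # splitext returns only the last suffix, so 'x.jar.md5' counts as '.md5'.
        # Names without an extension are counted under the empty string.
        counts[posixpath.splitext(name)[1]] += 1
    return counts
```

The most frequent extensions can then be read off with counts.most_common(5).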

.md5 and .sha1 files contain only hashes used to verify the integrity of other files. That is, a file foo may have a foo.md5 or foo.sha1 (or both), in which case foo.md5 and/or foo.sha1 contain hashes of the file foo.
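The integrity check that a hash file enables can be sketched as follows. This is an illustration only: matches_checksum is a hypothetical name, and it assumes the hash file begins with the hex digest (anything after the first whitespace-separated token is ignored).

```python
import hashlib

def matches_checksum(data, checksum_text, algorithm):
    """Check `data` (bytes) against the text of a .md5 or .sha1 file.

    `algorithm` is 'md5' or 'sha1'. Only the first token of `checksum_text`
    is compared, on the assumption that it is the hex digest.
    """
    tokens = checksum_text.split()
    if not tokens:
        return False  # empty hash file: nothing to compare against
    expected = tokens[0].lower()
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected
```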

.asc files contain GPG signatures for a similar purpose. So a file foo.asc contains the GPG signature of foo.

It may be worth noting that most .asc files seem to also have corresponding .md5 and .sha1 files, such that it is common to see all of the following (for some file foo):

foo
foo.asc
foo.asc.md5
foo.asc.sha1
foo.md5
foo.sha1

We wanted to be sure of this assertion, though. We needed to verify that every foo.md5, foo.sha1, or foo.asc corresponds to an existing foo. To that end, we used some shell one-liners.

Verifying whether every hash file corresponds to a base file

First, we produced a file containing just the filenames for every file in the index. We did this so we could later use the comm utility (which does a fast byte-wise line-by-line comparison of two sorted files). This was done by:

$ perl -ane 'print "$F[1]\n"' < index.tsv > filenames.txt

This puts the filenames in the file filenames.txt.

Then we produced a file containing the names of files which we expect to exist based on the presence of their hash files (either .md5 or .sha1):

$ perl -ane 'if ($F[1] =~ /\.(md5|sha1)/) {print "$`\n"}' < index.tsv > hash-basenames.txt

This would take file names like those from the previous section (for example foo.asc.md5, foo.asc.sha1, foo.md5, and foo.sha1) and produce:

foo.asc
foo.asc
foo
foo

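We can see that there are some duplicates. To remove duplicates and also sort the output, we wrote small programs (TODO: link to those). Note that the output has to go to a separate file first (the temporary name here is arbitrary), because redirecting into the file being read would truncate it before it is read:

$ ./uniqsemisort < hash-basenames.txt > hash-basenames.tmp && mv hash-basenames.tmp hash-basenames.txt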

Now we can compare the expected filenames (in hash-basenames.txt) to the full list of existing filenames (filenames.txt):

$ comm -1 -3 filenames.txt hash-basenames.txt > hash-comparison.txt

The resulting output file, hash-comparison.txt, contains a list of files which we expected to exist (based on the presence of either a .md5 or .sha1 file) but which did not exist.
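For readers who prefer Python over the shell, the same comparison can be expressed as a set difference. This is a sketch rather than the method described above: missing_base_files is a hypothetical name, and the regular expression is anchored at the end of the name, which is stricter than the unanchored pattern in the Perl one-liner.

```python
import re

# Matches a trailing .md5 or .sha1 suffix only (anchored, unlike the one-liner).
HASH_SUFFIX = re.compile(r'\.(md5|sha1)$')

def missing_base_files(names):
    """Return sorted base names implied by hash files but absent from `names`."""
    existing = set(names)
    expected = set()
    for name in existing:
        match = HASH_SUFFIX.search(name)
        if match:
            # Everything before the suffix is the file the hash describes.
            expected.add(name[:match.start()])
    return sorted(expected - existing)
```

Holding tens of millions of names in a set needs several gigabytes of memory, which is one reason the sorted-file approach with comm is attractive at this scale.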

Analyzing the missing hash files

We came up with some 2,985 missing files.

There are 54 central-metadata.json files. These are not listed when browsing Maven Central's folders through the browser, but they are accessible.

There are 495 maven-metadata.xml files. All of these are nested inside of dot-folders (either .DAV or .svn). There is one extra #maven-metadata.xml.

There are 154 *.gz files. 150 of these are in the top-level .index directory and appear to be concerned with the index itself. 4 exist in other places.
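A breakdown like the one above can be obtained by grouping the names in hash-comparison.txt by their final path component. The sketch below is one hypothetical way to do so (categorize_missing and the 'other' bucket are illustrative choices, not taken from the original analysis).

```python
from collections import Counter
import posixpath

def categorize_missing(names):
    """Group missing file names into the categories discussed above."""
    counts = Counter()
    for name in names:
        base = posixpath.basename(name)
        if base == 'central-metadata.json':
            counts['central-metadata.json'] += 1
        elif base == 'maven-metadata.xml':
            counts['maven-metadata.xml'] += 1
        elif base.endswith('.gz'):
            counts['*.gz'] += 1
        else:
            # Anything else, including oddities such as '#maven-metadata.xml'.
            counts['other'] += 1
    return counts
```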

Verifying whether every GPG signature file corresponds to a base file

