Maven Indexing

To obtain a representative test base for Jade, we decided to index the Maven Central repository.

The main repository is hosted by Sonatype, but it is difficult to access programmatically.

Starting in 2015, Google began hosting a mirror of the repository on Google Cloud Storage. The root of this repository is located at https://storage.googleapis.com/maven-central/.

Although the robots.txt file says all user-agents are prohibited, we emailed Les Vogel through the [email protected] address and he said we were fine to do whatever we wanted (within reason).

Accessing files on Google Cloud Storage is relatively straightforward. A (very simple) example is given here under the "Cloud Storage" heading, but more specific examples will be given below.

Accessing the index

(All example code given is in Python 3.7.0.)

To access files on Google Cloud Storage (GCS), it appears necessary to have an account with Google Cloud Platform (GCP). Accounts can be created for free. Once an account is created, follow these instructions (under the heading "Obtaining and providing service account credentials manually" in the "GCP CONSOLE" box) to obtain an authentication .json file on your local machine. Almost nothing matters about the configuration except that the file correctly corresponds to your account.

Once you have your file (which I will refer to as auth.json) on your local machine, install the GCS Python library via pip:

$ pip install google-cloud-storage

(Note that if you use both Python 2 and 3, you may need to specify pip3 or else a full path to the appropriate pip executable for your Python interpreter of choice.)

To interact with the repository, we need to obtain a Bucket:

from google.cloud import storage

MAVEN_BUCKET = 'maven-central'
AUTH_FILE = 'auth.json'

client = storage.Client.from_service_account_json(AUTH_FILE)
bucket = client.get_bucket(MAVEN_BUCKET)

We now have access to the Bucket. There are many methods in the Bucket object, but we only care about bucket.list_blobs(), which provides an iterator over all the blobs (objects) in the bucket (repository). (This method takes an optional parameter, max_results, which denotes the maximum number of blobs to iterate through.)
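Before walking the whole bucket, it can be useful to peek at the first few blobs to check that the credentials and bucket name work. The following is a minimal sketch under that assumption; the helper name peek_blobs and its limit parameter are hypothetical and only illustrate the max_results parameter of list_blobs().

```python
def peek_blobs(bucket, limit=5):
    """Return (name, size) pairs for at most `limit` blobs in `bucket`.

    `bucket` is expected to be a google.cloud.storage Bucket (or any object
    with a compatible list_blobs method). This helper is illustrative only.
    """
    # max_results caps the total number of blobs the iterator will yield.
    return [(blob.name, blob.size)
            for blob in bucket.list_blobs(max_results=limit)]
```

With the bucket obtained above, calling peek_blobs(bucket, limit=10) should return the names and sizes of the first ten blobs without iterating over the rest of the repository.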

Building a local index of files in the repository

A tab-separated index file (index.tsv) can be generated:

[...]

with open('index.tsv', 'w') as f:
    # enumerate() supplies the running blob number for the first column
    for i, blob in enumerate(bucket.list_blobs()):
        f.write(f"{i}\t{blob.name}\t{blob.size}\n")

The file will have three columns: the number of the blob in the index, the name of the blob (which is the full file name in the repository), and the size of that blob in bytes.

Note that building the index took just shy of 9 hours.

Downloading individual blobs

To download a blob blob-name to a file file-name, simply do:

[...]

blob = bucket.get_blob('blob-name')
blob.download_to_filename('file-name')
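Note that bucket.get_blob() returns None when no blob with the given name exists, so the two lines above raise an AttributeError for a mistyped name. A slightly more defensive sketch is shown here; the helper name download_blob and its boolean return value are hypothetical choices, not part of the GCS library.

```python
def download_blob(bucket, blob_name, file_name):
    """Download `blob_name` from `bucket` to the local path `file_name`.

    Returns True if the blob was found and downloaded, False if the bucket
    has no blob with that name. Illustrative helper only.
    """
    blob = bucket.get_blob(blob_name)
    if blob is None:
        # get_blob() signals a missing object by returning None.
        return False
    blob.download_to_filename(file_name)
    return True
```

A caller can then skip or log missing names, for example when downloading a list of blob names taken from the index.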

Processing the index

Now we have an index index.tsv that tells us all of the files in Maven as well as their sizes. As of this writing, a little processing provided the following statistics from the index:

Index file is 8.3 GiB
~71 million blobs (71,364,531)
~7.8 million .jar files (7,752,139)
~270k artifacts (269,285)
Total size of all files in repo: ~9 TiB (10,103,426,642,816 bytes)
Total size of just .jar files: ~4 TiB (4,586,501,379,706 bytes)
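The blob counts and byte totals can be reproduced with a single pass over the index. The sketch below is one possible way to do it, not necessarily the processing that produced the numbers above. It assumes the three-column layout of index.tsv described earlier and that blob names contain no tab characters; the function names summarize_index and to_tib are hypothetical. The artifact count is left out because it depends on how an artifact is defined.

```python
def summarize_index(path):
    """Return (blob_count, jar_count, total_bytes, jar_bytes) for an index file.

    Each line is expected to be: blob number, blob name, size in bytes,
    separated by tabs.
    """
    blob_count = jar_count = total_bytes = jar_bytes = 0
    with open(path, encoding="utf-8") as f:
        # Stream line by line; the index is too large to load comfortably.
        for line in f:
            _, name, size = line.rstrip("\n").split("\t")
            size = int(size)
            blob_count += 1
            total_bytes += size
            if name.endswith(".jar"):
                jar_count += 1
                jar_bytes += size
    return blob_count, jar_count, total_bytes, jar_bytes


def to_tib(num_bytes):
    """Convert a byte count to tebibytes (1 TiB = 2**40 bytes)."""
    return num_bytes / 2**40
```

For reference, to_tib(10_103_426_642_816) is roughly 9.19 and to_tib(4_586_501_379_706) is roughly 4.17, which matches the rounded totals listed above.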
