Maven Indexing
To obtain a representative test base for Jade, we decided to index the Maven Central repository.
The main repository is hosted by Sonatype. It is difficult to access programmatically.
Starting in 2015, Google began hosting a mirror of the repository on Google Cloud Storage. The root of this repository is located at https://storage.googleapis.com/maven-central/.
Although the robots.txt file says all user-agents are prohibited, we emailed Les Vogel through the [email protected] address and he said we were fine to do whatever we wanted (within reason).
Accessing files on Google Cloud Storage is relatively straightforward. A (very simple) example is given here under the "Cloud Storage" heading, but more specific examples will be given below.
(All example code given is in Python 3.7.0.)
To access files on Google Cloud Storage (GCS), it appears necessary to have an account with Google Cloud Platform (GCP). Accounts can be created for free. Once an account is created, follow these instructions (under the heading "Obtaining and providing service account credentials manually" in the "GCP CONSOLE" box) to obtain an authentication .json file on your local machine. Almost nothing matters about the configuration except that the file correctly corresponds to your account.
Once you have your file (which I will refer to as auth.json) on your local machine, install the GCS Python library via pip:
$ pip install google-cloud-storage
(Note that if you use both Python 2 and 3, you may need to specify pip3 or else a full path to the appropriate pip executable for your Python interpreter of choice.)
To interact with the repository, we need to obtain a Bucket:
from google.cloud import storage
MAVEN_BUCKET = 'maven-central'
AUTH_FILE = 'auth.json'
client = storage.Client.from_service_account_json(AUTH_FILE)
bucket = client.get_bucket(MAVEN_BUCKET)
We now have access to the Bucket. There are many methods in the Bucket object, but we only care about bucket.list_blobs(), which provides an iterator over all the blobs (objects) in the bucket (repository). (This method takes an optional parameter, max_results, which denotes the maximum number of blobs to iterate through.)
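Before walking the whole bucket, it can help to look at just a handful of blobs. The following is a minimal sketch, not part of the original tooling: preview_blobs is a hypothetical helper name, and prefix is another list_blobs parameter (not mentioned above) that restricts the listing to names starting with a given string. The bucket argument is assumed to be the Bucket obtained earlier.

```python
def preview_blobs(bucket, limit=10, prefix=None):
    """Return (name, size) pairs for at most `limit` blobs in `bucket`.

    `bucket` is assumed to be a google.cloud.storage Bucket, such as the
    one returned by client.get_bucket('maven-central').
    """
    # max_results caps how many blobs the iterator yields; prefix (optional)
    # keeps only blobs whose names start with the given string.
    blobs = bucket.list_blobs(max_results=limit, prefix=prefix)
    return [(blob.name, blob.size) for blob in blobs]
```

For example, preview_blobs(bucket, limit=5, prefix='junit/') would return up to five (name, size) pairs for blobs whose names begin with junit/ (the prefix here is only an illustration).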
A tab-separated index file (index.tsv) can be generated:
[...]
with open('index.tsv', 'w') as f:
    # enumerate() numbers each blob, starting from 0
    for i, blob in enumerate(bucket.list_blobs()):
        f.write(f"{i}\t{blob.name}\t{blob.size}\n")
The file will have three columns: the number of the blob in the index, the name of the blob (which is the full file name in the repository), and the size of that blob in bytes.
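Since the index is far too large to load into memory comfortably, it is probably best read back one row at a time. Here is a minimal sketch of such a reader, assuming the three-column layout described above; read_index is a hypothetical helper name, not something from the original scripts.

```python
import csv

def read_index(path='index.tsv'):
    """Yield (number, name, size) tuples from the index, one row at a time."""
    with open(path, newline='') as f:
        # QUOTE_NONE: treat quote characters in blob names as ordinary text.
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue  # skip blank lines
            number, name, size = row
            yield int(number), name, int(size)
```

Because read_index is a generator, something like `for number, name, size in read_index(): ...` streams through the file without holding all rows in memory.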
Note that building the index took just shy of 9 hours.
To download a blob blob-name to a file file-name, simply do:
[...]
blob = bucket.get_blob('blob-name')
blob.download_to_filename('file-name')
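The two calls above can be wrapped so that a downloaded blob lands in a local directory tree mirroring its name in the repository. This is only a sketch under assumptions: download_blob and the downloads directory are hypothetical names, and bucket is assumed to be the Bucket obtained earlier. It relies on get_blob returning None when no blob with that name exists.

```python
import os

def download_blob(bucket, blob_name, dest_root='downloads'):
    """Download `blob_name` under `dest_root`, mirroring its repository path.

    Returns the local path, or None if the blob does not exist.
    """
    blob = bucket.get_blob(blob_name)
    if blob is None:
        return None  # no blob with that name in the bucket
    # Blob names use '/' as a separator; rebuild the path for the local OS.
    local_path = os.path.join(dest_root, *blob_name.split('/'))
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    blob.download_to_filename(local_path)
    return local_path
```

Blob names taken from the second column of index.tsv can be passed straight in as blob_name.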
Now we have an index index.tsv that tells us all of the files in Maven as well as their sizes. As of this writing, a little processing provided the following statistics from the index:
- Index file is 8.3 GiB
- ~71 million blobs (71,364,531)
- ~7.8 million .jar files (7,752,139)
- ~270k artifacts (269,285)
- Total size of all files in repo: ~9 TiB (10,103,426,642,816 bytes)
- Total size of just .jar files: ~4 TiB (4,586,501,379,706 bytes)
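The blob counts and byte totals above can be recomputed with a single pass over the index. The following is a rough sketch of that kind of processing, not the script actually used: summarize_index and to_tib are hypothetical names, and the artifact count is left out because how artifacts were identified is not described here.

```python
def summarize_index(path='index.tsv'):
    """Count blobs and .jar files in the index and sum their sizes in bytes."""
    stats = {'blobs': 0, 'jars': 0, 'total_bytes': 0, 'jar_bytes': 0}
    with open(path) as f:
        for line in f:
            # Columns: blob number, blob name, blob size in bytes.
            _, name, size = line.rstrip('\n').split('\t')
            size = int(size)
            stats['blobs'] += 1
            stats['total_bytes'] += size
            if name.endswith('.jar'):
                stats['jars'] += 1
                stats['jar_bytes'] += size
    return stats

def to_tib(num_bytes):
    """Convert a byte count to tebibytes (1 TiB = 2**40 bytes)."""
    return num_bytes / 2**40
```

As a sanity check on the units, to_tib(10_103_426_642_816) comes out a little under 9.2, consistent with the ~9 TiB figure in the list.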