Add a cache to the registry source

alexcrichton · alexcrichton · commit 5d1ea9ff45b7 · 2014-10-27T12:40:23.000-07:00
This should help avoid hitting disk wherever possible and will also be leveraged
in just a second!
diff --git a/src/cargo/sources/registry.rs b/src/cargo/sources/registry.rs
@@ -1,3 +1,163 @@
+//! A `Source` for registry-based packages.
+//!
+//! # What's a Registry?
+//!
+//! Registries are central locations where packages can be uploaded to,
+//! discovered, and searched for. The purpose of a registry is to have a
+//! location that serves as permanent storage for versions of a crate over time.
+//!
+//! Compared to git sources, a registry provides many packages as well as many
+//! versions simultaneously. Git sources can also have commits deleted through
+//! rebasings where registries cannot have their versions deleted.
+//!
+//! # The Index of a Registry
+//!
+//! One of the major difficulties with a registry is that hosting so many
+//! packages may quickly run into performance problems when dealing with
+//! dependency graphs. It's infeasible for cargo to download the entire contents
+//! of the registry just to resolve one package's dependencies, for example. As
+//! a result, cargo needs some efficient method of querying what packages are
+//! available on a registry, what versions are available, and what the
+//! dependencies for each version is.
+//!
+//! One method of doing so would be having the registry expose an HTTP endpoint
+//! which can be queried with a list of packages and a response of their
+//! dependencies and versions is returned. This is somewhat inefficient however
+//! as we may have to hit the endpoint many times and we may have already
+//! queried for much of the data locally already (for other packages, for
+//! example). This also involves inventing a transport format between the
+//! registry and Cargo itself, so this route was not taken.
+//!
+//! Instead, Cargo communicates with registries through a git repository
+//! referred to as the Index. The Index of a registry is essentially an easily
+//! query-able version of the registry's database for a list of versions of a
+//! package as well as a list of dependencies for each version.
+//!
+//! Using git to host this index provides a number of benefits:
+//!
+//! * The entire index can be stored efficiently locally on disk. This means
+//!   that all queries of a registry can happen locally and don't need to touch
+//!   the network.
+//!
+//! * Updates of the index are quite efficient. Using git buys incremental
+//!   updates, compressed transmission, etc for free. The index must be updated
+//!   each time we need fresh information from a registry, but this is one
+//!   update of a git repository that probably hasn't changed a whole lot so
+//!   it shouldn't be too expensive.
+//!
+//!   Additionally, each modification to the index is just appending a line at
+//!   the end of a file (the exact format is described later). This means that
+//!   the commits for an index are quite small and easily applied/compressable.
+//!
+//! ## The format of the Index
+//!
+//! The index is a store for the list of versions for all packages known, so its
+//! format on disk is optimized slightly to ensure that `ls registry` doesn't
+//! produce a list of all packages ever known. The index also wants to ensure
+//! that there's not a million files which may actually end up hitting
+//! filesystem limits at some point. To this end, a few decisions were made
+//! about the format of the registry:
+//!
+//! 1. Each crate will have one file corresponding to it. Each version for a
+//!    crate will just be a line in this file.
+//! 2. There will be two tiers of directories for crate names, under which
+//!    crates corresponding to those tiers will be located.
+//!
+//! As an example, this is an example hierarchy of an index:
+//!
+//! ```notrust
+//! .
+//! ├── 3
+//! │   └── u
+//! │       └── url
+//! ├── bz
+//! │   └── ip
+//! │       └── bzip2
+//! ├── config.json
+//! ├── en
+//! │   └── co
+//! │       └── encoding
+//! └── li
+//!     ├── bg
+//!     │   └── libgit2
+//!     └── nk
+//!         └── link-config
+//! ```
+//!
+//! The root of the index contains a `config.json` file with a few entries
+//! corresponding to the registry (see `RegistryConfig` below).
+//!
+//! Otherwise, there are three numbered directories (1, 2, 3) for crates with
+//! names 1, 2, and 3 characters in length. The 1/2 directories simply have the
+//! crate files underneath them, while the 3 directory is sharded by the first
+//! letter of the crate name.
+//!
+//! Otherwise the top-level directory contains many two-letter directory names,
+//! each of which has many sub-folders with two letters. At the end of all these
+//! are the actual crate files themselves.
+//!
+//! The purpose of this layou tis to hopefully cut down on `ls` sizes as well as
+//! efficient lookup based on the crate name itself.
+//!
+//! ## Crate files
+//!
+//! Each file in the index is the history of one crate over time. Each line in
+//! the file corresponds to one version of a crate, stored in JSON format (see
+//! the `RegistryPackage` structure below).
+//!
+//! As new versions are published, new lines are appended to this file. The only
+//! modifications to this file that should happen over time are yanks of a
+//! particular version.
+//!
+//! # Downloading Packages
+//!
+//! The purpose of the Index was to provide an efficient method to resolve the
+//! dependency graph for a package. So far we only required one network
+//! interaction to update the registry's repository (yay!). After resolution has
+//! been performed, however we need to download the contents of packages so we
+//! can read the full manifest and build the source code.
+//!
+//! To accomplish this, this source's `download` method will make an HTTP
+//! request per-package requested to download tarballs into a local cache. These
+//! tarballs will then be unpacked into a destination folder.
+//!
+//! Note that because versions uploaded to the registry are frozen forever that
+//! the HTTP download and unpacking can all be skipped if the version has
+//! already been downloaded and unpacked. This caching allows us to only
+//! download a package when absolutely necessary.
+//!
+//! # Filesystem Hierarchy
+//!
+//! Overall, the `$HOME/.cargo` looks like this when talking about the registry:
+//!
+//! ```notrust
+//! # A folder under which all registry metadata is hosted (similar to
+//! # $HOME/.cargo/git)
+//! $HOME/.cargo/registry/
+//!
+//!     # For each registry that cargo knows about (keyed by hostname + hash)
+//!     # there is a folder which is the checked out version of the index for
+//!     # the registry in this location. Note that this is done so cargo can
+//!     # support multiple registries simultaneously
+//!     index/
+//!         registry1-<hash>/
+//!         registry2-<hash>/
+//!         ...
+//!
+//!     # This folder is a cache for all downloaded tarballs from a registry.
+//!     # Once downloaded and verified, a tarball never changes.
+//!     cache/
+//!         registry1-<hash>/<pkg>-<version>.tar.gz
+//!         ...
+//!
+//!     # Location in which all tarballs are unpacked. Each tarball is known to
+//!     # be frozen after downloading, so transitively this folder is also
+//!     # frozen once its unpacked (it's never unpacked again)
+//!     src/
+//!         registry1-<hash>/<pkg>-<version>/...
+//!         ...
+//! ```
+
 use std::io::{mod, fs, File};
 use std::io::fs::PathExtensions;
 use std::collections::HashMap;
@@ -28,11 +188,18 @@ pub struct RegistrySource<'a, 'b:'a> {
     handle: Option<http::Handle>,
     sources: Vec<PathSource>,
     hashes: HashMap<(String, String), String>, // (name, vers) => cksum
+    cache: HashMap<String, Vec<(Summary, bool)>>,
 }
 
 #[deriving(Decodable)]
 pub struct RegistryConfig {
+    /// Download endpoint for all crates. This will be appended with
+    /// `/<crate>/<version>/download` and then will be hit with an HTTP GET
+    /// request to download the tarball for a crate.
     pub dl: String,
+
+    /// API endpoint for the registry. This is what's actually hit to perform
+    /// operations like yanks, owner modifications, publish new crates, etc.
     pub api: String,
 }
 
@@ -71,6 +238,7 @@ impl<'a, 'b> RegistrySource<'a, 'b> {
             handle: None,
             sources: Vec::new(),
             hashes: HashMap::new(),
+            cache: HashMap::new(),
         }
     }
 
@@ -185,6 +353,39 @@ impl<'a, 'b> RegistrySource<'a, 'b> {
         Ok(dst)
     }
 
+    /// Parse the on-disk metadata for the package provided
+    fn summaries(&mut self, name: &str) -> CargoResult<&Vec<(Summary, bool)>> {
+        if self.cache.contains_key_equiv(&name) {
+            return Ok(self.cache.find_equiv(&name).unwrap());
+        }
+        // see module comment for why this is structured the way it is
+        let path = self.checkout_path.clone();
+        let path = match name.len() {
+            1 => path.join("1").join(name),
+            2 => path.join("2").join(name),
+            3 => path.join("3").join(name.slice_to(1)).join(name),
+            _ => path.join(name.slice(0, 2))
+                     .join(name.slice(2, 4))
+                     .join(name),
+        };
+        let summaries = match File::open(&path) {
+            Ok(mut f) => {
+                let contents = try!(f.read_to_string());
+                let ret: CargoResult<Vec<(Summary, bool)>>;
+                ret = contents.as_slice().lines().filter(|l| l.trim().len() > 0)
+                              .map(|l| self.parse_registry_package(l))
+                              .collect();
+                try!(ret.chain_error(|| {
+                    internal(format!("Failed to parse registry's information \
+                                      for: {}", name))
+                }))
+            }
+            Err(..) => Vec::new(),
+        };
+        self.cache.insert(name.to_string(), summaries);
+        Ok(self.cache.find_equiv(&name).unwrap())
+    }
+
     /// Parse a line from the registry's index file into a Summary for a
     /// package.
     ///
@@ -223,32 +424,10 @@ impl<'a, 'b> RegistrySource<'a, 'b> {
 
 impl<'a, 'b> Registry for RegistrySource<'a, 'b> {
     fn query(&mut self, dep: &Dependency) -> CargoResult<Vec<Summary>> {
-        let name = dep.get_name();
-        let path = self.checkout_path.clone();
-        let path = match name.len() {
-            1 => path.join("1").join(name),
-            2 => path.join("2").join(name),
-            3 => path.join("3").join(name.slice_to(1)).join(name),
-            _ => path.join(name.slice(0, 2))
-                     .join(name.slice(2, 4))
-                     .join(name),
-        };
-        let contents = match File::open(&path) {
-            Ok(mut f) => try!(f.read_to_string()),
-            Err(..) => return Ok(Vec::new()),
-        };
-
-        let ret: CargoResult<Vec<(Summary, bool)>>;
-        ret = contents.as_slice().lines().filter(|l| l.trim().len() > 0)
-                      .map(|l| self.parse_registry_package(l))
-                      .collect();
-        let summaries = try!(ret.chain_error(|| {
-            internal(format!("Failed to parse registry's information for: {}",
-                             dep.get_name()))
-        }));
-        let mut summaries = summaries.into_iter().filter(|&(_, yanked)| {
+        let summaries = try!(self.summaries(dep.get_name()));
+        let mut summaries = summaries.iter().filter(|&&(_, yanked)| {
             dep.get_source_id().get_precise().is_some() || !yanked
-        }).map(|(summary, _)| summary).collect::<Vec<_>>();
+        }).map(|&(ref s, _)| s.clone()).collect::<Vec<_>>();
         summaries.query(dep)
     }
 }