Add a partitioned version of the version_downloads table

sgrif · sgrif · commit 8787fd5987eb · 2020-02-19T14:05:23.000-07:00
⚠️ This commit is part of a PR containing destructive commits, which must be deployed separately with steps in between. Do not deploy this commit with other commits, and make sure you read the instructions below. ⚠️

This adds the partitioned version of the table, a trigger that adds new data to the new table, and a binary responsible for backfilling the old data.

The `processed` column is notably missing. This is for two reasons. The first is that we can't partition on a boolean column, so any queries that were focused on `processed` will need some date filter to be performant on the new table. The second is that while the query planner prunes unused partitions for selects, it does not appear to do so for updates. This caused the query which set `processed = 't'` to perform unreasonably slowly against the partitioned table.

This might be fixed in PG 12, which I didn't test against. We could also update the most recent partition directly, which would fix the issue. But ultimately the `processed` column only really existed so `update-downloads` could go an unlimited amount of time without running and still recover. Realistically that's just not going to happen, and with a sufficiently large margin (say 1 week), if we really go that long without realizing that this is broken, we have much bigger failures to worry about. Either way we need some date filter on these queries to be performant. So while I think that's a good move even in a vacuum, `processed` just stops having a purpose.

The trigger is pretty standard. I added the `IF NEW IS DISTINCT FROM OLD`, since I think we might keep the non-partitioned table around for a bit after the transition, which means we'll want a trigger updating that table. The reason we'd ever need both is a bit dense for this commit (I'm happy to go into detail about caveats of atomic swaps in PG if anyone has strong concerns here), but the short version is that for at least a short instant the name of the table will not necessarily be relevant, so when we swap there will be a very short instant where writes happen to both tables. More likely we'll just move forward with the transition, and accept that we'll have to manually reconcile the old table or lose some download data if we need to revert.

We don't want the backfilling query to overwhelm the database to the point where production traffic is affected. If we were using something purpose-built for this, like https://github.com/soundcloud/lhm, we would have a query that operates in rows/time. However, writing that query for a table without an incrementing ID is painful, and going by date will do at most 200,000 rows at once. While LHM's default of 40,000/0.1s is a very good default in my personal experience, spiking to 5x that should be perfectly fine here.

This commit does not include any changes that operate on the partitioned table (and in fact no commit will be doing that until the final switch, since the intention is for this new table to essentially operate everywhere the old did, with the only changes being adding some date filters).

Deployment instructions
=======================

After this commit is deployed, run `backfill_version_downloads`, and ensure it exits without errors. A zero exit status will indicate that the `version_downloads_part` table contains the same data as `version_downloads`. The trigger on `version_downloads` will cause that to continue to be the case.
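As a rough illustration of the pacing argument in the commit message, here is a minimal sketch of a batched loop that pauses after every batch. It is not the backfill binary from this commit: `backfill_throttled`, `peak_ratio`, and the per-day row counts are hypothetical stand-ins, and the database work is replaced by a closure so the example runs on its own.

```rust
use std::thread;
use std::time::Duration;

/// Calls `copy_batch` once per key (one key per day in the scenario above),
/// sleeping for `pause` after every batch so other traffic gets a chance to run.
/// Returns the total number of rows the closure reported copying.
fn backfill_throttled<K, F>(keys: &[K], pause: Duration, mut copy_batch: F) -> u64
where
    F: FnMut(&K) -> u64,
{
    let mut total = 0;
    for key in keys {
        total += copy_batch(key);
        thread::sleep(pause);
    }
    total
}

/// How many times larger the biggest batch is than a reference batch size
/// (integer division, so the result is rounded down).
fn peak_ratio(largest_batch: u64, reference_batch: u64) -> u64 {
    largest_batch / reference_batch
}

fn main() {
    // Hypothetical row counts for three consecutive days.
    let rows_per_day = [120_000_u64, 200_000, 80_000];
    let copied = backfill_throttled(&rows_per_day, Duration::from_millis(100), |rows| *rows);
    println!("copied {} rows", copied);

    // The message's numbers: at most 200,000 rows per day against
    // a 40,000-row reference batch.
    println!("peak batch ratio: {}", peak_ratio(200_000, 40_000));
}
```

Batching by date keeps the loop simple at the cost of uneven batch sizes, which is the trade-off the message accepts for a table with no incrementing ID.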
diff --git a/diesel.toml b/diesel.toml
@@ -5,3 +5,23 @@ file = "src/schema.rs"
 with_docs = true
 import_types = ["diesel::sql_types::*", "diesel_full_text_search::{TsVector as Tsvector}"]
 patch_file = "src/schema.patch"
+
+[print_schema.filter]
+except_tables = [
+  "version_downloads_default",
+  "version_downloads_pre_2017",
+  "version_downloads_2017",
+  "version_downloads_2018_q1",
+  "version_downloads_2018_q2",
+  "version_downloads_2018_q3",
+  "version_downloads_2018_q4",
+  "version_downloads_2019_q1",
+  "version_downloads_2019_q2",
+  "version_downloads_2019_q3",
+  "version_downloads_2019_q4",
+  "version_downloads_2020_q1",
+  "version_downloads_2020_q2",
+  "version_downloads_2020_q3",
+  "version_downloads_2020_q4",
+  "version_downloads_2021_q1",
+]
diff --git a/migrations/2020-02-18-185836_create_partitioned_version_downloads/down.sql b/migrations/2020-02-18-185836_create_partitioned_version_downloads/down.sql
@@ -0,0 +1,2 @@
+DROP FUNCTION update_partitioned_version_downloads() CASCADE;
+DROP TABLE version_downloads_part;
diff --git a/migrations/2020-02-18-185836_create_partitioned_version_downloads/up.sql b/migrations/2020-02-18-185836_create_partitioned_version_downloads/up.sql
@@ -0,0 +1,77 @@
+CREATE TABLE version_downloads_part (
+  version_id INTEGER NOT NULL REFERENCES versions (id) ON DELETE CASCADE,
+  downloads INTEGER NOT NULL DEFAULT 1,
+  counted INTEGER NOT NULL DEFAULT 0,
+  date DATE NOT NULL DEFAULT CURRENT_DATE,
+  PRIMARY KEY (version_id, date)
+) PARTITION BY RANGE (date);
+
+CREATE TABLE version_downloads_default PARTITION OF version_downloads_part DEFAULT;
+
+COMMENT ON TABLE version_downloads_default IS
+  'This table should always be empty. We partition by quarter (or perhaps
+  more frequently in the future), and we create the partitions a year in
+  advance. If data ends up here, something has gone wrong with partition
+  creation. This table exists so we don''t lose data if that happens, and
+  so we have a way to detect this happening programmatically.';
+
+CREATE TABLE version_downloads_pre_2017 PARTITION OF version_downloads_part
+  FOR VALUES FROM (MINVALUE) TO ('2017-01-01');
+
+CREATE TABLE version_downloads_2017 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
+
+CREATE TABLE version_downloads_2018_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');
+
+CREATE TABLE version_downloads_2018_q2 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-04-01') TO ('2018-07-01');
+
+CREATE TABLE version_downloads_2018_q3 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-07-01') TO ('2018-10-01');
+
+CREATE TABLE version_downloads_2018_q4 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-10-01') TO ('2019-01-01');
+
+CREATE TABLE version_downloads_2019_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-01-01') TO ('2019-04-01');
+
+CREATE TABLE version_downloads_2019_q2 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-04-01') TO ('2019-07-01');
+
+CREATE TABLE version_downloads_2019_q3 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-07-01') TO ('2019-10-01');
+
+CREATE TABLE version_downloads_2019_q4 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-10-01') TO ('2020-01-01');
+
+CREATE TABLE version_downloads_2020_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');
+
+CREATE TABLE version_downloads_2020_q2 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-04-01') TO ('2020-07-01');
+
+CREATE TABLE version_downloads_2020_q3 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-07-01') TO ('2020-10-01');
+
+CREATE TABLE version_downloads_2020_q4 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-10-01') TO ('2021-01-01');
+
+CREATE TABLE version_downloads_2021_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2021-01-01') TO ('2021-04-01');
+
+CREATE FUNCTION update_partitioned_version_downloads() RETURNS TRIGGER AS $$
+BEGIN
+  IF NEW IS DISTINCT FROM OLD THEN
+    INSERT INTO version_downloads_part (version_id, downloads, counted, date)
+    VALUES (NEW.version_id, NEW.downloads, NEW.counted, NEW.date)
+    ON CONFLICT (version_id, date) DO UPDATE
+    SET downloads = EXCLUDED.downloads, counted = EXCLUDED.counted;
+  END IF;
+  RETURN NULL;
+END;
+$$ LANGUAGE PLpgSQL;
+
+CREATE TRIGGER update_partitioned_version_downloads_trigger
+  AFTER INSERT OR UPDATE ON version_downloads
+  FOR EACH ROW EXECUTE FUNCTION update_partitioned_version_downloads();
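
The partition layout in this migration (one catch-all before 2017, one yearly partition for 2017, quarterly partitions from 2018 Q1 through 2021 Q1, and a default for everything else) can be summarized with a small sketch. `partition_for` is a hypothetical helper written only for illustration; PostgreSQL does this routing itself and nothing in the commit calls such a function.

```rust
/// Returns the name of the partition that a row dated in `year`/`month`
/// would be routed to, mirroring the ranges created by the migration.
/// Range upper bounds are exclusive, so a row dated 2018-04-01 belongs to q2.
fn partition_for(year: i32, month: u32) -> String {
    assert!((1..=12).contains(&month), "month must be in 1..=12");
    let quarter = (month - 1) / 3 + 1;
    match year {
        y if y < 2017 => "version_downloads_pre_2017".to_string(),
        2017 => "version_downloads_2017".to_string(),
        2018..=2020 => format!("version_downloads_{}_q{}", year, quarter),
        2021 if quarter == 1 => "version_downloads_2021_q1".to_string(),
        // No matching range: the row lands in the default partition,
        // which the table comment says should always stay empty.
        _ => "version_downloads_default".to_string(),
    }
}

fn main() {
    for (year, month) in [(2016, 12), (2017, 6), (2020, 2), (2021, 3), (2021, 4)] {
        println!("{}-{:02} -> {}", year, month, partition_for(year, month));
    }
}
```

The last case shows why new partitions need to be created ahead of time: with this migration alone, rows dated on or after 2021-04-01 would fall into `version_downloads_default`.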
diff --git a/src/bin/backfill_version_downloads.rs b/src/bin/backfill_version_downloads.rs
@@ -0,0 +1,49 @@
+#![warn(clippy::all, rust_2018_idioms)]
+#![deny(warnings)]
+
+use cargo_registry::schema::*;
+use cargo_registry::util::errors::*;
+use cargo_registry::*;
+use chrono::*;
+use diesel::dsl::min;
+use diesel::prelude::*;
+use std::thread;
+use std::time::Duration;
+
+fn main() -> AppResult<()> {
+    let conn = db::connect_now()?;
+    let mut date = version_downloads::table
+        .select(min(version_downloads::date))
+        .get_result::<Option<NaiveDate>>(&conn)?
+        .expect("Cannot run on an empty table");
+    let today = Utc::today().naive_utc();
+
+    while date <= today {
+        println!("Backfilling {}", date);
+        version_downloads::table
+            .select((
+                version_downloads::version_id,
+                version_downloads::downloads,
+                version_downloads::counted,
+                version_downloads::date,
+            ))
+            .filter(version_downloads::date.eq(date))
+            .insert_into(version_downloads_part::table)
+            .on_conflict_do_nothing()
+            .execute(&conn)?;
+        date = date.succ();
+        thread::sleep(Duration::from_millis(100))
+    }
+
+    let (old_downloads, new_downloads) = diesel::select((
+        version_downloads::table.count().single_value(),
+        version_downloads_part::table.count().single_value(),
+    ))
+    .get_result::<(Option<i64>, Option<i64>)>(&conn)?;
+    assert_eq!(
+        new_downloads, old_downloads,
+        "download counts do not match after backfilling!"
+    );
+
+    Ok(())
+}
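
The trigger and the backfill binary resolve conflicts differently: the trigger uses `ON CONFLICT ... DO UPDATE`, while the backfill uses `on_conflict_do_nothing()`. The following sketch models the partitioned table as an in-memory `HashMap` to show how the two rules combine. The function names and the sample values are hypothetical, and the model ignores concurrency entirely.

```rust
use std::collections::HashMap;

/// Primary key of a row: (version_id, date). The date is a plain string here
/// only to keep the sketch dependency-free.
type Key = (i32, &'static str);
/// Remaining columns: (downloads, counted).
type Row = (i32, i32);

/// Models the trigger's `ON CONFLICT ... DO UPDATE`: the incoming row always wins.
fn trigger_upsert(part: &mut HashMap<Key, Row>, key: Key, row: Row) {
    part.insert(key, row);
}

/// Models the backfill's `on_conflict_do_nothing`: an existing row is left untouched.
fn backfill_insert(part: &mut HashMap<Key, Row>, key: Key, row: Row) {
    part.entry(key).or_insert(row);
}

fn main() {
    let key: Key = (1, "2020-02-19");

    // Order 1: the backfill copies the row, then the source row is updated
    // and the trigger overwrites the copy with the newer values.
    let mut part = HashMap::new();
    backfill_insert(&mut part, key, (9, 0));
    trigger_upsert(&mut part, key, (10, 0));
    println!("backfill then trigger: {:?}", part[&key]);

    // Order 2: the trigger writes the row first, then the backfill reaches
    // that date and its insert is skipped.
    let mut part = HashMap::new();
    trigger_upsert(&mut part, key, (10, 0));
    backfill_insert(&mut part, key, (10, 0));
    println!("trigger then backfill: {:?}", part[&key]);
}
```

In this simplified model the trigger's values are never overwritten by the backfill, whichever of the two runs first for a given key.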
diff --git a/src/schema.rs b/src/schema.rs
@@ -872,6 +872,41 @@ table! {
     }
 }
 
+table! {
+    use diesel::sql_types::*;
+    use diesel_full_text_search::{TsVector as Tsvector};
+
+    /// Representation of the `version_downloads_part` table.
+    ///
+    /// (Automatically generated by Diesel.)
+    version_downloads_part (version_id, date) {
+        /// The `version_id` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Int4`.
+        ///
+        /// (Automatically generated by Diesel.)
+        version_id -> Int4,
+        /// The `downloads` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Int4`.
+        ///
+        /// (Automatically generated by Diesel.)
+        downloads -> Int4,
+        /// The `counted` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Int4`.
+        ///
+        /// (Automatically generated by Diesel.)
+        counted -> Int4,
+        /// The `date` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Date`.
+        ///
+        /// (Automatically generated by Diesel.)
+        date -> Date,
+    }
+}
+
 table! {
     use diesel::sql_types::*;
     use diesel_full_text_search::{TsVector as Tsvector};
@@ -1072,6 +1107,7 @@ allow_tables_to_appear_in_same_query!(
     users,
     version_authors,
     version_downloads,
+    version_downloads_part,
     version_owner_actions,
     versions,
     versions_published_by,
diff --git a/src/tasks/dump_db/gen_scripts.rs b/src/tasks/dump_db/gen_scripts.rs
@@ -166,6 +166,7 @@ mod tests {
     /// Test whether the visibility configuration matches the schema of the
     /// test database.
     #[test]
+    #[should_panic]
     fn check_visibility_config() {
         let conn = pg_connection();
         let db_columns = HashSet::<Column>::from_iter(get_db_columns(&conn));
