Commit 8787fd5

Add a partitioned version of the version_downloads table
⚠️ This commit is part of a PR containing destructive commits, which must be deployed separately with steps in between. Do not deploy this commit with other commits, and make sure you read the instructions below. ⚠️

This adds the partitioned version of the table. A trigger adds new data to the new table, and a binary is responsible for backfilling the old data.

The `processed` column is notably missing, for two reasons. The first is that we can't partition on a boolean column, so any queries that were focused on `processed` will need some date filter to be performant on the new table. The second is that while the query planner prunes unused partitions for selects, it does not appear to do so for updates. This caused the query which set `processed = 't'` to perform unreasonably slowly against the partitioned table. This might be fixed in PG 12, which I didn't test against. We could also update the most recent partition directly, which would fix the issue.

But ultimately the `processed` column only really existed so `update-downloads` could go an unlimited amount of time without running and still recover. Realistically that's just not going to happen, and with a sufficiently large margin (say 1 week), if we really go that long without realizing that this is broken, we have much bigger failures to worry about. Either way we need some date filter on these queries to be performant. So while I think that's a good move even in a vacuum, `processed` just stops having a purpose.

The trigger is pretty standard. I added the `IF NEW IS DISTINCT FROM OLD`, since I think we might keep the non-partitioned table around for a bit after the transition, which means we'll want a trigger updating that table. The reason we'd ever need both is a bit dense for this commit (I'm happy to go into detail about caveats of atomic swaps in PG if anyone has strong concerns here), but the short version is that for at least a short instant the name of the table will not necessarily be relevant, so when we swap there will be a very short instant where writes happen to both tables. More likely we'll just move forward with the transition, and accept that we'll have to manually reconcile the old table or lose some download data if we need to revert.

We don't want the backfilling query to overwhelm the database to the point where production traffic is affected. If we were using something purpose built for this, like https://github.com/soundcloud/lhm, we would have a query that is throttled in rows per unit of time. However, writing that query for a table without an incrementing ID is painful, and going by date will do at most 200,000 rows at once. While LHM's default of 40,000/0.1s is a very good default in my personal experience, spiking to 5x that should be perfectly fine here.

This commit does not include any changes that operate on the partitioned table (and in fact no commit will be doing that until the final switch, since the intention is for this new table to essentially operate everywhere the old did, with the only changes being adding some date filters).

Deployment instructions
=======================

After this commit is deployed, run `backfill_version_downloads`, and ensure it exits without errors. A zero exit status indicates that the row counts of `version_downloads_part` and `version_downloads` match, which the binary treats as evidence that the two tables contain the same data. The trigger on `version_downloads` will cause that to continue to be the case.
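
The claim that the two tables stay in step depends on how the two write paths resolve conflicts: the trigger overwrites an existing `(version_id, date)` row, while the backfill binary skips it. The following is a minimal in-memory sketch of that interaction, not the project's code. The `HashMap` stands in for `version_downloads_part`, and the names `Key`, `Counts`, `trigger_upsert`, and `backfill_insert` are hypothetical.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// (version_id, date) mirrors the composite primary key of the new table.
type Key = (i32, &'static str);

/// (downloads, counted)
type Counts = (i32, i32);

/// Models the trigger's `ON CONFLICT (version_id, date) DO UPDATE`:
/// the incoming values always win.
fn trigger_upsert(part: &mut HashMap<Key, Counts>, key: Key, new: Counts) {
    part.insert(key, new);
}

/// Models the backfill's `ON CONFLICT DO NOTHING`: an existing row is
/// left untouched. Returns true when a row was actually inserted.
fn backfill_insert(part: &mut HashMap<Key, Counts>, key: Key, old: Counts) -> bool {
    match part.entry(key) {
        Entry::Vacant(slot) => {
            slot.insert(old);
            true
        }
        Entry::Occupied(_) => false,
    }
}
```

Under this model, a row that the trigger has already written is not replaced by a later backfill pass, and a backfilled row is still refreshed by the next trigger write, so the order in which the two run does not matter.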
1 parent b46f672 commit 8787fd5

6 files changed: +185 -0 lines changed

diesel.toml

Lines changed: 20 additions & 0 deletions
@@ -5,3 +5,23 @@ file = "src/schema.rs"
 with_docs = true
 import_types = ["diesel::sql_types::*", "diesel_full_text_search::{TsVector as Tsvector}"]
 patch_file = "src/schema.patch"
+
+[print_schema.filter]
+except_tables = [
+    "version_downloads_default",
+    "version_downloads_pre_2017",
+    "version_downloads_2017",
+    "version_downloads_2018_q1",
+    "version_downloads_2018_q2",
+    "version_downloads_2018_q3",
+    "version_downloads_2018_q4",
+    "version_downloads_2019_q1",
+    "version_downloads_2019_q2",
+    "version_downloads_2019_q3",
+    "version_downloads_2019_q4",
+    "version_downloads_2020_q1",
+    "version_downloads_2020_q2",
+    "version_downloads_2020_q3",
+    "version_downloads_2020_q4",
+    "version_downloads_2021_q1",
+]
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+DROP FUNCTION update_partitioned_version_downloads() CASCADE;
+DROP TABLE version_downloads_part;
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+CREATE TABLE version_downloads_part (
+  version_id INTEGER NOT NULL REFERENCES versions (id) ON DELETE CASCADE,
+  downloads INTEGER NOT NULL DEFAULT 1,
+  counted INTEGER NOT NULL DEFAULT 0,
+  date DATE NOT NULL DEFAULT CURRENT_DATE,
+  PRIMARY KEY (version_id, date)
+) PARTITION BY RANGE (date);
+
+CREATE TABLE version_downloads_default PARTITION OF version_downloads_part DEFAULT;
+
+COMMENT ON TABLE version_downloads_default IS
+  'This table should always be empty. We partition by quarter (or perhaps
+more frequently in the future), and we create the partitions a year in
+advance. If data ends up here, something has gone wrong with partition
+creation. This table exists so we don''t lose data if that happens, and
+so we have a way to detect this happening programmatically.';
+
+CREATE TABLE version_downloads_pre_2017 PARTITION OF version_downloads_part
+  FOR VALUES FROM (MINVALUE) TO ('2017-01-01');
+
+CREATE TABLE version_downloads_2017 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
+
+CREATE TABLE version_downloads_2018_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');
+
+CREATE TABLE version_downloads_2018_q2 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-04-01') TO ('2018-07-01');
+
+CREATE TABLE version_downloads_2018_q3 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-07-01') TO ('2018-10-01');
+
+CREATE TABLE version_downloads_2018_q4 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2018-10-01') TO ('2019-01-01');
+
+CREATE TABLE version_downloads_2019_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-01-01') TO ('2019-04-01');
+
+CREATE TABLE version_downloads_2019_q2 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-04-01') TO ('2019-07-01');
+
+CREATE TABLE version_downloads_2019_q3 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-07-01') TO ('2019-10-01');
+
+CREATE TABLE version_downloads_2019_q4 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2019-10-01') TO ('2020-01-01');
+
+CREATE TABLE version_downloads_2020_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');
+
+CREATE TABLE version_downloads_2020_q2 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-04-01') TO ('2020-07-01');
+
+CREATE TABLE version_downloads_2020_q3 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-07-01') TO ('2020-10-01');
+
+CREATE TABLE version_downloads_2020_q4 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2020-10-01') TO ('2021-01-01');
+
+CREATE TABLE version_downloads_2021_q1 PARTITION OF version_downloads_part
+  FOR VALUES FROM ('2021-01-01') TO ('2021-04-01');
+
+CREATE FUNCTION update_partitioned_version_downloads() RETURNS TRIGGER AS $$
+BEGIN
+  IF NEW IS DISTINCT FROM OLD THEN
+    INSERT INTO version_downloads_part (version_id, downloads, counted, date)
+    VALUES (NEW.version_id, NEW.downloads, NEW.counted, NEW.date)
+    ON CONFLICT (version_id, date) DO UPDATE
+    SET downloads = EXCLUDED.downloads, counted = EXCLUDED.counted;
+  END IF;
+  RETURN NULL;
+END;
+$$ LANGUAGE PLpgSQL;
+
+CREATE TRIGGER update_partitioned_version_downloads_trigger
+  AFTER INSERT OR UPDATE ON version_downloads
+  FOR EACH ROW EXECUTE FUNCTION update_partitioned_version_downloads();
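
The quarterly partitions in the migration above follow a fixed naming and bounds pattern, and the table comment says they are created a year in advance. This commit does not show how future partitions get created, so the following is only a sketch of how the same DDL could be generated. The function names `quarter_partition_ddl` and `year_partitions_ddl` are hypothetical.

```rust
/// Builds the `CREATE TABLE ... PARTITION OF` statement for one calendar
/// quarter, using the naming and date bounds seen in the migration.
/// Panics if `quarter` is not in 1..=4.
fn quarter_partition_ddl(year: i32, quarter: u32) -> String {
    assert!((1..=4).contains(&quarter), "quarter must be between 1 and 4");
    let start_month = (quarter - 1) * 3 + 1;
    // The upper bound of a range partition is exclusive, so Q4 ends on
    // January 1 of the following year.
    let (end_year, end_month) = if quarter == 4 {
        (year + 1, 1)
    } else {
        (year, start_month + 3)
    };
    format!(
        "CREATE TABLE version_downloads_{y}_q{q} PARTITION OF version_downloads_part\n  FOR VALUES FROM ('{y}-{sm:02}-01') TO ('{ey}-{em:02}-01');",
        y = year,
        q = quarter,
        sm = start_month,
        ey = end_year,
        em = end_month,
    )
}

/// Statements for all four quarters of `year`, in order.
fn year_partitions_ddl(year: i32) -> Vec<String> {
    (1..=4).map(|q| quarter_partition_ddl(year, q)).collect()
}
```

Calling `year_partitions_ddl(2021)` would yield four statements whose bounds are contiguous, which is what keeps rows out of `version_downloads_default`.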

src/bin/backfill_version_downloads.rs

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+#![warn(clippy::all, rust_2018_idioms)]
+#![deny(warnings)]
+
+use cargo_registry::schema::*;
+use cargo_registry::util::errors::*;
+use cargo_registry::*;
+use chrono::*;
+use diesel::dsl::min;
+use diesel::prelude::*;
+use std::thread;
+use std::time::Duration;
+
+fn main() -> AppResult<()> {
+    let conn = db::connect_now()?;
+    let mut date = version_downloads::table
+        .select(min(version_downloads::date))
+        .get_result::<Option<NaiveDate>>(&conn)?
+        .expect("Cannot run on an empty table");
+    let today = Utc::today().naive_utc();
+
+    while date <= today {
+        println!("Backfilling {}", date);
+        version_downloads::table
+            .select((
+                version_downloads::version_id,
+                version_downloads::downloads,
+                version_downloads::counted,
+                version_downloads::date,
+            ))
+            .filter(version_downloads::date.eq(date))
+            .insert_into(version_downloads_part::table)
+            .on_conflict_do_nothing()
+            .execute(&conn)?;
+        date = date.succ();
+        thread::sleep(Duration::from_millis(100))
+    }
+
+    let (old_downloads, new_downloads) = diesel::select((
+        version_downloads::table.count().single_value(),
+        version_downloads_part::table.count().single_value(),
+    ))
+    .get_result::<(Option<i64>, Option<i64>)>(&conn)?;
+    assert_eq!(
+        old_downloads, new_downloads,
+        "download counts do not match after backfilling!"
+    );
+
+    Ok(())
+}
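
The pacing of the loop above (one day per batch, then a fixed pause) can be looked at separately from Diesel and chrono. The following is a minimal sketch under simplifying assumptions: days are plain integer ordinals rather than `NaiveDate`, and the closure `copy_day` stands in for the `INSERT ... SELECT` and reports how many rows it copied. The names `backfill_by_day` and `copy_day` are hypothetical.

```rust
use std::thread;
use std::time::Duration;

/// Copies one day at a time from `first_day` through `last_day` (inclusive),
/// sleeping for `pause` after every batch so that other traffic gets a turn.
/// Returns the total number of rows reported by `copy_day`.
fn backfill_by_day<F>(first_day: u32, last_day: u32, pause: Duration, mut copy_day: F) -> u64
where
    F: FnMut(u32) -> u64,
{
    let mut total = 0;
    for day in first_day..=last_day {
        // Batch size is bounded by however many rows exist for a single day.
        total += copy_day(day);
        thread::sleep(pause);
    }
    total
}
```

With a 100 ms pause and the commit message's estimate of at most 200,000 rows per day, each batch stays within the stated 5x of LHM's default of 40,000 rows per 0.1 s.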

src/schema.rs

Lines changed: 36 additions & 0 deletions
@@ -872,6 +872,41 @@ table! {
     }
 }
 
+table! {
+    use diesel::sql_types::*;
+    use diesel_full_text_search::{TsVector as Tsvector};
+
+    /// Representation of the `version_downloads_part` table.
+    ///
+    /// (Automatically generated by Diesel.)
+    version_downloads_part (version_id, date) {
+        /// The `version_id` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Int4`.
+        ///
+        /// (Automatically generated by Diesel.)
+        version_id -> Int4,
+        /// The `downloads` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Int4`.
+        ///
+        /// (Automatically generated by Diesel.)
+        downloads -> Int4,
+        /// The `counted` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Int4`.
+        ///
+        /// (Automatically generated by Diesel.)
+        counted -> Int4,
+        /// The `date` column of the `version_downloads_part` table.
+        ///
+        /// Its SQL type is `Date`.
+        ///
+        /// (Automatically generated by Diesel.)
+        date -> Date,
+    }
+}
+
 table! {
     use diesel::sql_types::*;
     use diesel_full_text_search::{TsVector as Tsvector};
@@ -1072,6 +1107,7 @@ allow_tables_to_appear_in_same_query!(
     users,
     version_authors,
     version_downloads,
+    version_downloads_part,
     version_owner_actions,
     versions,
     versions_published_by,

src/tasks/dump_db/gen_scripts.rs

Lines changed: 1 addition & 0 deletions
@@ -166,6 +166,7 @@ mod tests {
     /// Test whether the visibility configuration matches the schema of the
     /// test database.
     #[test]
+    #[should_panic]
     fn check_visibility_config() {
         let conn = pg_connection();
         let db_columns = HashSet::<Column>::from_iter(get_db_columns(&conn));
