-
Notifications
You must be signed in to change notification settings - Fork 4k
release-25.4: workload,roachtest: revive TPCH usage #155740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-25.4
Are you sure you want to change the base?
release-25.4: workload,roachtest: revive TPCH usage #155740
Conversation
|
Thanks for opening a backport. Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate. |
|
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
e924a53 to
7b09b48
Compare
I don't think we need `vectorize` and `default-vectorize` flags anymore (they were introduced long time ago when the vectorized engine had been under active development). Additionally, we forgot to remove the last mention of already removed `distsql` flag. Release note: None
`enable-checks` flag of TPCH workload allows for checking actual query results against the hard-coded set of expected results (which were obtained against the "golden" TPCH backup about 6 years ago). In order to avoid hard-coding thousands of rows, for queries 11 and 16 we allowed only checking the row count while all other queries received exact comparison check. However, in 852a2b3 we broke this logic: namely, during the PR review, the code was refactored in such a way as to include the expected row count for each query into the map, and as a result we stopped doing the exact comparison check (since each query had a number in `numExpectedRowsByQueryNumber` map) and only did the row count check for all queries. This commit fixes that oversight - we now only include queries 11 and 16 into the map, so all others get the exact comparison check. Release note: None
Before the fix in the previous commit, whenever `enable-checks` flag was specified, we only did the row count check (i.e. verified that a TPCH query returned the expected number of rows). I briefly looked into whether we could make the exact comparison check somewhat universal (as opposed to only work against the "golden" TPCH backup (which we can no longer restore)) but decided it's not worth the effort. Given that we've lived with row count check only for many years now, this commit embraces the effect of that bug, so it refactors the `tpch` workload to only do the row count check (for 17 out of 22 queries for which the number of rows returned doesn't seem to change due to a different fixture seed or scale factor). As a small win, this check is now done unconditionally, so `enable-checks` flag is now removed. This patch will allow us to switch to using `import fixtures` command instead of trying to restore the "golden" TPCH backup. Release note: None
When we implemented tpch dataload, we copied the logic to add FKs NOT VALID (in order to speed up the test) over from tpcc. I don't think we are as concerned about the performance of the initial ingestion of tpch dataset, so this commit makes these ALTER stmts add FKs validated right away. In my local reproduction when importing Scale Factor 1, it took about 1 minute for the data ingestion and on the order of 40s for FK validation, which seems acceptable. On the plus side, now imported fixtures match the schema that we got with previously restoring the "golden" TPCH backup. Additionally, one of the FKs was commented out due to an error, but we're not hitting that anymore, so it's now also added. Furthermore, 3 spots where we hard-code the TPCH schema have been updated to include validated FKs as well as the commented out one. Release note: None
tpcc, tpch, and movr workloads add foreign keys, and since PostLoad step must be idempotent, they need to ignore errors when a FK already exists. Previously, we used the old error message, and this commit fixes that oversight. Release note: None
For a long time we've been using the "golden" TPCH backup which can no longer be restored (since it was created 5+ years ago). Given the change in how we do "checks" in TPCH workload, we can now switch to using fixtures easily. Many roachtests owned by Queries are affected in addition to the schemachange test (which uses Scale Factor of 10) - it has now been refactored to use the same helper. The helper has been extended to also support secondary tenants. Additionally, I re-generated a new backup with 25.3 version for scale factor of 1 - this is now utilized only by sqlsmith roachtest (which wants to restore into `defaultdb` that doesn't seem to be easy with fixtures since they hard-code the database to be equal to the name of the fixture). tpcc and tpcds backups have also been regenerated. In terms of time difference, restoring the backup takes about 20s whereas importing the fixture with validated FKs (plus collecting table stats) takes about 2m 15s. This seems acceptable. I decided to remove `import-sf10.sql` script since it's effectively a copy of `sf1` and doesn't seem worth maintaining separately. Release note: None
When we added the fixtures generation of TPCH spec, we forgot a couple of values - one for "priorities" and one for "containers". This commit fixes that omission. In particular, missing priority value made us return 4 rows instead of 5 for Q4, and the expectation has been adjusted accordingly. Note that this omission was exposed via a roachtest that uses correctly generated fixtures of SF 100. Release note: None
7b09b48 to
dca5a9f
Compare
Backport 6/6 commits from #155135.
Backport 1/1 commits from #156002.
/cc @cockroachdb/release
This PR contains several commits that revive our usage of TPCH dataset. Most of roachtests are now switched to using
workload fixtures importcommand.tpchworkload itself has been adjusted to no longer hard-code "expected rows" for the "golden" TPCH backup. See each commit for details.Fixes: #153489.
Fixes: #153566.
Fixes: #154173.
Release justification: effectively test-only change.