-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DCJ-575] Dataset obtained for ingest flight needn't have relationships or assets #1754
Conversation
These are obtained via costly database calls.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes would be good fits for a @ParameterizedTest
but we haven't yet converted our @EmbeddedDatabaseTest
s to JUnit 5 so I'm designating that improvement OOS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the targeted nature of these changes. Tests look reasonable 👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have comments, but it looks OK and I think it's good to get a fix out sooner vs addressing all comments.
Dataset dataset = null; | ||
try { | ||
if (summary != null) { | ||
summary.storage(storageResourceDao.getStorageResourcesByDatasetId(summary.getId())); | ||
dataset = new Dataset(summary); | ||
dataset.tables(tableDao.retrieveTables(dataset.getId())); | ||
relationshipDao.retrieve(dataset); | ||
assetDao.retrieve(dataset); | ||
if (retrieveRelationship) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering why these weren't joins in the first place instead of separate db calls. I also wonder how many other times we're making multiple db calls in a case where a single query could be used.
But I agree that this is probably the simplest fix for this problem right now
* @param id in UUID format | ||
* @return a Dataset object | ||
*/ | ||
public Dataset retrieveForIngest(UUID id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have a way to get a dataset without relationships, could we put this dataset in the flight map? It looked like the problem with serialization before was due to the relationship structure, due its recursive definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can. There were concerns raised in Datahood Design Discussions that loading the dataset into the flight map could be problematic if the dataset changes between its storage in the map and its usage later on in the flight… but we could guard against that by loading it into the working map after locking the dataset.
I thought this would be worth pursuing if / when we modify the dataset types as discussed below.
@@ -483,6 +483,11 @@ public boolean delete(UUID id) { | |||
return rowsAffected > 0; | |||
} | |||
|
|||
public Dataset retrieve(UUID id, boolean retrieveRelationship, boolean retrieveAsset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a hazard to have a single "dataset" object whose contents may be initialized differently depending on how it's created, as we don't keep track of which type of initialization was used.
A stricter approach would either store the retrieval flags in the dataset and then throw if an invalid getter was called, or (even better) having different dataset types depending on which fields are present, so invalid access would be known at compile time. Both of these are out of scope for this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I could see this biting us down the line while also seeing the current approach as good for now. I'd like to talk with you more about alternative approaches here -- I had also thought about different dataset types but would appreciate a thought partner to solidify what that would look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with many of @pshapiro4broad 's suggestions but I think they're out of scope for this PR. In terms of a much more narrow scoped PR I think this one is in good shape!
Thanks all for the reviews and feedback -- @pshapiro4broad , @fboulnois , @rushtong . I've filed this follow-on ticket for future work and will merge this PR now: https://broadworkbench.atlassian.net/browse/DCJ-600 |
Jira ticket: https://broadworkbench.atlassian.net/browse/DCJ-575
Addresses
By a long shot, the highest drivers of database load during this Prod incident were queries to construct dataset assets and relationships, called as part of constructing full dataset objects from the DB. We deduced that the high number of calls happened while facilitating a high number (300+) of concurrent ingest flights. The size of the ingests had no bearing on this load.
Dataset assets and relationships are not needed to facilitate ingestion, so that load on the database is unnecessary even before you consider whether the queries themselves could be optimized.
As we continue to see DB CPU utilization spikes up to 100% despite doubling our Prod DB's available CPU, we want to pursue an attainable optimization ASAP. Further iteration may be explored later, but my priority with this PR is to get us more breathing room to prioritize such work thoughtfully.
Summary of changes
IngestUtils.getDataset(FlightContext, DatasetService)
still constructs the latest version of the Dataset from the DB, but now avoids the costly DB calls to construct relationships and assets.Testing Strategy
Expanded unit tests.
Examined Query Insights for integration test runs (which exercise ingestion, though not at high levels of concurrency) and observed reduced DB CPU utilization attributed to relationship and asset construction.
Integration tests off of this PR:
![Screenshot 2024-08-01 at 3 21 35 PM](https://private-user-images.githubusercontent.com/79769153/356759165-ce5491bb-e6b2-4ca8-9112-1f6e2b133e1c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNTY4NzgsIm5iZiI6MTczOTE1NjU3OCwicGF0aCI6Ii83OTc2OTE1My8zNTY3NTkxNjUtY2U1NDkxYmItZTZiMi00Y2E4LTkxMTItMWY2ZTJiMTMzZTFjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDAzMDI1OFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTcwZTA4ZDk4YTkwYWE3ODRhM2JkYWJhNzI5NWE0OWMyZDBlYjYwYzZkM2ZiMjliOGJhOWY4MGIyNDllYzE5MjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.UPZgCWi2hpgGnSi1mKLTZSOuNLSMn8Ymsruly83MYWw)
Integration tests off of
![Screenshot 2024-08-01 at 3 19 58 PM](https://private-user-images.githubusercontent.com/79769153/356759162-ee921020-572e-424a-8e85-f13cb0dd5c55.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNTY4NzgsIm5iZiI6MTczOTE1NjU3OCwicGF0aCI6Ii83OTc2OTE1My8zNTY3NTkxNjItZWU5MjEwMjAtNTcyZS00MjRhLThlODUtZjEzY2IwZGQ1YzU1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDAzMDI1OFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWFjODBjMjQ1N2RhOWU3NDJmOGE4YjI0OTg3MDAwNTc0YWUxYTEwMjE3MjA2YTI3MGNiY2NlMWUzMzNlMzE3MmUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.W7Sw76axks6FfaHNMs8zrQZ2oDWKSqbam9JqJvmipZE)
develop
branch (control):