Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DCJ-575] Dataset obtained for ingest flight needn't have relationships or assets #1754

Merged
merged 4 commits into from
Aug 12, 2024

Conversation

okotsopoulos
Copy link
Contributor

@okotsopoulos okotsopoulos commented Aug 1, 2024

Jira ticket: https://broadworkbench.atlassian.net/browse/DCJ-575

Addresses

By a long shot, the highest drivers of database load during this Prod incident were queries to construct dataset assets and relationships, called as part of constructing full dataset objects from the DB. We deduced that the high number of calls happened while facilitating a high number (300+) of concurrent ingest flights. The size of the ingests had no bearing on this load.

Dataset assets and relationships are not needed to facilitate ingestion, so that load on the database is unnecessary even before you consider whether the queries themselves could be optimized.

As we continue to see DB CPU utilization spikes up to 100% despite doubling our Prod DB's available CPU, we want to pursue an attainable optimization ASAP. Further iteration may be explored later, but my priority with this PR is to get us more breathing room to prioritize such work thoughtfully.

Summary of changes

IngestUtils.getDataset(FlightContext, DatasetService) still constructs the latest version of the Dataset from the DB, but now avoids the costly DB calls to construct relationships and assets.

Testing Strategy

Expanded unit tests.

Examined Query Insights for integration test runs (which exercise ingestion, though not at high levels of concurrency) and observed reduced DB CPU utilization attributed to relationship and asset construction.

Integration tests off of this PR:
Screenshot 2024-08-01 at 3 21 35 PM

Integration tests off of develop branch (control):
Screenshot 2024-08-01 at 3 19 58 PM

@okotsopoulos okotsopoulos changed the title [WIP] Dataset obtained for ingest flight needn't have relationships or assets [DCJ-575] Dataset obtained for ingest flight needn't have relationships or assets Aug 9, 2024
Copy link

sonarqubecloud bot commented Aug 9, 2024

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes would be good fits for a @ParameterizedTest but we haven't yet converted our @EmbeddedDatabaseTests to JUnit 5 so I'm designating that improvement OOS.

@okotsopoulos okotsopoulos marked this pull request as ready for review August 9, 2024 21:44
@okotsopoulos okotsopoulos requested review from a team as code owners August 9, 2024 21:44
@okotsopoulos okotsopoulos requested review from rushtong and fboulnois and removed request for a team August 9, 2024 21:44
Copy link
Contributor

@rushtong rushtong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the targeted nature of these changes. Tests look reasonable 👍🏽

Copy link
Member

@pshapiro4broad pshapiro4broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have comments, but it looks OK and I think it's good to get a fix out sooner vs addressing all comments.

Dataset dataset = null;
try {
if (summary != null) {
summary.storage(storageResourceDao.getStorageResourcesByDatasetId(summary.getId()));
dataset = new Dataset(summary);
dataset.tables(tableDao.retrieveTables(dataset.getId()));
relationshipDao.retrieve(dataset);
assetDao.retrieve(dataset);
if (retrieveRelationship) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why these weren't joins in the first place instead of separate db calls. I also wonder how many other times we're making multiple db calls in a case where a single query could be used.

But I agree that this is probably the simplest fix for this problem right now

* @param id in UUID format
* @return a Dataset object
*/
public Dataset retrieveForIngest(UUID id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have a way to get a dataset without relationships, could we put this dataset in the flight map? It looked like the problem with serialization before was due to the relationship structure, due its recursive definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can. There were concerns raised in Datahood Design Discussions that loading the dataset into the flight map could be problematic if the dataset changes between its storage in the map and its usage later on in the flight… but we could guard against that by loading it into the working map after locking the dataset.

I thought this would be worth pursuing if / when we modify the dataset types as discussed below.

@@ -483,6 +483,11 @@ public boolean delete(UUID id) {
return rowsAffected > 0;
}

public Dataset retrieve(UUID id, boolean retrieveRelationship, boolean retrieveAsset) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a hazard to have a single "dataset" object whose contents may be initialized differently depending on how it's created, as we don't keep track of which type of initialization was used.

A stricter approach would either store the retrieval flags in the dataset and then throw if an invalid getter was called, or (even better) having different dataset types depending on which fields are present, so invalid access would be known at compile time. Both of these are out of scope for this PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I could see this biting us down the line while also seeing the current approach as good for now. I'd like to talk with you more about alternative approaches here -- I had also thought about different dataset types but would appreciate a thought partner to solidify what that would look like.

Copy link
Contributor

@fboulnois fboulnois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with many of @pshapiro4broad 's suggestions but I think they're out of scope for this PR. In terms of a much more narrow scoped PR I think this one is in good shape!

@okotsopoulos
Copy link
Contributor Author

Thanks all for the reviews and feedback -- @pshapiro4broad , @fboulnois , @rushtong . I've filed this follow-on ticket for future work and will merge this PR now: https://broadworkbench.atlassian.net/browse/DCJ-600

@okotsopoulos okotsopoulos merged commit 7fc9295 into develop Aug 12, 2024
13 checks passed
@okotsopoulos okotsopoulos deleted the okotsopo-dataset-ingest-fewer-db-calls branch August 12, 2024 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants