[DT-821]Possibly fix 401 and 500 error during dataset creation using retries #1865

Merged (5 commits) on Dec 11, 2024

@rjohanek rjohanek commented Dec 6, 2024

Jira ticket: https://broadworkbench.atlassian.net/browse/DT-821

Addresses

User issues in which the Google access token cannot be used in CreateDatasetRegisterIngestServiceAccountStep. This is unexpected: the token was just generated, so we'd expect it to be valid when calling Sam immediately afterwards. Our theory is that the service account created in the previous step, CreateDatasetCreateIngestServiceAccountStep, is not yet ready to use because a cloud operation internal to the GCS APIs has not yet completed. Retrying gives that operation more time to complete.

Summary of changes

Add a retry for all errors from Sam in CreateDatasetRegisterIngestServiceAccountStep. I chose to retry on all errors instead of just 401 and 500 because the issue is transient and cannot be tested right now. Since we have already seen at least two different error statuses returned from this call, we may see yet another status in the future. Additionally, retrying the step on failure is low overhead, even if the retry does not resolve the error in every case.
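For readers unfamiliar with the pattern, here is a minimal sketch of "retry the step on any error", not the code in this PR. `StepStatus`, `RegisterCall`, `RegisterStepSketch`, and `runWithRetries` are hypothetical stand-ins for the workflow framework and the Sam client, and the sketch omits any wait between attempts.

```java
// Hypothetical stand-in for the workflow framework's step result; not the real API.
enum StepStatus { SUCCESS, FAILURE_RETRY }

// Hypothetical stand-in for the Sam registration call made by the step.
interface RegisterCall {
  void register(String serviceAccountEmail) throws Exception;
}

class RegisterStepSketch {
  private final RegisterCall call;

  RegisterStepSketch(RegisterCall call) {
    this.call = call;
  }

  // Any exception from the call is reported as retryable, whatever its status.
  StepStatus doStep(String serviceAccountEmail) {
    try {
      call.register(serviceAccountEmail);
      return StepStatus.SUCCESS;
    } catch (Exception e) {
      return StepStatus.FAILURE_RETRY;
    }
  }

  // Tiny driver: re-runs the step up to maxAttempts times.
  // Returns the number of attempts used, or -1 if every attempt failed.
  static int runWithRetries(RegisterStepSketch step, String email, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (step.doStep(email) == StepStatus.SUCCESS) {
        return attempt;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated transient failure: the first two calls throw, the third succeeds.
    RegisterCall flaky = email -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("simulated 401");
      }
    };
    System.out.println(runWithRetries(new RegisterStepSketch(flaky), "sa@example.com", 5)); // prints 3
  }
}
```

In a real flight the framework, not a loop like this, would decide how often and how long to retry. Catching `Exception` keeps the step robust to unseen statuses, at the cost of also retrying on programming errors.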

Testing Strategy

Add unit tests

Question

What is the error thrown by Sam? In this PR, I catch the IamUnauthorized and IamNotFound exceptions, and I see other code doing the same (CreateSnapshotSamGroupNameStep), but I also see other places catching ApiExceptions from Sam (SnapshotBuilderService).

@rjohanek rjohanek requested a review from a team as a code owner December 6, 2024 20:03
@rjohanek rjohanek requested review from snf2ye and s-rubenstein and removed request for a team December 6, 2024 20:03
@fboulnois fboulnois left a comment

Looks good! 👍

@rjohanek rjohanek changed the title from "[DT-821]Possibly fix 401 and 404 error during dataset creation using retries" to "[DT-821]Possibly fix 401 and 500 error during dataset creation using retries" Dec 10, 2024
- iamService.registerUser(datasetServiceAccount);
+ try {
+   iamService.registerUser(datasetServiceAccount);
+ } catch (Exception e) {
Contributor

Is there something one step down from Exception we can use? Like is there a superclass for ApiException or something? If there is I'd love to use that, cause this will also catch NPEs and such. If not, no worries and I'm okay moving forwards with this.

Contributor Author

Yeah, that's a good idea, but, I'm not sure... I think this could throw an InterruptedException which directly extends Exception, or an ApiException which extends InternalServerException which extends ErrorReportException. We could catch one of those last two mentioned exceptions if we don't care about catching the InterruptedException. But, I'm not sure if that is the case, what do you think?
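To make the trade-off concrete, here is a rough sketch of the narrower catch being discussed. The three exception classes are local stand-ins that mirror the hierarchy described above (I'm assuming the base type is unchecked; the real classes may differ), and `SamCall` and `NarrowCatchSketch` are hypothetical names.

```java
// Local stand-ins mirroring the hierarchy described in the thread.
// Assumption: the base type is unchecked; the real classes may differ.
class ErrorReportException extends RuntimeException {
  ErrorReportException(String message) { super(message); }
}

class InternalServerException extends ErrorReportException {
  InternalServerException(String message) { super(message); }
}

class ApiException extends InternalServerException {
  ApiException(String message) { super(message); }
}

// Hypothetical stand-in for the call to Sam.
interface SamCall {
  void run() throws InterruptedException;
}

class NarrowCatchSketch {
  // Returns "success", "retry", or "interrupted". Unrelated bugs such as
  // NullPointerException are deliberately not caught and propagate.
  static String attempt(SamCall call) {
    try {
      call.run();
      return "success";
    } catch (ErrorReportException e) {
      // Covers ApiException and InternalServerException via the shared base type.
      return "retry";
    } catch (InterruptedException e) {
      // Restore the interrupt flag so callers can still observe it.
      Thread.currentThread().interrupt();
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    System.out.println(attempt(() -> { throw new ApiException("simulated 401 from Sam"); })); // prints retry
    try {
      attempt(() -> { throw new NullPointerException("simulated bug"); });
    } catch (NullPointerException e) {
      System.out.println("propagated"); // the NPE is not swallowed
    }
  }
}
```

Whether an interrupt should count as retryable is the open question here; the sketch only shows that it can be handled separately from Sam errors.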

@rjohanek rjohanek enabled auto-merge (squash) December 11, 2024 16:41
@rjohanek rjohanek merged commit b27a0f3 into develop Dec 11, 2024
14 checks passed
@rjohanek rjohanek deleted the rj/dt-821-add-401-retry branch December 11, 2024 18:08
4 participants