-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DT-821]Possibly fix 401 and 500 error during dataset creation using retries #1865
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! 👍
iamService.registerUser(datasetServiceAccount); | ||
try { | ||
iamService.registerUser(datasetServiceAccount); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something one step down from Exception we can use? Like is there a superclass for ApiException
or something? If there is I'd love to use that, cause this will also catch NPEs and such. If not, no worries and I'm okay moving forwards with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good idea, but, I'm not sure... I think this could throw an InterruptedException which directly extends Exception, or an ApiException which extends InternalServerException which extends ErrorReportException. We could catch one of those last two mentioned exceptions if we don't care about catching the InterruptedException. But, I'm not sure if that is the case, what do you think?
Quality Gate passedIssues Measures |
Jira ticket: https://broadworkbench.atlassian.net/browse/DT-821
Addresses
User issues involving the google access token not being able to be used in CreateDatasetRegisterIngestServiceAccountStep. This is unexpected since the token was just generated and we’d expect it to be valid when calling Sam immediately afterwards. We have a theory that the service account, which was created in the previous step, CreateDatasetCreateIngestServiceAccountStep, is not yet ready to use, pending an incomplete cloud operation internal to the GCS APIs. So, by retrying, we are giving it more time to complete.
Summary of changes
Add a retry for all errors from Sam in the CreateDatasetRegisterIngestServiceAccountStep. I chose to add a retry for all errors instead of just a 401 and a 500 since the issue is transient and cannot be tested right now. If we have seen at least two different error statuses returned from this call, we may see a different error status in the future. Additionally, it is low overhead to retry the step on a failure, even if the retry does not solve the error in all cases.
Testing Strategy
Add unit tests
Question
What is the error thrown by Sam? In this PR, I add a catch for the IamUnauthorized and IamNotFound exceptions and I see other examples doing that in the code (CreateSnapshotSamGroupNameStep), but I also see other places catching ApiExceptions from sam (SnapshotBuilderService)