-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WOR-1544] #major Populate MDC for flight and its steps #125
Conversation
Wrap flight thread execution so that flight threads are context-aware. Logs emitted during a flight thread now have the context from the calling thread, along with flight-specific context (flight ID and class). Logs emitted during the execution of a step additionally include step-specific context (step class, direction, and index). Handles the case where a flight thread spawns a child flight by intentionally cleaning any leftover step-specific context from the calling thread. Drive-by: now that we include flight ID and class on all flight logs in the MDC, we can trim down the dismal failure log line.
In a follow-on PR, will automate spotlessCheck to gate PRs.
flightContext.getFlightClassName(), | ||
flightContext.getStepIndex(), | ||
flightContext.getStepClassName()); | ||
"{} (index {}) experienced DISMAL FAILURE: non-retryable error on undo", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by: now that we include flight ID and class on all flight logs in the MDC, we can trim down the dismal failure log line.
I also reformatted to lead with the failing step class based off of Stairway retro feedback that all of the useful information on a log line is at the end. Do we like this better? Unfortunately we don't get step context in MDC for free at this point -- we're outside of step execution.
And some other formatting and documentation changes.
The MdcHelper methods are only meant to be called within the Stairway implementation package.
Stairway leverages the underlying logging system's mapped diagnostic context (MDC) if available. | ||
|
||
MDC manages contextual information on a per-thread basis: as Stairway submits a Flight for processing, | ||
it **passes along the MDC of the calling thread** so that context isn't lost. It also further populates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Propagating the MDC from the calling thread makes me want to designate this as a minor version bump at minimum. Default behavior in this repo is patch bumps.
Reviewers, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with a more significant version bump. Though with all the dependency upgrades underway, I was already thinking we would do a major version bump very soon. Perhaps just do a major bump with this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I'll make it a major one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added #major to the PR title, which will get pulled into the commit message.
@@ -16,8 +16,7 @@ dependencies { | |||
implementation group: 'org.slf4j', name: 'slf4j-api', version: '1.7.36' | |||
|
|||
// For testing | |||
testImplementation group: 'org.junit.jupiter', name: 'junit-jupiter-api', version: '5.9.0' | |||
testImplementation group: 'org.junit.jupiter', name: 'junit-jupiter-engine', version: '5.9.0' | |||
testImplementation group: 'org.junit.jupiter', name: 'junit-jupiter', version: '5.10.2' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed this expanded dependency for parameterized tests, and upgraded while in the neighborhood.
Map<String, String> contextMap = MDC.getCopyOfContextMap(); | ||
return () -> { | ||
// Any leftover context on the thread will be fully overwritten: | ||
MDC.setContextMap(Optional.ofNullable(contextMap).orElse(Map.of())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble understanding this line. We are getting a copy of the context map at line 35, so why isn't line 38 effectively a no-op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, great question. Stairway spawns new threads for flight execution, so this Runnable within may be happening on a new thread without the prior context.
But this tells me that this code would be better organized if it's added to the Stairway thread pool implementation instead. I'll make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cahrens , I reorganized this code to live in StairwayThreadPool
and feel like it makes more sense now, thanks for the original comment. Would you mind taking another look?
Because Stairway flights are submitted to the StairwayThreadPool for execution, it makes more sense to update the threadpool to wrap task submission to handle setting up and clearing flight context for the duration of the flight run on the child thread. MdcHelper becomes MdcUtils, now a home for less involved utility methods for modifying the MDC. Expanded tests accordingly: StairwayThreadPool did not previously have targeted unit tests.
int active = activeTasks.incrementAndGet(); | ||
logger.debug("before: " + active); | ||
super.beforeExecute(t, r); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the docstring on this method, it's recommended to call the super at the end to allow for potential nesting.
(I also read the docstring on afterExecute
, and the super call at the beginning matches the recommendation.)
Removed checks on stairwayThreadPool.getActiveFlights, which is only used in debug logging.
@@ -47,7 +47,7 @@ jobs: | |||
distribution: 'temurin' | |||
cache: 'gradle' | |||
- name: Run tests | |||
run: ./gradlew test | |||
run: ./gradlew test --scan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generates a Gradle build scan as part of test output, which makes debugging tests a lot easier. Example from the most recent unit test run:
Publishing build scan...
https://gradle.com/s/gu4jckjm4mqzq
And revisiting MdcUtilsTest with some naming changes for clarity.
|
https://broadworkbench.atlassian.net/browse/WOR-1544
Centralizing work done in DataBiosphere/terra-workspace-manager#1626, wrap flight thread execution so that flight threads are context-aware.
Handles the case where a flight thread spawns a child flight by intentionally cleaning any leftover step-specific context from the calling thread.
In Action
I published my Stairway branch locally, pulled it in as a WSM dependency, removed the existing WSM code which modifies the MDC, deployed it to my Bee, and hit an endpoint that triggers a stairway flight.
The resulting logs have their MDC populated as described in this PR: https://cloudlogging.app.goo.gl/T9vdJ86RXRoa117VA
jsonPayload.requestId
comes from)A
![Screenshot 2024-02-29 at 7 44 47 PM](https://private-user-images.githubusercontent.com/79769153/309113032-15e6be29-fcc9-4813-ba8c-27ec0162e1fd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2MTY3MDcsIm5iZiI6MTczOTYxNjQwNywicGF0aCI6Ii83OTc2OTE1My8zMDkxMTMwMzItMTVlNmJlMjktZmNjOS00ODEzLWJhOGMtMjdlYzAxNjJlMWZkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE1VDEwNDY0N1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQxMWYyZTM5ODhjOTkwZDA0ZWUzMjI1ZTNkMTY3ZWQyNDM1MTNlYWE2NDAzM2FiNzhkZWQ0NDU4MGQ5ZmZlOGYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.a2Xx2EAyvpBZipSiV3tRJc1pHPk8Cm1sTrBy5c_l9Jw)
jsonPayload
for a log emitted during a step:All of the logs for the request, with
![Screenshot 2024-02-29 at 7 46 57 PM](https://private-user-images.githubusercontent.com/79769153/309113229-cb07f652-4a9e-4002-ba6b-27fc7ececc13.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2MTY3MDcsIm5iZiI6MTczOTYxNjQwNywicGF0aCI6Ii83OTc2OTE1My8zMDkxMTMyMjktY2IwN2Y2NTItNGE5ZS00MDAyLWJhNmItMjdmYzdlY2VjYzEzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE1VDEwNDY0N1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTdjNjdkMjAxYjJhYjk2ZDM5ZWJmODQ5ZjViOGI0MmJiNGEyNmRiMTVkYjk5NjMzYWMyMzBlODVjN2RmNzhmM2QmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.5MpHG3aNg5ECsEWP0gde0ZjnqrJlelro9MRIrZJPj3Y)
flightId
,flightStepNumber
, andflightStepDirection
added to the summary line:(As an aside, it's clear to me that these log lines can be substantially trimmed. Do we need the full formatted summary anymore if we have this info in the MDC, and can pull it in the summary line?)
How can my service get this new functionality?
Once this PR is merged, I'll update the dependency in TCL, which is where most services get Stairway from. At any rate, downstream services will need to update a dependency.
If downstream services presently modify their MDCs themselves, they will be able to remove that code in their service and let Stairway take care of it.