-
Notifications
You must be signed in to change notification settings - Fork 9
Trigger a profile before terminating worker upon timeout #120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/ReTestItems.jl
Outdated
@@ -554,7 +554,8 @@ function manage_worker( | |||
# Handle the exception | |||
if e isa TimeoutException | |||
@debugv 1 "Test item $(repr(testitem.name)) timed out. Terminating worker $worker" | |||
terminate!(worker) | |||
trigger_profile(worker, :timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would probably be nice to make it configurable whether or not the trigger_profile is run, and with that the amount of time to wait (currently you have it at 120) to do the profile printing.
# `JULIA_PROFILE_PEEK_HEAP_SNAPSHOT`, `Profile.set_peek_duration`, `Profile.peek_report[]` | ||
# See https://docs.julialang.org/en/v1/stdlib/Profile/#Triggered-During-Execution | ||
function trigger_profile(w::Worker, from::Symbol=:manual) | ||
if !Sys.iswindows() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not windows? because there isn't a SIGINFO equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! Looks like i did this because upstream Julia doesn't support this functionality on Windows... but idk why not
This works for me locally, not sure whats causing the failures on CI. From the errors it seem to me that we're not flushing the profile contents to the file. Perhaps no yield point occurred before the worker crashed? But why it would be different locally... This remains a WIP, but if people have ideas, I'd be happy to hear them. Otherwise I'm going to look at the signal handling and profiling code some more. |
I think is ready for review, cc: @nickrobinson251. One unfortunate thing is that since we moved the log printing past the point where we kill the worker, we now also include the stacktrace from when the worker dies. Hopefully that is not too bad, since at that point the worker was timing out, so the stacktrace might be relevant, but I can see how it could be confusing to people. See this example: Click to see the logs
|
yeah, too bad about the SIGNAL 15, but hopefully that can be resolved by #152 if we can ever get that PR fixed up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small question and suggestion only.
@@ -253,13 +259,14 @@ function runtests( | |||
nworkers = max(0, nworkers) | |||
retries = max(0, retries) | |||
timeout = ceil(Int, testitem_timeout) | |||
timeout_profile_wait = ceil(Int, timeout_profile_wait) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will throw an InexactError
if you pass Inf
. Maybe that's fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah Inf
s, NaN
s and negative values will throw at some point, but I don't think this is an issue in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a check for the negative numbers as we do the same thing for timeout
Fixes #105