Skip to content

Conversation

nickrobinson251
Copy link
Member

@nickrobinson251 nickrobinson251 commented Nov 11, 2023

Fixes #105

@nickrobinson251 nickrobinson251 marked this pull request as draft November 20, 2023 23:06
@@ -554,7 +554,8 @@ function manage_worker(
# Handle the exception
if e isa TimeoutException
@debugv 1 "Test item $(repr(testitem.name)) timed out. Terminating worker $worker"
terminate!(worker)
trigger_profile(worker, :timeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would probably be nice to make it configurable whether or not the trigger_profile is run, and with that the amount of time to wait (currently you have it at 120) to do the profile printing.

# `JULIA_PROFILE_PEEK_HEAP_SNAPSHOT`, `Profile.set_peek_duration`, `Profile.peek_report[]`
# See https://docs.julialang.org/en/v1/stdlib/Profile/#Triggered-During-Execution
function trigger_profile(w::Worker, from::Symbol=:manual)
if !Sys.iswindows()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not windows? because there isn't a SIGINFO equivalent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Looks like i did this because upstream Julia doesn't support this functionality on Windows... but idk why not

@Drvi
Copy link
Collaborator

Drvi commented Apr 22, 2024

This works for me locally, not sure whats causing the failures on CI. From the errors it seem to me that we're not flushing the profile contents to the file. Perhaps no yield point occurred before the worker crashed? But why it would be different locally...

This remains a WIP, but if people have ideas, I'd be happy to hear them. Otherwise I'm going to look at the signal handling and profiling code some more.

@Drvi
Copy link
Collaborator

Drvi commented Apr 27, 2024

I think is ready for review, cc: @nickrobinson251. One unfortunate thing is that since we moved the log printing past the point where we kill the worker, we now also include the stacktrace from when the worker dies. Hopefully that is not too bad, since at that point the worker was timing out, so the stacktrace might be relevant, but I can see how it could be confusing to people. See this example:
EDIT: the profiles are no longer truncated

Click to see the logs
[ Info: Scanning for test items in project `ReTestItems` at paths: /home/tomasd/.julia/dev/ReTestItems/test/testfiles/_timeout_tests.jl
[ Info: Finished scanning for test items in 0.02 seconds. Scheduling 2 tests on pid 93068 with 1 worker processes and 3 threads per worker.
[ Info: Starting test workers
  Worker 93319:  [ Info: Starting test worker 1 on pid = 93319, with 3 threads
[ Info: Starting running test items
  Worker 93319:  14:20:35 | maxrss  0.8% | mem 31.8% | START (1/2) test item "Test item takes 60 seconds" at test/testfiles/_timeout_tests.jl:1

Captured Logs for test item "Test item takes 60 seconds" at test/testfiles/_timeout_tests.jl:1 on worker 93319

======================================================================================
Information request received. A stacktrace will print followed by a 1.0 second profile
======================================================================================

cmd: /home/tomasd/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/bin/julia 93319 running 0 of 3

signal (10): User defined signal 1
unknown function (ip: 0x7c19c3291115)
pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:883
ijl_task_get_next at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/partr.c:596
poptask at ./task.jl:921
wait at ./task.jl:930
task_done_hook at ./task.jl:634
jfptr_task_done_hook_49067.clone_1 at /home/tomasd/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
jl_finish_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:254
jl_threadfun at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/partr.c:305
unknown function (ip: 0x7c19c3294ac2)
unknown function (ip: 0x7c19c332684f)
unknown function (ip: (nil))
unknown function (ip: 0x7c19c3291115)
pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:883
ijl_task_get_next at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/partr.c:596
poptask at ./task.jl:921
wait at ./task.jl:930
wait at ./condition.jl:124
_trywait at ./asyncevent.jl:138
wait at ./asyncevent.jl:155 [inlined]
sleep at ./asyncevent.jl:240
jfptr_sleep_39693.clone_1 at /home/tomasd/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
do_call at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:215
eval_stmt_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:166 [inlined]
eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:612
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:750
jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:906
jl_eval_module_expr at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:203 [inlined]
jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:709
ijl_toplevel_eval_in at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
#87 at /home/tomasd/.julia/dev/ReTestItems/src/ReTestItems.jl:1000 [inlined]
with_source_path at /home/tomasd/.julia/dev/ReTestItems/src/ReTestItems.jl:831
#86 at /home/tomasd/.julia/dev/ReTestItems/src/ReTestItems.jl:1000 [inlined]
#redirect_stdio#683 at ./stream.jl:1411
redirect_stdio##kw at ./stream.jl:1374 [inlined]
_redirect_logs at /home/tomasd/.julia/dev/ReTestItems/src/log_capture.jl:122
#27 at /home/tomasd/.julia/dev/ReTestItems/src/log_capture.jl:107 [inlined]
#open#378 at ./io.jl:384
open at ./io.jl:381 [inlined]
_redirect_logs at /home/tomasd/.julia/dev/ReTestItems/src/log_capture.jl:107 [inlined]
macro expansion at /home/tomasd/.julia/dev/ReTestItems/src/macros.jl:82 [inlined]
#runtestitem#85 at /home/tomasd/.julia/dev/ReTestItems/src/ReTestItems.jl:999
runtestitem##kw at /home/tomasd/.julia/dev/ReTestItems/src/ReTestItems.jl:929
unknown function (ip: 0x7c19c1a2a985)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
do_call at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:215
eval_stmt_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:166 [inlined]
eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:612
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:750
jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:906
ijl_toplevel_eval_in at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
execute at /home/tomasd/.julia/dev/ReTestItems/src/workers.jl:333 [inlined]
macro expansion at /home/tomasd/.julia/dev/ReTestItems/src/workers.jl:347 [inlined]
#24 at ./threadingconstructs.jl:258
unknown function (ip: 0x7c19c1a10bd4)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
start_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:931
unknown function (ip: (nil))
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/epoll.c:236
uv_run at /workspace/srcdir/libuv/src/unix/core.c:400
ijl_task_get_next at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/partr.c:565
poptask at ./task.jl:921
wait at ./task.jl:930
wait at ./condition.jl:124
wait_readnb at ./stream.jl:416
eof at ./stream.jl:106
read at ./stream.jl:978
deserialize at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [inlined]
deserialize at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:800
serve_requests at /home/tomasd/.julia/dev/ReTestItems/src/workers.jl:339
startworker at /home/tomasd/.julia/dev/ReTestItems/src/workers.jl:309
unknown function (ip: 0x7c19c1a0ae1f)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
do_call at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:215
eval_stmt_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:166 [inlined]
eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:612
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:750
jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:906
jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:850
jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:850
ijl_toplevel_eval_in at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
exec_options at ./client.jl:276
_start at ./client.jl:522
jfptr__start_38041.clone_1 at /home/tomasd/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
true_main at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/jlapi.c:575
jl_repl_entrypoint at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/jlapi.c:719
main at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
unknown function (ip: 0x7c19c3229d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
unknown function (ip: (nil))

==============================================================
Profile collected. A report will print at the next yield point
==============================================================


signal (15): Terminated
in expression starting at /home/tomasd/.julia/dev/ReTestItems/test/testfiles/_timeout_tests.jl:2
unknown function (ip: 0x7c19c3291115)
pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:883
ijl_task_get_next at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/partr.c:596
poptask at ./task.jl:921
wait at ./task.jl:930
task_done_hook at ./task.jl:634
jfptr_task_done_hook_49067.clone_1 at /home/tomasd/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
jl_finish_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:254
start_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:942
unknown function (ip: (nil))
unknown function (ip: 0x7c19c3291115)
pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:883
ijl_task_get_next at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/partr.c:596
poptask at ./task.jl:921
wait at ./task.jl:930
wait at ./condition.jl:124
_trywait at ./asyncevent.jl:138
wait at ./asyncevent.jl:155 [inlined]
profile_printing_listener at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:39
#3 at ./threadingconstructs.jl:258
jfptr_YY.3_63490.clone_1 at /home/tomasd/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
start_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:931
unknown function (ip: (nil))
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/epoll.c:236
uv_run at /workspace/srcdir/libuv/src/unix/core.c:400
ijl_task_get_next at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/partr.c:565
poptask at ./task.jl:921
wait at ./task.jl:930
task_done_hook at ./task.jl:634
jfptr_task_done_hook_49067.clone_1 at /home/tomasd/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
jl_finish_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:254
start_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:942
unknown function (ip: (nil))
Allocations: 3923557 (Pool: 3921956; Big: 1601); GC: 4
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
Thread 1 Task 0x00007c19b6e00010 Total snapshots: 641. Utilization: 0%
   ╎641 @Base/client.jl:522; _start()
   ╎ 641 @Base/client.jl:276; exec_options(opts::Base.JLOptions)
   ╎  641 @Base/boot.jl:368; eval
   ╎   641 ...tems/src/workers.jl:309; startworker()
   ╎    641 ...tems/src/workers.jl:339; serve_requests(io::Sockets.TCP...
   ╎     641 ...c/Serialization.jl:800; deserialize(s::Sockets.TCPSoc...
   ╎    ╎ 641 ...c/Serialization.jl:813; deserialize
   ╎    ╎  641 @Base/stream.jl:978; read(this::Sockets.TCPSocket...
   ╎    ╎   641 @Base/stream.jl:106; eof(s::Sockets.TCPSocket)
   ╎    ╎    641 @Base/stream.jl:416; wait_readnb(x::Sockets.TCPS...
   ╎    ╎     641 @Base/condition.jl:124; wait(c::Base.GenericCondit...
   ╎    ╎    ╎ 641 @Base/task.jl:930; wait()
640╎    ╎    ╎  641 @Base/task.jl:921; poptask(W::Base.InvasiveL...

Thread 2 Task 0x00007c19b7322b30 Total snapshots: 641. Utilization: 0%
   ╎641 ...readingconstructs.jl:258; (::ReTestItems.Workers.var"#24#2...
   ╎ 641 ...Items/src/workers.jl:347; macro expansion
   ╎  641 ...tems/src/workers.jl:333; execute
   ╎   641 @Base/boot.jl:368; eval
   ╎    641 .../src/ReTestItems.jl:929; (::ReTestItems.var"#runtestite...
   ╎     641 ...src/ReTestItems.jl:999; runtestitem(ti::TestItem, ctx...
   ╎    ╎ 641 ...tems/src/macros.jl:82; macro expansion
   ╎    ╎  641 ...rc/log_capture.jl:107; _redirect_logs
   ╎    ╎   641 @Base/io.jl:381; open
   ╎    ╎    641 @Base/io.jl:384; open(::ReTestItems.var"#27#...
   ╎    ╎     641 ...c/log_capture.jl:107; #27
   ╎    ╎    ╎ 641 ...c/log_capture.jl:122; _redirect_logs(f::ReTestI...
   ╎    ╎    ╎  641 @Base/stream.jl:1374; redirect_stdio##kw
   ╎    ╎    ╎   641 @Base/stream.jl:1411; redirect_stdio(f::ReTest...
   ╎    ╎    ╎    641 ...ReTestItems.jl:1000; #86
   ╎    ╎    ╎     641 ...ReTestItems.jl:831; with_source_path(f::ReT...
   ╎    ╎    ╎    ╎ 641 ...eTestItems.jl:1000; #87
   ╎    ╎    ╎    ╎  641 @Base/boot.jl:368; eval
   ╎    ╎    ╎    ╎   641 ...asyncevent.jl:240; sleep(sec::Float64)
   ╎    ╎    ╎    ╎    641 ...asyncevent.jl:155; wait
   ╎    ╎    ╎    ╎     641 ...syncevent.jl:138; _trywait(t::Timer)
   ╎    ╎    ╎    ╎    ╎ 641 ...condition.jl:124; wait(c::Base.Generi...
   ╎    ╎    ╎    ╎    ╎  641 @Base/task.jl:930; wait()
640╎    ╎    ╎    ╎    ╎   641 @Base/task.jl:921; poptask(W::Base.In...

Thread 3 Task 0x00007c19b6efc010 Total snapshots: 641. Utilization: 0%
   ╎641 @Base/task.jl:634; task_done_hook(t::Task)
   ╎ 641 @Base/task.jl:930; wait()
640╎  641 @Base/task.jl:921; poptask(W::Base.InvasiveLinkedL...

@Drvi Drvi marked this pull request as ready for review April 27, 2024 13:26
@Drvi Drvi requested a review from quinnj April 27, 2024 13:27
@Drvi Drvi changed the title WIP: Trigger a profile before terminating worker upon timeout Trigger a profile before terminating worker upon timeout Apr 27, 2024
@NHDaly NHDaly self-requested a review April 29, 2024 17:57
@NHDaly
Copy link
Member

NHDaly commented Apr 29, 2024

yeah, too bad about the SIGNAL 15, but hopefully that can be resolved by #152 if we can ever get that PR fixed up

Copy link
Collaborator

@kpamnany kpamnany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small question and suggestion only.

@@ -253,13 +259,14 @@ function runtests(
nworkers = max(0, nworkers)
retries = max(0, retries)
timeout = ceil(Int, testitem_timeout)
timeout_profile_wait = ceil(Int, timeout_profile_wait)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will throw an InexactError if you pass Inf. Maybe that's fine?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah Infs, NaNs and negative values will throw at some point, but I don't think this is an issue in practice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a check for the negative numbers as we do the same thing for timeout

@Drvi Drvi enabled auto-merge (squash) May 6, 2024 16:35
@Drvi Drvi merged commit 94e3fc3 into main May 6, 2024
@Drvi Drvi deleted the npr-profile-timeouts branch May 6, 2024 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Print profile dump before killing worker when we timeout
5 participants