QuickSort hangs #104
@chriselrod I'm wondering if the hang is data-dependent. If that's the case, can you get the hang more quickly with

seed = Ref(0)
@btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[] += 1), 0:0.01:1, 1_000_000))

and then reproduce it using the same seed with

@btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[]), 0:0.01:1, 1_000_000))

? Also, it would be great if you could try |
julia> while true; @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)); end
...
seed[] += 1 = 34162
seed[] += 1 = 34163
seed[] += 1 = 34164
seed[] += 1 = 34165
seed[] += 1 = 34166
seed[] += 1 = 34167
seed[] += 1 = 34168
seed[] += 1 = 34169
seed[] += 1 = 34170
seed[] += 1 = 34171
seed[] += 1 = 34172
seed[] += 1 = 34173
seed[] += 1 = 34174
seed[] += 1 = 34175
seed[] += 1 = 34176
seed[] += 1 = 34177
seed[] += 1 = 34178
seed[] += 1 = 34179
seed[] += 1 = 34180
seed[] += 1 = 34181
seed[] += 1 = 34182
seed[] += 1 = 34183
seed[] += 1 = 34184
seed[] += 1 = 34185
seed[] += 1 = 34186
seed[] += 1 = 34187
seed[] += 1 = 34188
seed[] += 1 = 34189
seed[] += 1 = 34190
seed[] += 1 = 34191
seed[] += 1 = 34192
seed[] += 1 = 34193
seed[] += 1 = 34194
seed[] += 1 = 34195
seed[] += 1 = 34196
seed[] += 1 = 34197
seed[] += 1 = 34198
seed[] += 1 = 34199
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_mutex_unlock at /home/chriselrod/Documents/languages/julia/src/locks.h:143 [inlined]
jl_task_get_next at /home/chriselrod/Documents/languages/julia/src/partr.c:441
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
jl_apply at /home/chriselrod/Documents/languages/julia/src/julia.h:1685 [inlined]
jl_finish_task at /home/chriselrod/Documents/languages/julia/src/task.c:198
start_task at /home/chriselrod/Documents/languages/julia/src/task.c:697
unknown function (ip: (nil))

It finally hung with

julia> using ThreadsX, BenchmarkTools, Random
julia> seed = Ref(34199)
Base.RefValue{Int64}(34199)
julia> @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[]), 0:0.01:1, 1_000_000));
2.749 ms (21066 allocations: 10.11 MiB) |
Thanks for trying it out! So it's more like a timing thing...? I asked people in the #multithreading Slack channel if they have any ideas about this. |
Have you tried to reproduce it with

using BenchmarkTools, ThreadsX, Random
seed = Ref(0);
while true; @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)); end

? |
I tried it a bit and then switched to Do you mind sharing |
I couldn't get the hang, but I get this part, too. Meaning that Ctrl-C during the |
No hang? Interesting. How long did you run it? I can provide the full lscpu output if you need it, but for now I've truncated the details on vulnerabilities and flags:

> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 36
On-line CPU(s) list: 0-35
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Stepping: 7
CPU MHz: 4080.897
CPU max MHz: 4600.0000
CPU min MHz: 1200.0000
BogoMIPS: 6000.00
Virtualization: VT-x
L1d cache: 576 KiB
L1i cache: 576 KiB
L2 cache: 18 MiB
L3 cache: 24.8 MiB
NUMA node0 CPU(s):   0-35

It doesn't always crash. I just got:

^CERROR: InterruptException:
Stacktrace:
[1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
[2] wait at ./task.jl:709 [inlined]
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] _wait(::Task) at ./task.jl:238
[5] sync_end(::Array{Any,1}) at ./task.jl:294
[6] macro expansion at ./task.jl:335 [inlined]
[7] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:97
[8] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
[9] _sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
[10] #sort!#86 at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
[11] sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:156 [inlined]
[12] ##core#354(::Array{Float64,1}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
[13] ##sample#355(::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
[14] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#353")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:411
[15] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#353")}, ::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:399
[16] #invokelatest#1 at ./essentials.jl:710 [inlined]
[17] invokelatest at ./essentials.jl:709 [inlined]
[18] #run_result#37 at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
[19] run_result at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined] (repeats 2 times)
[20] top-level scope at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:483
[21] top-level scope at REPL[16]:1
[22] eval(::Module, ::Any) at ./boot.jl:331
[23] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:130
[24] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
[25] top-level scope at none:0

This is the issue for the crash on interrupt exceptions: JuliaLang/julia#34184 |
Sadly, no. I ran it for ~40 minutes last time and then gave up.
Thanks, this is enough. I was mostly wondering about the NUMA topology, as I saw a strange crash before when using multiple NUMA nodes (not sure if it's related at all, though). |
Ah, I was actually using a Julia nightly I downloaded a while ago (1.5.0-DEV.464). Now that I've switched to 1.5.0-DEV.526, I get the hang! I get it with

@chriselrod Thanks for the help! Sorry for the extra work that could have been avoided if I were a bit more cautious. |
Did you try

I wanted to bisect, so I jumped back to 1.5.0-DEV.460, and got the hang there. |
You are right. I tried to bisect the bug and it hangs with 1.5.0-DEV.465 (and so presumably with 1.5.0-DEV.464) too. While bisecting I noticed that it can sometimes take more than 20 minutes to get a hang on my machine, so maybe I was too impatient before. I also switched to Docker to run this (since the host OS is too old to build Julia), so that may affect things somewhat... Anyway, I'll try to run this overnight with a larger range while no one is using the machine.

Bisection script (used with

run(`make -j$(Sys.CPU_THREADS)`)
code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
@btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""
const EXIT_CODE = Ref(2)
jl = run(
pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
wait = false,
)
try
pid = getpid(jl)
start = time_ns()
top = open(pipeline(`top -b -d 10 -p$pid`; stderr = stderr))
try
unknown_line = 0
while true
local i
while true
ln = readline(top)
if match(r"^ *PID ", ln) != nothing
i = findfirst(==("%CPU"), split(ln))
if i == nothing
error("Cannot parse `top` header:\n", ln)
end
break
end
end
ln = readline(top)
columns = split(ln)
if string(pid) in columns
unknown_line = 0
pcpu = parse(Float64, columns[i])
@info "%CPU = $pcpu"
minutes = (time_ns() - start) / 1000^3 / 60
if minutes < 1
@info "Ignoring %CPU for first 1 minute... $((time_ns() - start) / 1000^3) seconds passed."
continue
end
if pcpu < 10.0
EXIT_CODE[] = 1
break
end
if minutes > 30
@info "More than 30 minutes without a hang."
EXIT_CODE[] = 0
break
end
@info "$minutes minutes without a hang...."
else
if jl.exitcode >= 0
error("Unexpected exit")
end
unknown_line > 10 && error("Too many parse failures")
unknown_line += 1
@error "Cannot parse `top` output. Continuing..." ln
end
end
finally
kill(top)
end
finally
kill(jl)
end
exit(EXIT_CODE[]) |
That's incredible -- really neat script! I ran

I'm running your script now, and will let you know how it goes. |
Great! This is good news! Thanks for figuring out the version where it works. I did think the bug might be in the scheduler, but if it weren't, then I'd be hopelessly lost. |
I modified the script to use

run(`make -j$(Sys.CPU_THREADS)`)
code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
@btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""
const EXIT_CODE = Ref(2)
jl = run(
pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
wait = false,
)
try
pid = getpid(jl)
io = IOBuffer()
cmd = pipeline(`ps -p$pid -h -o %cpu`, stdout = io)
start = time_ns()
while true
sleep(10)
run(cmd)
pcpu = parse(Float64, String(take!(io)))
@info "%CPU = $pcpu"
minutes = (time_ns() - start) / 1000^3 / 60
if minutes < 1
@info "Ignoring %CPU for first 1 minute... $((time_ns() - start) / 1000^3) seconds passed."
continue
end
if pcpu < 10.0
EXIT_CODE[] = 1
break
end
if minutes > 30
@info "More than 30 minutes without a hang."
EXIT_CODE[] = 0
break
end
@info "$minutes minutes without a hang...."
end
finally
kill(jl)
end
exit(EXIT_CODE[])

Yeah, it's also fortunate that this bug doesn't seem to affect 1.4 or any released version of Julia. |
Hmm... it's too bad that we can't rely on that.
This may be OS-dependent, but when I tried using |
You're right, I just ran into this problem. It hung, but the reported %CPU was declining only slowly from >700% (with 8 threads); it wouldn't have dropped below 10% before the 30 minutes were up. |
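A quick sketch, with hypothetical numbers (not from the actual runs), of why `ps`'s lifetime-average %CPU misses the hang for a long time, while computing %CPU from deltas of cumulative CPU time catches it within one sampling interval:

```julia
# Hypothetical numbers (not measured): a process burns ~700% CPU for 30
# minutes, then deadlocks and uses no CPU afterwards.
lifetime_pcpu(cpu_s, wall_s) = 100 * cpu_s / wall_s          # what `ps -o %cpu` reports
interval_pcpu(c1, c0, t1, t0) = 100 * (c1 - c0) / (t1 - t0)  # per-interval, from cumulative CPU time

busy = 30 * 60 * 7.0  # CPU-seconds accumulated before the hang

# Ten minutes into the hang, the lifetime average still looks alive:
lifetime_pcpu(busy, 40 * 60)                 # 525.0 -- far above the 10% threshold
# ...but the per-interval figure catches it on the very next sample:
interval_pcpu(busy, busy, 40 * 60, 39 * 60)  # 0.0
```

With the lifetime average, the 10% threshold would only trip after many hours of idling; the per-interval version trips on the first sample taken after the hang.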
I think this may work:

run(`make -j$(Sys.CPU_THREADS)`)
code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
@btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""
const EXIT_CODE = Ref(2)
jl = run(
pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
wait = false,
)
try
start = time_ns()
pid = getpid(jl)
io = IOBuffer()
cmd = pipeline(`ps -p$pid -h -o cputimes`, stdout = io)
current_cpu = last_cpu = 0.0;
last_time = current_time = 0.0;
while true
sleep(10)
run(cmd)
current_cpu = parse(Float64, String(take!(io)))
current_time = (time_ns() - start) * 1e-9
pcpu = 100.0 * (current_cpu - last_cpu) / (current_time - last_time)
@info "%CPU = $pcpu"
minutes = current_time / 60
if minutes < 1
@info "Ignoring %CPU for first 1 minute... $(current_time) seconds passed."
continue
end
if pcpu < 10.0
EXIT_CODE[] = 1
break
end
if minutes > 120
@info "More than 120 minutes without a hang."
EXIT_CODE[] = 0
break
end
@info "$minutes minutes without a hang...."
last_time = current_time; last_cpu = current_cpu;
end
finally
kill(jl)
end
exit(EXIT_CODE[])

Above it's set to go for 2 hours before deciding that things work. I decided to try the following commit a few times:

It may also be worth modifying the script to run multiple instances in parallel. I get the deadlock with 8 threads, and the CPU has 18 physical cores, so it's probably worth running three instances of 8 threads in parallel and declaring the commit bad if any of them locks up.

EDIT: |
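The parallel variant suggested above could be sketched as follows. This is only a sketch: `bisect.jl` is a hypothetical filename for the monitoring script, which is assumed to exit with code 1 when it detects a hang.

```julia
# Launch three 8-thread instances of the (hypothetical) bisect.jl monitor
# and declare the commit bad if any of them reports a hang (exit code 1).
procs = withenv("JULIA_NUM_THREADS" => "8") do
    [run(pipeline(`$(Base.julia_cmd()) bisect.jl`; stdout = stdout, stderr = stderr);
         wait = false) for _ in 1:3]
end
foreach(wait, procs)
exit(any(p -> p.exitcode == 1, procs) ? 1 : 0)
```

Since the hang is sporadic, requiring all three instances to survive the full window should reduce the chance of marking a bad commit as good.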
Could this be a problem?

signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811

That is: |
Yes, my bisection points to JuliaLang/julia#32599 (which includes JuliaLang/julia@65b8e7e), too. After this PR (JuliaLang/julia@dc46ddd) I can get the hang but not before the PR (JuliaLang/julia@f5dbc47). Somehow it was hard to get a hang with JuliaLang/julia@dc46ddd. It took 18 minutes to get a deadlock one time. I missed this commit the first time since it didn't cause the deadlock for 30 minutes. I ran the script for one hour for JuliaLang/julia@f5dbc47 once and it didn't cause the hang. |
Which Julia revision did you use to get the segmentation fault?
The last run with JuliaLang/julia@f5dbc47 (before JuliaLang/julia#32599) caused another segmentation fault.

But, staring at the code (lines 160 to 172 in 2e630bf), I don't know how it'd cause a segfault. |
Looking at the code in your stacktrace #104 (comment), I'm puzzled why it'd cause a segmentation fault. Is it possible that the stacktrace is broken?
|
I'm running the MWE with

Edit: I gave up after
So far, I think we discovered that:
@chriselrod Do you agree? |
That was 1.5.0-DEV.300. JuliaLang/julia@65b8e7e
I gave up after

I think you may be right, but it's really hard to be confident given how sporadic these hangs are.

EDIT: |
So I just filed JuliaLang/julia#35341 with a summary of the things we've discovered so far. I'm also trying to build |
On @chriselrod's machine, running the quicksort benchmark hangs with 8 threads. When it hangs, the CPU usage is 0%. Originally reported here: https://discourse.julialang.org/t/ann-threadsx-jl-parallelized-base-functions/36666/14
At the moment, it seems that the MWE is
while true; @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000))); end
Stacktrace: