Robust rmtree #4440
Thanks! I've pushed this up as https://github.com/DataBiosphere/toil/tree/issues/4440-robust-rmtree for testing.
I'm getting one more stack trace:
This call site doesn't have a
FYI, I installed Slurm on an EC2 instance and copied over the exact slurm.conf file from my cluster (except that I only have 1 CPU, etc.), and I'm only seeing these problems on my cluster, which has the Panasas network file system. I also tried installing Slurm on my local machine through Windows Subsystem for Linux, but I ran into numerous problems and did not succeed.
Yes, I would
also: I'm surprised that you got toil to work with
Based on some additional stack traces, I think I found something else: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/fileJobStore.py#L260-L265 This seems like a problem. Maybe those lines can be wrapped in a
@jfennick Thank you for submitting this! Just one comment.
    except OSError as exc:
        if exc.errno == 16:
            # 'Device or resource busy'
            return
Hmmm... would it be better for this function to be retrying for a short period of time on OSError 16 rather than immediately returning?
I just blindly copied the above except clause. You're probably right that retrying would be better.
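The retry-for-a-short-period idea discussed above could look roughly like the following. This is a minimal sketch, not Toil's code: the helper name `unlink_with_busy_retry` and its `timeout` and `interval` parameters are hypothetical, and it uses `errno.EBUSY` in place of the literal 16 (the two are equal on Linux). It keeps trying only while the error is "Device or resource busy", gives up quietly at a deadline, and re-raises any other `OSError` immediately.

```python
import errno
import os
import time


def unlink_with_busy_retry(path, timeout=5.0, interval=0.5):
    """Remove a file, retrying only while the OS reports EBUSY.

    Hypothetical helper for illustration; not part of Toil.
    Returns True if the file is gone, False if it was still busy
    when the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.unlink(path)
            return True
        except FileNotFoundError:
            # Already removed by someone else; nothing left to do.
            return True
        except OSError as exc:
            if exc.errno != errno.EBUSY:
                # Any other failure is not the transient case; surface it.
                raise
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Bounding the wait with a deadline rather than an attempt count keeps the worst-case cost per file predictable, which matters when many busy files are cleaned up in sequence.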
Okay, I removed the except OSError clauses from robust_rmtree and added @retry(errors=[OSError]), but now my log file looks like this:
[2023-04-20T13:41:59+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:00+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:01+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:02+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 2 s...
[2023-04-20T13:42:04+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 4 s...
[2023-04-20T13:42:08+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 8 s...
[2023-04-20T13:42:16+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 16 s...
...
[2023-04-20T13:42:32+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:33+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:34+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:35+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 2 s...
[2023-04-20T13:42:37+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 4 s...
[2023-04-20T13:42:41+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 8 s...
[2023-04-20T13:42:49+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 16 s...
[2023-04-20T13:45:43+0000] [MainThread] [E] [toil.deferred] [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'
Traceback (most recent call last):
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/deferred.py", line 214, in cleanupWorker
    robust_rmtree(os.path.join(stateDirBase, cls.STATE_DIR_STEM))
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/retry.py", line 292, in call
    return func(*args, **kwargs)
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/io.py", line 57, in robust_rmtree
    robust_rmtree(child_path)
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/retry.py", line 292, in call
    return func(*args, **kwargs)
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/io.py", line 75, in robust_rmtree
    os.unlink(path)
OSError: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'
That retry block is repeating many times, and after about 5 minutes it throws the stack trace anyway. Is this what we want? And it doesn't just do this once. This whole process repeats for various other lock files, so adding in this retry is killing throughput.
My cluster administrator is not yet convinced there's anything wrong with our cluster and/or with the panasas file system, but IDK.
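One plausible reading of the traceback above is that the retries compound: the retry wrapper appears once per recursion level, so each recursive call to the decorated function runs its own retry loop inside its caller's retry loop. The toy sketch below illustrates that effect under stated assumptions. The `retry` decorator here is a hypothetical stand-in with no sleeping (it is not `toil.lib.retry`), and `remove_tree` is a made-up recursive function whose leaf always fails with errno 16.

```python
import functools


def retry(max_attempts):
    """Toy retry decorator (hypothetical; not toil.lib.retry), no sleeping."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except OSError:
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator


unlink_attempts = 0


@retry(max_attempts=3)
def remove_tree(depth):
    """Stand-in for a recursive delete whose leaf file is always busy."""
    global unlink_attempts
    if depth > 0:
        # The recursive call goes through the decorated name, so it gets
        # its own retry loop nested inside the caller's retry loop.
        remove_tree(depth - 1)
    else:
        unlink_attempts += 1
        raise OSError(16, "Device or resource busy")


try:
    remove_tree(2)
except OSError:
    pass

print(unlink_attempts)  # 3 levels of 3 attempts each: 3 ** 3 = 27
```

With exponential backoff added at every level, the total wait grows in the same multiplicative way, which would be consistent with the backoff sequence restarting in the log and with the several-minute delay before the final stack trace.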
I think @retry(errors=[OSError]) is too broad to use on the entire robust_rmtree function; we only want to retry when exc.errno == 16 ('Device or resource busy'). Retrying on every OSError could lead to attempting to remove a directory that was already removed on an earlier attempt that hit a transient error, and thus to another OSError.
@adamnovak or @DailyDreaming, is there a clever way to use @retry for only an OSError of a particular errno?
I think there's a way to write an ErrorCondition that matches only some errors of a particular type. I'm not sure if it would work here.
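Independently of how Toil's own ErrorCondition is configured, the general technique of matching only some errors of a type can be sketched with the standard library alone. Everything named below (`retry_if`, `is_busy`, `remove_file`, and the `attempts` and `delay` parameters) is hypothetical and for illustration only. The decorator asks a predicate whether a caught exception is retryable and re-raises everything else at once.

```python
import errno
import functools
import os
import time


def retry_if(predicate, attempts=3, delay=0.1):
    """Retry only when predicate(exc) is true; re-raise anything else at once."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on a non-matching error.
                    if attempt == attempts or not predicate(exc):
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


def is_busy(exc):
    """Match only OSError with errno EBUSY ('Device or resource busy')."""
    return isinstance(exc, OSError) and exc.errno == errno.EBUSY


@retry_if(is_busy, attempts=5, delay=0.5)
def remove_file(path):
    # Non-recursive on purpose, so retry loops cannot nest.
    os.unlink(path)
```

Applying the decorator to a small non-recursive leaf operation, rather than to the whole recursive function, keeps a missing-file error from a previous partial attempt out of the retry path.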
That sounds like a good idea to me. Might be worth making a separate PR just to keep things simple. Thanks again.
Yeah, that's a bigger change so I was gonna keep it separate.
I've re-pushed this up to https://github.com/DataBiosphere/toil/tree/issues/4440-robust-rmtree
I merged this in #4464.
Changelog Entry
To be copied to the draft changelog by merger:
This PR catches some additional exceptions I've been encountering on my slurm cluster. At this time, I believe they are caused by the network file system (panasas), but I don't have a way of testing that in isolation. I'm not sure if you will be able to reproduce this, but if you have suggestions let me know.