Robust rmtree #4440

Closed
wants to merge 4 commits into from

Conversation

@jfennick
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

This PR catches some additional exceptions I've been encountering on my Slurm cluster. At this time, I believe they are caused by the network file system (Panasas), but I don't have a way of testing that in isolation. I'm not sure if you will be able to reproduce this, but if you have suggestions, let me know.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@mr-c
Contributor

mr-c commented Apr 19, 2023

Thanks! I've pushed this up as https://github.com/DataBiosphere/toil/tree/issues/4440-robust-rmtree for testing

@jfennick
Contributor Author

I'm getting one more stack trace:

    Traceback (most recent call last):
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/worker.py", line 390, in workerScript
        with fileStore.open(job):
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/contextlib.py", line 119, in __enter__
        return next(self.gen)
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/fileStores/nonCachingFileStore.py", line 66, in open
        self._removeDeadJobs(self.coordination_dir)
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/fileStores/nonCachingFileStore.py", line 196, in _removeDeadJobs
        for jobState in cls._getAllJobStates(coordination_dir):
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/fileStores/nonCachingFileStore.py", line 251, in _getAllJobStates
        yield NonCachingFileStore._readJobState(fname)
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/fileStores/nonCachingFileStore.py", line 262, in _readJobState
        state = dill.load(fH)
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/dill/_dill.py", line 272, in load
        return Unpickler(file, ignore=ignore, **kwds).load()
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/dill/_dill.py", line 419, in load
        obj = StockUnpickler.load(self)
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/pickle.py", line 1222, in load
        key = read(1)
      File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/pickle.py", line 311, in read
        return self.file_read(n)
    OSError: [Errno 5] Input/output error

This call site doesn't have a try ... except block, and it's not clear to me how to recover from I/O failure when unpickling. (Just retry?)

FYI: I installed Slurm on an EC2 instance and copied over the exact slurm.conf file from my cluster (except that I only have one CPU, etc.), and I'm only seeing these problems on my cluster, which has the Panasas network file system.

I also tried installing Slurm on my local machine through Windows Subsystem for Linux, but I ran into numerous problems and was not able to get it working.

@mr-c
Contributor

mr-c commented Apr 19, 2023

This call site doesn't have a try ... except block, and it's not clear to me how to recover from I/O failure when unpickling. (Just retry?)

Yes, I would from toil.lib.retry import retry and add @retry(errors=[OSError]) to

def _readJobState(jobStateFileName: str) -> Dict[str, str]:
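
For concreteness, the decorated method might look roughly like this (a sketch only: the body of _readJobState is paraphrased from the traceback above, and the surrounding class is shown just for context):

    from typing import Dict

    import dill

    from toil.lib.retry import retry

    class NonCachingFileStore:  # existing class, shown for context only
        @staticmethod
        @retry(errors=[OSError])
        def _readJobState(jobStateFileName: str) -> Dict[str, str]:
            # On any OSError (e.g. the EIO seen above on Panasas), the retry
            # decorator re-invokes the whole read from the top.
            with open(jobStateFileName, 'rb') as fH:
                state = dill.load(fH)
            return state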

Also: I'm surprised that you got Toil to work with PyPy!

@jfennick
Contributor Author

Based on some additional stack traces, I think I found something else:

https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/fileJobStore.py#L260-L265
https://stackoverflow.com/questions/41362016/rename-atomicity-and-nfs

This seems like a problem. Maybe those lines can be wrapped in a global_mutex?
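
As a sketch of that idea, the rename could be serialized with Toil's lock-file-based global_mutex context manager from toil.lib.threading (the directory and path variables below are illustrative, not the actual names in fileJobStore.py):

    import os

    from toil.lib.threading import global_mutex

    # Hold a cross-process mutex (backed by a lock file on the shared
    # filesystem) for the duration of the rename, so that non-atomic rename
    # semantics on NFS-like filesystems cannot race another worker.
    with global_mutex(job_store_dir, 'job-file-rename'):  # hypothetical names
        os.rename(tmp_path, final_path)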

@DailyDreaming
Member

@jfennick Thank you for submitting this! Just one comment.

    except OSError as exc:
        if exc.errno == 16:
            # 'Device or resource busy'
            return

Member

Hmmm... would it be better for this function to retry for a short period of time on OSError 16 rather than immediately returning?
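
A bounded retry along those lines might look like this (a hypothetical helper, not existing Toil code):

    import errno
    import os
    import time

    def unlink_with_busy_retry(path: str, timeout: float = 30.0) -> None:
        """Retry os.unlink for up to `timeout` seconds while the file is busy."""
        deadline = time.monotonic() + timeout
        delay = 0.5
        while True:
            try:
                os.unlink(path)
                return
            except OSError as exc:
                # Only EBUSY (errno 16) is worth waiting out; re-raise anything
                # else, and give up once the deadline has passed.
                if exc.errno != errno.EBUSY or time.monotonic() >= deadline:
                    raise
            time.sleep(delay)
            delay = min(delay * 2, 5.0)  # exponential backoff, capped at 5 s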

Contributor Author

I just blindly copied the above except clause. You're probably right that retrying would be better.

Contributor Author

Okay, I removed the except OSError clauses from robust_rmtree and added @retry(errors=[OSError]), but now my log file looks like this:

[2023-04-20T13:41:59+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:00+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:01+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:02+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 2 s...
[2023-04-20T13:42:04+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 4 s...
[2023-04-20T13:42:08+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 8 s...
[2023-04-20T13:42:16+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 16 s...
...
[2023-04-20T13:42:32+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:33+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:34+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 1 s...
[2023-04-20T13:42:35+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 2 s...
[2023-04-20T13:42:37+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 4 s...
[2023-04-20T13:42:41+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 8 s...
[2023-04-20T13:42:49+0000] [MainThread] [W] [toil.lib.retry] Error in <function robust_rmtree at 0x00007fe76085ce80>: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'. Retrying after 16 s...
[2023-04-20T13:45:43+0000] [MainThread] [E] [toil.deferred] [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'
Traceback (most recent call last):
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/deferred.py", line 214, in cleanupWorker
    robust_rmtree(os.path.join(stateDirBase, cls.STATE_DIR_STEM))
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/retry.py", line 292, in call
    return func(*args, **kwargs)
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/io.py", line 57, in robust_rmtree
    robust_rmtree(child_path)
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/retry.py", line 292, in call
    return func(*args, **kwargs)
  File "/home/fennickjr/mambaforge-pypy3/envs/wic/lib/pypy3.9/site-packages/toil/lib/io.py", line 75, in robust_rmtree
    os.unlink(path)
OSError: [Errno 16] Device or resource busy: b'/home/fennickjr/coorddir/57d4a87a1044516e91e442d91dae76ad/deferred/.panfs.b0810ac.1682012487623425000'

That retry block repeats many times, and after about 5 minutes it throws the stack trace anyway. Is this what we want? And it doesn't just do this once: the whole process repeats for various other lock files, so adding in this retry is killing throughput.

My cluster administrator is not yet convinced there's anything wrong with our cluster and/or with the Panasas file system, but I don't know.

@mr-c
Contributor

mr-c commented Apr 21, 2023

I think @retry(errors=[OSError]) is too broad to use on the entire robust_rmtree function; we only want to retry on exc.errno == 16 ('Device or resource busy'). Retrying on any OSError could mean attempting to remove a directory that was already removed before a transient error, which itself raises another OSError.

@adamnovak or @DailyDreaming , is there a clever way to use @retry for only OSError of a particular errno?

Member

I think there's a way to write an ErrorCondition that matches only some errors of a particular type. I'm not sure if it would work here.
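
For illustration, retrying only on particular errno values can be done with a small plain-Python decorator like the following (a sketch; whether toil.lib.retry's ErrorCondition can express the same errno filter is exactly the open question here):

    import errno
    import functools
    import time

    def retry_on_errnos(errnos, attempts=5, delay=1.0):
        """Retry the wrapped call, but only for OSErrors with a listed errno."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(attempts):
                    try:
                        return func(*args, **kwargs)
                    except OSError as exc:
                        # Unlisted errnos, or the final attempt, propagate.
                        if exc.errno not in errnos or attempt == attempts - 1:
                            raise
                    time.sleep(delay)
            return wrapper
        return decorator

    # e.g. apply @retry_on_errnos({errno.EBUSY}) to the unlink step in robust_rmtree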

@DailyDreaming
Member

Based on some additional stack traces, I think I found something else:

https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/fileJobStore.py#L260-L265 https://stackoverflow.com/questions/41362016/rename-atomicity-and-nfs

This seems like a problem. Maybe those lines can be wrapped in a global_mutex?

That sounds like a good idea to me. Might be worth making a separate PR just to keep things simple. Thanks again.

@jfennick
Contributor Author

That sounds like a good idea to me. Might be worth making a separate PR just to keep things simple. Thanks again.

Yeah, that's a bigger change so I was gonna keep it separate.

@mr-c
Contributor

mr-c commented Apr 21, 2023

@adamnovak mentioned this pull request May 4, 2023
@adamnovak
Member

I merged this in #4464.
