fix: atomic cache writes to prevent corruption from parallel workers #128
Conversation
null_dist_cached() writes .npy cache files non-atomically, causing corruption when multiple parallel workers race on the same cache key. Uses tempfile + os.replace for atomic writes and broadens the exception handler to catch EOFError and OSError in addition to ValueError.
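The write path described above can be sketched like this (an illustrative helper, not the PR's exact code; note this sketch uses `tempfile.mkstemp` rather than the deprecated `tempfile.mktemp` the PR mentions):

```python
import os
import tempfile

import numpy as np


def atomic_save(path, arr):
    """Write `arr` to `path` atomically: dump to a temp file in the
    same directory, then os.replace() it into place. os.replace is
    atomic on POSIX when source and target share a filesystem, so
    concurrent readers never observe a half-written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Creating the temp file in the target directory guarantees the
    # final rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".npy")
    try:
        with os.fdopen(fd, "wb") as fh:
            np.save(fh, arr)
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # Best-effort cleanup of the orphaned temp file on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers then either see the previous complete file or the new complete file; the `.npy` is never visible in a partially written state.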
My first impression was: is it necessary to save on the side and then replace the original file? I would expect that waiting a bit for the other worker to finish writing would do the trick most of the time.
Rewriting the npz seems unnecessary unless it is indeed corrupted, which in theory could be checked through a number of attempts. The PR is good, though; I just want to check whether we need to write the files multiple times whenever there is a race condition.
The question is: do multiple workers fail due to the race condition, or only the slowest one?
Based on the failure modes:
1) `EOFError: No data left in file`
2) `ValueError: Failed to read all data for array... file seems not fully written?`
3) Silent data corruption (MISMATCH: written != read back)
- Unclear what this actually means (in this context)
- This one would skip, wait, and retry later once the file is written
- This one wouldn't happen (if the worker that writes later cancels the operation and waits)
I think this would mean that we wouldn't need to write and overwrite the files, removing the need for `_atomic_commit`.
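The wait-and-retry alternative discussed here could be sketched as follows (a rough sketch, not copairs code; the function name and backoff parameters are made up):

```python
import time

import numpy as np


def load_with_retry(path, attempts=5, delay=0.2):
    """Try to read a cache file that another worker may still be
    writing; back off briefly between attempts instead of rewriting.
    Returns None if the file never becomes readable."""
    for i in range(attempts):
        try:
            return np.load(path)
        except (ValueError, EOFError, OSError):
            # Partial or missing file: likely another worker is
            # mid-write. Wait with linear backoff and retry.
            time.sleep(delay * (i + 1))
    return None
```

This handles modes 1 and 2 without any rewriting, but it cannot help with mode 3 (silent corruption), since the file loads "successfully" with wrong contents.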
Caching is a non-trivial task. Although not recently updated, https://github.com/grantjenks/python-diskcache seems to be a good middle ground between doing everything ourselves and big frameworks like Redis.
Use diskcache (SQLite-backed) for null distribution caching instead of raw np.save/np.load. This eliminates race conditions in parallel workers by leveraging SQLite's ACID guarantees rather than atomic file renames. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaced the hand-rolled `.npy` caching with diskcache.
I am not sure if we should be using a dependency that hasn't been updated in three years. Sure, the functionality shouldn't change, and if the library is complete (it seems to be), that helps; it is pure Python, which is great, but there is always a risk of things breaking with newer Python versions. My concern is that it is hard to know whether it is abandonware (according to some it is, based on the topics linked to grantjenks/python-diskcache#357). Some folks forked it and fixed the CVE (https://github.com/wandb/weave/pull/6389/changes), but that is a temporary patch. I'm not sure what the best course of action is, but I do like diskcache or something similar.
I have consulted with the High Council (@leoank and @gnodar01). The overall conclusion is that even if an issue comes up down the line, the library is small and simple enough that any of us could fork it and get it working with relative ease. We can squash and merge, provided the tests pass (I don't know why the workflow is not running).
Tests pass locally. I re-read the code and the change looks small enough to me. Feel free to bring up any issues; otherwise I will squash and merge at the end of the day.
Summary
`null_dist_cached()` writes `.npy` cache files non-atomically, causing corruption when multiple parallel workers race on the same cache key (same `n_total` and `k_num_pos`).

Problem
When copairs runs with parallel workers (via `multiprocessing.Pool`), two workers can race on the same cache file: one reads while another is still writing, so the reader sees a truncated file and raises `EOFError` or `ValueError`.

Three failure modes observed:
1) `EOFError: No data left in file`
2) `ValueError: Failed to read all data for array... file seems not fully written?`
3) Silent data corruption (MISMATCH: written != read back)

Additionally, the existing corruption handler only catches `ValueError`, missing `EOFError` and `OSError`.

Fix
Atomic writes via `tempfile.mktemp()` + `os.replace()` — write to a temp file in the same directory, then atomically rename. `os.replace` is atomic on POSIX (same filesystem), so readers either see the old complete file or the new complete file, never a partial write.

Broader exception handling — catch `(ValueError, EOFError, OSError)` on cache load to handle all corruption modes.

Stress test
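Combined, the two parts of the fix might look roughly like this (a sketch under assumed names; `compute` stands in for the real null-distribution computation, and `mkstemp` is used here in place of `mktemp`):

```python
import os
import tempfile

import numpy as np


def load_or_recompute(path, compute):
    """Return the cached array at `path`; on any of the corruption
    modes observed in the PR, fall back to recomputing and rewriting
    the cache atomically."""
    try:
        return np.load(path)
    except (ValueError, EOFError, OSError):
        # ValueError: truncated/garbled array data; EOFError: empty
        # file; OSError: missing or unreadable file. All are treated
        # as a cache miss.
        arr = compute()
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        with os.fdopen(fd, "wb") as fh:
            np.save(fh, arr)
        os.replace(tmp, path)  # atomic rewrite, as in the fix
        return arr
```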
Included test (`test_null_dist_cached_parallel`) spawns 16 workers all racing on the same cache key. Without the fix, every round fails. With the fix, all rounds pass.

Test plan
- `test_null_dist_cached`, `test_null_dist_cached_corrupt`
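A stress test along these lines could be sketched as follows (illustrative names and payload; the real test is `test_null_dist_cached_parallel` in the copairs suite):

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np


def worker(path):
    """Load the shared cache file, or compute and write it atomically
    on any corruption mode; returns a checksum of what was read."""
    try:
        return float(np.load(path).sum())
    except (ValueError, EOFError, OSError):
        arr = np.arange(100, dtype=float)  # stand-in payload
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
        with os.fdopen(fd, "wb") as fh:
            np.save(fh, arr)
        os.replace(tmp, path)  # atomic: readers never see a partial file
        return float(arr.sum())


def stress(n_workers=16):
    """All workers race on one cache path; with atomic writes every
    worker must come back with the same complete data."""
    path = os.path.join(tempfile.mkdtemp(), "null_dist.npy")
    with mp.Pool(n_workers) as pool:
        results = pool.map(worker, [path] * n_workers)
    return all(r == results[0] for r in results)
```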