fix: atomic cache writes to prevent corruption from parallel workers #128

Merged
afermg merged 2 commits into main from fix/atomic-cache-writes
Apr 3, 2026
Conversation

@shntnu
Member

@shntnu shntnu commented Mar 25, 2026

Summary

null_dist_cached() writes .npy cache files non-atomically, causing corruption when multiple parallel workers race on the same cache key (same n_total and k_num_pos).

Problem

When copairs runs with parallel workers (via multiprocessing.Pool), two workers can:

  1. Both see a cache file as missing → both compute → both write → one truncates the other mid-write
  2. One worker writes while another reads → partial read → EOFError or ValueError

Three failure modes observed:

  • EOFError: No data left in file
  • ValueError: Failed to read all data for array... file seems not fully written?
  • Silent data corruption (MISMATCH: written != read back)

Additionally, the existing corruption handler only catches ValueError, missing EOFError and OSError.

Fix

  1. Atomic writes via tempfile.mktemp() + os.replace() — write to a temp file in the same directory, then atomically rename. os.replace is atomic on POSIX (same filesystem), so readers either see the old complete file or the new complete file, never a partial write.

  2. Broader exception handling — catch (ValueError, EOFError, OSError) on cache load to handle all corruption modes.
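
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `atomic_write_bytes` and `load_or_compute` are hypothetical, `pickle` stands in for `np.save`/`np.load` so the sketch runs with only the standard library, and it uses `tempfile.mkstemp()` rather than the `tempfile.mktemp()` named above, because `mktemp()` is deprecated and can race on the temp file name itself.

```python
import os
import pickle
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write payload so that readers never observe a partially written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the SAME directory as the target so the final
    # rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())
        # Readers see either the old complete file or the new complete file.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_or_compute(path: Path, compute):
    """Return the cached value, recomputing if the file is missing or unreadable."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)  # stand-in for np.load (assumption)
    except (ValueError, EOFError, OSError, pickle.UnpicklingError):
        # FileNotFoundError is an OSError, so a cold cache lands here too.
        value = compute()
        atomic_write_bytes(path, pickle.dumps(value))
        return value
```

Two workers that miss at the same time will still both compute and both write, but each write lands as a whole file, so the loser's result is simply overwritten by an identical one.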

Stress test

Included test (test_null_dist_cached_parallel) spawns 16 workers all racing on the same cache key. Without the fix, every round fails. With the fix, all rounds pass.

# Without fix (PyPI copairs 0.5.2):
20/20 rounds had failures — cache is NOT race-safe

# With fix:
All 20 rounds passed with 32 parallel workers — cache is race-safe
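
For readers who want to reproduce this kind of check, here is a hedged sketch of a stress harness. It is not the PR's `test_null_dist_cached_parallel`: the toy cache `cached_value`, the file name `null_dist.bin`, and the helpers `run_round` and `stress` are all made up for illustration, `pickle` stands in for `.npy` I/O, and threads stand in for the `multiprocessing.Pool` workers so the example stays self-contained.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

EXPECTED = list(range(10_000))  # stand-in for a null distribution


def cached_value(path: Path):
    """Toy cache under test: load if readable, else compute and write atomically."""
    try:
        return pickle.loads(path.read_bytes())
    except (ValueError, EOFError, OSError, pickle.UnpicklingError):
        value = list(EXPECTED)
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        with os.fdopen(fd, "wb") as fh:
            fh.write(pickle.dumps(value))
        os.replace(tmp, path)
        return value


def run_round(cache_dir: Path, n_workers: int) -> int:
    """Race n_workers on one cold cache key; return the number of bad results."""
    path = cache_dir / "null_dist.bin"
    path.unlink(missing_ok=True)  # every round starts with a cold cache
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda _: cached_value(path), range(n_workers)))
    return sum(r != EXPECTED for r in results)


def stress(n_rounds: int = 20, n_workers: int = 16) -> int:
    """Return how many rounds produced at least one bad result."""
    with tempfile.TemporaryDirectory() as tmp:
        return sum(run_round(Path(tmp), n_workers) > 0 for _ in range(n_rounds))
```

Any exception raised inside a worker propagates out of `pool.map`, so a crash counts as a failed run rather than being silently dropped.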

Test plan

  • Existing tests pass (test_null_dist_cached, test_null_dist_cached_corrupt)
  • New parallel stress test passes
  • Verified fix in production pipeline (jump_production, 79 parallel copairs jobs)

null_dist_cached() writes .npy cache files non-atomically, causing
corruption when multiple parallel workers race on the same cache key.
Uses tempfile + os.replace for atomic writes and broadens the exception
handler to catch EOFError and OSError in addition to ValueError.
Collaborator

@afermg afermg left a comment

My first impression was: is it necessary to save on the side and then replace the original file? I would expect that waiting a bit for the other worker to finish writing would do the trick most of the time.

Rewriting the .npy file seems unnecessary unless it is indeed corrupted, which could in theory be checked over a number of read attempts. The PR is good though; I just want to check whether we need to write the files multiple times whenever there is a race condition.

The question is: do multiple workers fail due to the race condition, or only the slowest one?

Based on the failure modes:

  1. EOFError: No data left in file. I am unclear about what this actually means in this context.
  2. ValueError: Failed to read all data for array... file seems not fully written? The reader would skip, wait, and retry later once the file is written.
  3. Silent data corruption (MISMATCH: written != read back). This wouldn't happen if the worker that writes later cancels the operation and waits.

I think this would mean that we wouldn't need to write and overwrite the files, and it would remove the need for _atomic_commit.
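
One way the wait-and-retry idea above could be realized is sketched below. This is only an interpretation of the suggestion, not code from the PR: the names `get_or_compute` and `load_with_retry`, the retry count, and the delay are assumptions, and `pickle` again stands in for `.npy` I/O. The first worker claims the key by creating the file exclusively; later workers skip writing and poll until the file reads cleanly.

```python
import pickle
import time
from pathlib import Path


def load_with_retry(path: Path, attempts: int = 50, delay: float = 0.05):
    """Poll until the file unpickles cleanly, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            return pickle.loads(path.read_bytes())
        except (ValueError, EOFError, OSError, pickle.UnpicklingError):
            time.sleep(delay)  # the writer has probably not finished yet
    raise TimeoutError(f"{path} never became readable")


def get_or_compute(path: Path, compute):
    """Let exactly one worker write; every other worker waits and reads."""
    try:
        # Mode "xb" raises FileExistsError if another worker already created
        # the file, so a later writer backs off instead of overwriting.
        with open(path, "xb") as fh:
            value = compute()
            pickle.dump(value, fh)
        return value
    except FileExistsError:
        return load_with_retry(path)
```

The trade-off is visible in the sketch: if the claiming worker crashes after creating the file, an empty or partial file stays behind and every other worker eventually times out, which is a failure mode the atomic-rename approach does not have.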

@johnarevalo
Member

Caching is a non-trivial task. The .copairs impl was a quick and dirty approach. At this point it could be worth looking at specialized lightweight libraries that support concurrency, storage limits, LRU policies and so on.

Although not recently updated, https://github.com/grantjenks/python-diskcache seems to hit the right spot between doing everything ourselves and big frameworks like Redis.

Use diskcache (SQLite-backed) for null distribution caching instead of
raw np.save/np.load. This eliminates race conditions in parallel workers
by leveraging SQLite's ACID guarantees rather than atomic file renames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@shntnu
Member Author

shntnu commented Mar 28, 2026

Replaced the hand-rolled .npy caching with diskcache per @johnarevalo's suggestion. SQLite-backed ACID guarantees handle concurrency properly, so the atomic-write workaround and corruption recovery are no longer needed. Net reduction of ~40 lines.
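
For context, a get-or-compute pattern on top of diskcache might look like the sketch below. It is not the code from this PR: `open_cache` and `get_or_compute` are hypothetical names, and it assumes only diskcache's documented `Cache(directory)`, `Cache.get(key, default=...)` and `Cache.set(key, value)` calls. The import sits inside `open_cache` so the rest of the sketch runs even where diskcache is not installed.

```python
from typing import Any, Callable, Hashable

_MISSING = object()  # sentinel so that a cached None is not treated as a miss


def open_cache(directory: str):
    """Open a SQLite-backed cache directory (requires `pip install diskcache`)."""
    import diskcache  # third-party; imported lazily on purpose

    return diskcache.Cache(directory)


def get_or_compute(cache, key: Hashable, compute: Callable[[], Any]) -> Any:
    """Return cache[key], computing and storing the value on a miss."""
    value = cache.get(key, default=_MISSING)
    if value is _MISSING:
        value = compute()
        # Each set is a single transaction, so a concurrent reader sees
        # either no entry or the complete value.
        cache.set(key, value)
    return value
```

With diskcache installed, a call might look like `get_or_compute(open_cache("/tmp/null-dist-cache"), (n_total, k_num_pos), compute)`, where the directory is a made-up example and the key mirrors the `n_total` and `k_num_pos` pair from the summary. Two workers can still both compute on a simultaneous miss; the difference from the raw-file approach is that neither write can leave a half-written entry.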

@afermg
Collaborator

afermg commented Mar 30, 2026

I am not sure if we should be using a dependency that hasn't been updated in three years. Sure, the functionality shouldn't change if the library is complete, and it is pure Python, which is great, but there is always a risk of things breaking as Python itself changes. My concern is that it is hard to know if it is abandonware (according to some it is, based on the topics linked to grantjenks/python-diskcache#357). Some folks forked it and fixed the CVE (https://github.com/wandb/weave/pull/6389/changes), but that is a temporary patch. I'm not sure what the best course of action is, but I do like diskcache or something similar.

@shntnu
Member Author

shntnu commented Mar 30, 2026

@afermg - I'm not pondering this too deeply, but it is certainly an option to revert b4e8679 and build on the previous one (which has no additional dependencies)

@afermg
Collaborator

afermg commented Mar 30, 2026

I have consulted with the High Council (@leoank and @gnodar01). The overall conclusion is that even if there is an issue down the line the library is small and simple enough that any of us could fork it and get it working with relative ease. We can squash and merge, provided the tests pass (I don't know why the workflow is not running).

@afermg
Collaborator

afermg commented Mar 30, 2026

Tests pass locally. I re-read the code and the change looks small enough to me. Feel free to bring up any issues, otherwise I will squash and merge at the end of the day.

❄️impure .venv ❯  pytest
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-9.0.1, pluggy-1.6.0
rootdir: /home/amunoz/projects/copairs
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 67 items                                                             

tests/test_build_rank_multilabel.py .                                    [  1%]
tests/test_compute.py ...........                                        [ 17%]
tests/test_hierarchical_fdr.py ....                                      [ 23%]
tests/test_map.py ............                                           [ 41%]
tests/test_map_filter.py ....                                            [ 47%]
tests/test_matching.py ..........                                        [ 62%]
tests/test_matching_any.py ....                                          [ 68%]
tests/test_matching_multilabel.py ........                               [ 80%]
tests/test_normalization.py .....                                        [ 88%]
tests/test_normalization_integration.py ....                             [ 94%]
tests/test_reference_index.py .                                          [ 95%]
tests/test_replicating.py ...                                            [100%]

=============================== warnings summary ===============================
tests/test_matching.py::test_raise_distjoint
  /home/amunoz/projects/copairs/src/copairs/matching.py:494: DeprecationWarning: Passing strings to 'sameby' is deprecated and will be removed in v0.5.3.
    sameby, diffby = _validate(sameby, diffby)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 67 passed, 1 warning in 46.17s ========================
~/projects/copairs remotes/cytomining/fix/atomic-cache-writes* ≡ 46s
❄️impure .venv ❯  

@afermg afermg merged commit 620ec95 into main Apr 3, 2026
7 checks passed
3 participants