fix: atomic cache writes to prevent corruption from parallel workers #128
Conversation
null_dist_cached() writes .npy cache files non-atomically, causing corruption when multiple parallel workers race on the same cache key. Uses tempfile + os.replace for atomic writes and broadens the exception handler to catch EOFError and OSError in addition to ValueError.
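The write path described above can be sketched like this (an illustrative helper, not the PR's exact code; note this sketch uses `tempfile.mkstemp` rather than the deprecated `tempfile.mktemp` the PR mentions):

```python
import os
import tempfile

import numpy as np


def atomic_save(path, arr):
    """Write `arr` to `path` atomically: dump to a temp file in the
    same directory, then os.replace() it into place. os.replace is
    atomic on POSIX when source and target share a filesystem, so
    concurrent readers never observe a half-written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Creating the temp file in the target directory guarantees the
    # final rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".npy")
    try:
        with os.fdopen(fd, "wb") as fh:
            np.save(fh, arr)
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # Best-effort cleanup of the orphaned temp file on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers then either see the previous complete file or the new complete file; the `.npy` is never visible in a partially written state.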
My first impression was: is it necessary to save on the side and then replace the original file? I would expect that waiting a bit for the other worker to finish writing would do the trick most of the time.
Rewriting the npz seems unnecessary unless it is indeed corrupted, which in theory could be checked through a number of attempts. The PR is good, though; I just want to check whether we need to write the files multiple times whenever there is a race condition.
The question is: do multiple workers fail due to the race condition, or only the slowest one?
Based on the failure modes:
1) `EOFError: No data left in file`
2) `ValueError: Failed to read all data for array... file seems not fully written?`
3) Silent data corruption (MISMATCH: written != read back)
- Unclear what this actually means (in this context)
- This one would skip, wait, and retry later once the file is written
- This one wouldn't happen (if the worker that writes later cancels the operation and waits)
I think this would mean that we wouldn't need to write and overwrite the files, removing the need for `_atomic_commit`.
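The wait-and-retry alternative discussed here could be sketched as follows (a rough sketch, not copairs code; the function name and backoff parameters are made up):

```python
import time

import numpy as np


def load_with_retry(path, attempts=5, delay=0.2):
    """Try to read a cache file that another worker may still be
    writing; back off briefly between attempts instead of rewriting.
    Returns None if the file never becomes readable."""
    for i in range(attempts):
        try:
            return np.load(path)
        except (ValueError, EOFError, OSError):
            # Partial or missing file: likely another worker is
            # mid-write. Wait with linear backoff and retry.
            time.sleep(delay * (i + 1))
    return None
```

This handles modes 1 and 2 without any rewriting, but it cannot help with mode 3 (silent corruption), since the file loads "successfully" with wrong contents.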
Caching is a non-trivial task. Although not recently updated, https://github.com/grantjenks/python-diskcache seems to be a good middle ground between doing everything ourselves and big frameworks like Redis.
Use diskcache (SQLite-backed) for null distribution caching instead of raw np.save/np.load. This eliminates race conditions in parallel workers by leveraging SQLite's ACID guarantees rather than atomic file renames. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaced the hand-rolled `.npy` caching with diskcache.
I am not sure if we should be using a dependency that hasn't been updated in three years. Sure, the functionality shouldn't change, and if the library is complete (it seems to be), that helps; it is pure Python, which is great, but there is always a risk of things breaking with newer Python versions. My concern is that it is hard to know whether it is abandonware (according to some it is, based on the topics linked to grantjenks/python-diskcache#357). Some folks forked it and fixed the CVE (https://github.com/wandb/weave/pull/6389/changes), but that is a temporary patch. I'm not sure what the best course of action is, but I do like diskcache or something similar.
I have consulted with the High Council (@leoank and @gnodar01). The overall conclusion is that even if an issue comes up down the line, the library is small and simple enough that any of us could fork it and get it working with relative ease. We can squash and merge, provided the tests pass (I don't know why the workflow is not running).
Tests pass locally. I re-read the code and the change looks small enough to me. Feel free to bring up any issues; otherwise I will squash and merge at the end of the day.
Summary
`null_dist_cached()` writes `.npy` cache files non-atomically, causing corruption when multiple parallel workers race on the same cache key (same `n_total` and `k_num_pos`).

Problem
When copairs runs with parallel workers (via `multiprocessing.Pool`), two workers can race on the same cache file: one reads while another is still writing, so the reader sees a truncated file and raises `EOFError` or `ValueError`.

Three failure modes observed:
1) `EOFError: No data left in file`
2) `ValueError: Failed to read all data for array... file seems not fully written?`
3) Silent data corruption (MISMATCH: written != read back)

Additionally, the existing corruption handler only catches `ValueError`, missing `EOFError` and `OSError`.

Fix
Atomic writes via `tempfile.mktemp()` + `os.replace()` — write to a temp file in the same directory, then atomically rename. `os.replace` is atomic on POSIX (same filesystem), so readers either see the old complete file or the new complete file, never a partial write.

Broader exception handling — catch `(ValueError, EOFError, OSError)` on cache load to handle all corruption modes.

Stress test
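Combined, the two parts of the fix might look roughly like this (a sketch under assumed names; `compute` stands in for the real null-distribution computation, and `mkstemp` is used here in place of `mktemp`):

```python
import os
import tempfile

import numpy as np


def load_or_recompute(path, compute):
    """Return the cached array at `path`; on any of the corruption
    modes observed in the PR, fall back to recomputing and rewriting
    the cache atomically."""
    try:
        return np.load(path)
    except (ValueError, EOFError, OSError):
        # ValueError: truncated/garbled array data; EOFError: empty
        # file; OSError: missing or unreadable file. All are treated
        # as a cache miss.
        arr = compute()
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        with os.fdopen(fd, "wb") as fh:
            np.save(fh, arr)
        os.replace(tmp, path)  # atomic rewrite, as in the fix
        return arr
```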
Included test (`test_null_dist_cached_parallel`) spawns 16 workers all racing on the same cache key. Without the fix, every round fails. With the fix, all rounds pass.

Test plan
- `test_null_dist_cached`, `test_null_dist_cached_corrupt`
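A stress test along these lines could be sketched as follows (illustrative names and payload; the real test is `test_null_dist_cached_parallel` in the copairs suite):

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np


def worker(path):
    """Load the shared cache file, or compute and write it atomically
    on any corruption mode; returns a checksum of what was read."""
    try:
        return float(np.load(path).sum())
    except (ValueError, EOFError, OSError):
        arr = np.arange(100, dtype=float)  # stand-in payload
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
        with os.fdopen(fd, "wb") as fh:
            np.save(fh, arr)
        os.replace(tmp, path)  # atomic: readers never see a partial file
        return float(arr.sum())


def stress(n_workers=16):
    """All workers race on one cache path; with atomic writes every
    worker must come back with the same complete data."""
    path = os.path.join(tempfile.mkdtemp(), "null_dist.npy")
    with mp.Pool(n_workers) as pool:
        results = pool.map(worker, [path] * n_workers)
    return all(r == results[0] for r in results)
```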