Fix multi-source convert path collision (#442) #444

Draft
kenibrewer wants to merge 3 commits into main from kenibrewer/issue-442

@kenibrewer
Member

Description

Fixes #442. When cytotable.convert() was called with multiple per-source subdirectories that share a parent directory name (e.g. analyses/{1,2,3}/analysis/), _source_pageset_to_parquet wrote each source's intermediate parquet to the same path, causing ArrowInvalid (concurrent write/read race) or FileNotFoundError (later writer overwrites). The fix inserts a short, stable SHA-1 hash of the source's parent path into the intermediate directory, guaranteeing per-source uniqueness; cleanup in _concat_source_group is extended to walk up the additional level. A regression test (test_convert_multi_source_colliding_parent_dir_names) reproduces the original failure on main and passes with the fix.
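
For reviewers, here is a minimal sketch of the idea behind the fix, not the actual code in `_source_pageset_to_parquet`. The helper name `intermediate_dir`, the directory layout, and the 8-character digest length are assumptions made for illustration only. It shows how mixing a short SHA-1 digest of the source's parent path into the intermediate directory keeps sibling sources apart even when their parent directories share a name.

```python
import hashlib
from pathlib import PurePosixPath


def intermediate_dir(
    dest_root: str, source_path: str, digest_len: int = 8
) -> PurePosixPath:
    """Build a per-source intermediate directory (hypothetical layout).

    The digest is derived from the full parent path, so two sources under
    analyses/1/analysis and analyses/2/analysis no longer map to the same
    directory even though both parents are named "analysis".
    """
    source = PurePosixPath(source_path)
    parent = source.parent
    # SHA-1 is used only as a path discriminator, not for security.
    digest = hashlib.sha1(
        str(parent).encode("utf-8"), usedforsecurity=False
    ).hexdigest()[:digest_len]
    return PurePosixPath(dest_root) / parent.name / digest / source.stem


# Three sibling sources that previously collided on "<dest>/analysis/Cells".
sources = [f"analyses/{i}/analysis/Cells.csv" for i in (1, 2, 3)]
dirs = [intermediate_dir("out", s) for s in sources]
```

Because the digest depends only on the parent path, repeated runs over the same inputs produce the same intermediate directories, which is what "stable" means here.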

What is the nature of your change?

  • Bug fix (fixes an issue).

Checklist

  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.

Sibling sources whose parent directories share a name (e.g.
analyses/{1,2,3}/analysis) wrote to identical intermediate parquet paths,
causing ArrowInvalid (race) or FileNotFoundError (overwrite). Add a stable
SHA-1 hash of the source's parent path to the intermediate directory, and
extend cleanup in _concat_source_group to remove the new dir and its parent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
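
The cleanup half of this commit can be pictured with the sketch below. It is an illustration under assumptions, not the body of `_concat_source_group`: the helper name `remove_empty_dirs_upward` and the `levels` parameter are hypothetical. The point is that adding a hash directory adds one more level that has to be removed once the intermediate files are gone, and that a shared parent must survive while sibling sources still use it.

```python
from pathlib import Path


def remove_empty_dirs_upward(start: Path, levels: int) -> list[Path]:
    """Remove `start` and up to `levels - 1` ancestors, stopping at the
    first directory that is missing or still has content."""
    removed: list[Path] = []
    current = start
    for _ in range(levels):
        try:
            # rmdir() only succeeds on an empty directory.
            current.rmdir()
        except OSError:
            # Non-empty (a sibling source still lives here) or already gone.
            break
        removed.append(current)
        current = current.parent
    return removed
```

Stopping at the first non-empty directory means concurrent sibling sources never lose their shared parent mid-run; the last source to finish is the one that removes it.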
@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 99c529fe-4652-4134-995b-86ebc2262a76

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@kenibrewer kenibrewer marked this pull request as draft April 29, 2026 16:24
pre-commit-ci-lite Bot and others added 2 commits April 29, 2026 16:25
- Mark sha1 path-discriminator hash as usedforsecurity=False (bandit B324).
- Narrow convert() return type with isinstance(dict) asserts in the new
  multi-source regression test so it passes mypy (test_convert.py is
  mypy-excluded; test_convert_threaded.py is not).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
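
The mypy change in this commit follows a common narrowing pattern, sketched below. `fake_convert` is a hypothetical stand-in with an assumed Union return type; it is not the real signature of `cytotable.convert()`. An `assert isinstance(..., dict)` lets the type checker treat the result as a dict for the rest of the test, and also fails fast at runtime if a different branch of the Union comes back.

```python
from typing import Any, Dict, List, Union


def fake_convert(join: bool) -> Union[Dict[str, List[Dict[str, Any]]], str]:
    """Hypothetical stand-in: returns a path when joining, else a mapping."""
    if join:
        return "joined.parquet"
    return {"cells": [{"table": ["cells.parquet"]}]}


result = fake_convert(join=False)
# Without this assert, mypy rejects .keys() because `result` may be a str.
assert isinstance(result, dict)
table_names = sorted(result.keys())
```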
Development

Successfully merging this pull request may close these issues.

Multi-source convert(): intermediate parquet paths collide across sources, causing write/read race (ArrowInvalid / FileNotFoundError)
