Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dir iteration being broken by concurrent removes #1068

Merged
merged 1 commit into from
Mar 20, 2025
Merged

Conversation

geky
Copy link
Member

@geky geky commented Feb 4, 2025

When removing a file, we mark all open handles as "removed" (pair={-1,-1}) to avoid trying to later read metadata that no longer exists. Unfortunately, this also includes open dir handles that happen to be pointing at the removed file, causing them to return LFS_ERR_CORRUPT on the next read.

The good news is this is not actual filesystem corruption, only a logic error in lfs_dir_read.

We actually already have logic in place to nudge the dir to the next id, but it was unreachable with the existing logic. I suspect this worked at one point but was broken during a refactor due to lack of testing.


Fortunately, all we need to do is not clobber the handle if the internal type is a dir. Then the dir-nudging logic can correctly take over.

I've also added test_dirs_remove_read to test this and prevent another regression, adapted from tests provided by @tpwrules that identified the original bug.

Found by @tpwrules
See #1061 for more info

When removing a file, we mark all open handles as "removed" (
pair={-1,-1}) to avoid trying to later read metadata that no longer
exists. Unfortunately, this also includes open dir handles that happen
to be pointing at the removed file, causing them to return
LFS_ERR_CORRUPT on the next read.

The good news is this is _not_ actual filesystem corruption, only a
logic error in lfs_dir_read.

We actually already have logic in place to nudge the dir to the next id,
but it was unreachable with the existing logic. I suspect this worked at
one point but was broken during a refactor due to lack of testing.

---

Fortunately, all we need to do is _not_ clobber the handle if the
internal type is a dir. Then the dir-nudging logic can correctly take
over.

I've also added test_dirs_remove_read to test this and prevent another
regression, adapted from tests provided by tpwrules that identified the
original bug.

Found by tpwrules
@geky geky force-pushed the fix-dir-remove-read branch from a39cb26 to caba4f3 Compare February 4, 2025 04:52
@geky-bot
Copy link
Collaborator

geky-bot commented Feb 4, 2025

Tests passed ✓, Code: 17112 B (+0.0%), Stack: 1448 B (+0.0%), Structs: 812 B (+0.0%)
Code Stack Structs Coverage
Default 17112 B (+0.0%) 1448 B (+0.0%) 812 B (+0.0%) Lines 2430/2592 lines (+0.1%)
Readonly 6222 B (+0.0%) 448 B (+0.0%) 812 B (+0.0%) Branches 1280/1610 branches (+0.0%)
Threadsafe 17964 B (+0.0%) 1448 B (+0.0%) 820 B (+0.0%) Benchmarks
Multiversion 17184 B (+0.0%) 1448 B (+0.0%) 816 B (+0.0%) Readed 29369693876 B (+0.0%)
Migrate 18776 B (+0.0%) 1752 B (+0.0%) 816 B (+0.0%) Proged 1482874766 B (+0.0%)
Error-asserts 17892 B (+0.0%) 1440 B (+0.0%) 812 B (+0.0%) Erased 1568888832 B (+0.0%)

@tpwrules
Copy link

Thanks a bunch, I was able to test in our application with this patch on top of 2.10.1 and it seems to fix the bug we observed.

Do you think the case of removing multiple files between directory reads is adequately tested? My tests from the issue do pass after this patch too.

@geky
Copy link
Member Author

geky commented Feb 17, 2025

Do you think the case of removing multiple files between directory reads is adequately tested? My tests from the issue do pass after this patch too.

I'm not opposed to accepting more tests :)

The problem is you can always have more testing, and my attention is elsewhere on a changeset that will eventually add more testing (rbyd). So hardening the current state of things is unfortunately not a priority for me right now.

I'll respond with more in #1061 (comment)

@geky geky changed the base branch from master to devel March 20, 2025 06:21
@geky geky merged commit 63ab1ff into devel Mar 20, 2025
109 checks passed
@geky
Copy link
Member Author

geky commented Mar 20, 2025

@tpwrules Sorry about the delay on bringing this in. Unrelated PRs delayed the release a bit. Thanks for raising the issue on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants