Corruption reported upon concurrent directory modification #1061
Hi @tpwrules, thanks for reporting the issue, and the easily reproducible test case. The good news is this doesn't appear to be actual filesystem corruption, but just a logic mistake caused by dir handles being incorrectly marked as "removed" during iteration.

I've pushed up a fix here: #1068, let me know if you see any issues with it. I've also added a slightly more aggressive version of your test to hopefully prevent a regression.

This probably deserves more testing, but this is really just a quick fix. There are some bigger changes that affect dirs bubbling down the pipeline, and these include much more aggressive tests:

$ ./scripts/test.py -L test_dread
case flags perms
test_dread_tell - 20/20
test_dread_rewind - 20/20
test_dread_seek - 20/20
test_dread_read_idempotent - 5/8
test_dread_read_neighbor_mkdirs - 270/270
test_dread_read_neighbor_rms - 270/270
test_dread_read_neighbor_mvs - 270/270
test_dread_read_with_mkdirs - 540/864
test_dread_read_with_rms - 375/600
test_dread_read_with_mvs - 6750/9000
test_dread_read_with_2_mkdirs - 2700/5184
test_dread_read_with_2_rms - 1500/3000
test_dread_read_with_2_mvs - 5400/9000
test_dread_read_rm - 72/72
test_dread_read_rm_remkdir - 72/72
test_dread_recursive_rm r 90/120
test_dread_recursive_mv r 270/360

I feel like I'm responding in a weird order, but now that the actionable info is out of the way...
I agree. Concurrent modification can get pretty hairy, but at the very least you would expect it not to claim corruption. I think returning all untouched entries should also be pretty doable (at least I'm struggling to come up with a case where this isn't true in littlefs's current design).

I think issues come up when directories are built on hash tables. It can be difficult to know what you've already seen if a directory modification causes all the entries to get reshuffled. Fortunately, littlefs instead relies on directories being sorted, so it doesn't have that issue (even with future plans for B-trees).
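To illustrate the sorted-directory point, here is a toy sketch (not littlefs code): an iterator over a sorted directory can track its position by the last name it returned rather than by raw index, so a concurrent remove never shifts untouched entries out from under the cursor.

// Toy illustration only, not littlefs code. With a sorted directory, an
// iterator can remember the last name it returned and resume at the first
// entry that sorts after it, so removed entries can't cause untouched
// entries to be skipped or repeated.
#include <stdio.h>
#include <string.h>

static const char *entries[8] = {"a", "b", "c", "d"};
static int count = 4;

// next entry sorting strictly after `last` (NULL means start of directory)
static const char *next_after(const char *last) {
    for (int i = 0; i < count; i++) {
        if (!last || strcmp(entries[i], last) > 0) {
            return entries[i];
        }
    }
    return NULL;
}

static void remove_entry(const char *name) {
    for (int i = 0; i < count; i++) {
        if (strcmp(entries[i], name) == 0) {
            memmove(&entries[i], &entries[i+1],
                    (count - i - 1) * sizeof(entries[0]));
            count--;
            return;
        }
    }
}

int main(void) {
    const char *cursor = next_after(NULL);   // "a"
    printf("%s\n", cursor);
    remove_entry("b");                       // concurrent removal
    while ((cursor = next_after(cursor))) {
        printf("%s\n", cursor);              // "c", then "d": nothing skipped
    }
    return 0;
}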
Thanks for the clarification and the fix. We modified our application to hold a lock while opendir is active so that other filesystem operations can't disturb the state and break it. Do you think once this fix is in we can remove this modification, or do you suspect there might be other issues lurking? To be honest, if we didn't have another bug that completely exploded everything when readdir returned an error, we probably would never have noticed this.
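(As an aside, the workaround described above might look roughly like the sketch below. This is a hedged illustration only: it assumes a pthread-style mutex, while the actual application presumably uses its own RTOS primitive, and list_dir_locked is a made-up helper name.)

#include <pthread.h>
#include "lfs.h"

static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

// hypothetical helper: hold the lock for the whole open/read/close
// sequence so no other task can remove files while iteration is in flight
int list_dir_locked(lfs_t *lfs, const char *path) {
    pthread_mutex_lock(&fs_lock);

    lfs_dir_t dir;
    int err = lfs_dir_open(lfs, &dir, path);
    if (err) {
        pthread_mutex_unlock(&fs_lock);
        return err;
    }

    struct lfs_info info;
    int res;
    while ((res = lfs_dir_read(lfs, &dir, &info)) > 0) {
        // ... process info.name / info.type ...
    }

    lfs_dir_close(lfs, &dir);
    pthread_mutex_unlock(&fs_lock);
    return res; // 0 at end of directory, negative on error
}

// Any task that removes files must take the same lock:
//   pthread_mutex_lock(&fs_lock);
//   lfs_remove(lfs, "dir/file");
//   pthread_mutex_unlock(&fs_lock);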
Does this mean that, in a bugless vacuum etc., littlefs is currently POSIX compliant in this respect and will always return all untouched entries? Or are tweaks to make this happen a future goal?
In a bugless vacuum, yes. This is the current state of things. I think it's also safe to say this won't change without a major API version bump, and it's unlikely to change without on-disk format changes.
Hmm, this is a difficult question for me to answer. I believe the concurrent remove logic is sound, but I also believed the previous logic was sound. At the end of the day you can only trust test results, and the space of possible test permutations for concurrent removes is quite large. The test added in #1068 tests all possible removes with all possible cursor locations in a directory with 10 files. Is that sufficient in addition to the existing dir tests? That's sort of up to you. The problem is there can always be more tests.

It may also be useful to have application-level tests that check whatever logic depends on the behavior of concurrent removes. This would help build confidence that everything underneath your application is working as intended, and would in theory be portable to other filesystems. Not that this replaces the need for littlefs-level tests. There can always be more tests.

It is surprising that this went unnoticed for so long. I suspect it's a mixture of most embedded applications not really doing complex directory operations, users ignoring these kinds of errors, or users assuming concurrent modification => LFS_ERR_CORRUPT is expected. The fact that LFS_ERR_CORRUPT is a false positive here probably didn't help.
I have an embedded application where one part of the application might be iterating a directory while another deletes files from it. The outcome of the iteration isn't meaningful to the application in this case, but sometimes lfs_dir_read returns LFS_ERR_CORRUPT. I would expect it to return some subset of the files in the directory (possibly none at all) but complete without any errors.

POSIX says it is unspecified whether a directory iteration reports an entry that is added to or removed from the directory while the iteration is in progress, but I believe it is otherwise expected to report every entry. I'm not sure LittleFS needs to go that far and guarantee that entries which aren't touched will be returned (it would be nice if it did), but it at least should not claim corruption.
Is there any actual corruption occurring? Can file operations safely be continued as long as the directory is closed after the error is reported? Can LittleFS be easily fixed to not claim corruption?
To put it another way, I would not expect the below tests to fail, but they do on v2.10.1 at least. Iteration shouldn't give an error if a file is removed before reads start, nor if multiple files are removed between reads.
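(The tests referred to here aren't reproduced in this capture. As a rough sketch of the scenario only, not the reporter's original test code, something along the following lines exercises both cases against the public littlefs API; it assumes an already-mounted lfs_t and uses made-up file names.)

#include <assert.h>
#include "lfs.h"

// sketch: assumes `lfs` is already mounted and the filesystem is empty
void concurrent_remove_sketch(lfs_t *lfs) {
    // create a directory with a handful of files
    assert(lfs_mkdir(lfs, "dir") == 0);
    const char *files[] = {"dir/a", "dir/b", "dir/c", "dir/d"};
    for (int i = 0; i < 4; i++) {
        lfs_file_t f;
        assert(lfs_file_open(lfs, &f, files[i],
                             LFS_O_WRONLY | LFS_O_CREAT) == 0);
        assert(lfs_file_close(lfs, &f) == 0);
    }

    lfs_dir_t dir;
    struct lfs_info info;
    assert(lfs_dir_open(lfs, &dir, "dir") == 0);

    // case 1: a file is removed after opendir but before any reads
    assert(lfs_remove(lfs, "dir/a") == 0);
    assert(lfs_dir_read(lfs, &dir, &info) >= 0);

    // case 2: multiple files are removed between reads
    assert(lfs_remove(lfs, "dir/c") == 0);
    assert(lfs_remove(lfs, "dir/d") == 0);

    // expectation: the remaining reads finish cleanly, returning 0 at the
    // end of the directory rather than LFS_ERR_CORRUPT
    int res;
    while ((res = lfs_dir_read(lfs, &dir, &info)) > 0) {
        // each returned info.name should be an entry that wasn't removed
    }
    assert(res == 0);
    assert(lfs_dir_close(lfs, &dir) == 0);
}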