Corruption reported upon concurrent directory modification #1061
Hi @tpwrules, thanks for reporting the issue, and the easily reproducible test case. The good news is this doesn't appear to be actual filesystem corruption, but just a logic mistake caused by dir handles being incorrectly marked as "removed" during iteration.

I've pushed up a fix here: #1068, let me know if you see any issues with it. I've also added a slightly more aggressive version of your test to hopefully prevent a regression.

This probably deserves more testing, but this is really just a quick fix. There are some bigger changes that affect dirs bubbling down the pipeline, and these include much more aggressive tests:

$ ./scripts/test.py -L test_dread
case flags perms
test_dread_tell - 20/20
test_dread_rewind - 20/20
test_dread_seek - 20/20
test_dread_read_idempotent - 5/8
test_dread_read_neighbor_mkdirs - 270/270
test_dread_read_neighbor_rms - 270/270
test_dread_read_neighbor_mvs - 270/270
test_dread_read_with_mkdirs - 540/864
test_dread_read_with_rms - 375/600
test_dread_read_with_mvs - 6750/9000
test_dread_read_with_2_mkdirs - 2700/5184
test_dread_read_with_2_rms - 1500/3000
test_dread_read_with_2_mvs - 5400/9000
test_dread_read_rm - 72/72
test_dread_read_rm_remkdir - 72/72
test_dread_recursive_rm r 90/120
test_dread_recursive_mv r 270/360

I feel like I'm responding in a weird order, but now that the actionable info is out of the way...
I agree. Concurrent modification can get pretty hairy, but at the very least you would expect it not to claim corruption. I think returning all untouched entries should also be pretty doable (at least I'm struggling to come up with a case where this isn't true in littlefs's current design).

I think issues come up when directories are built on hash tables. It can be difficult to know what you've already seen if a directory modification causes all the entries to get reshuffled. Fortunately, littlefs instead relies on directories being sorted, so it doesn't have that issue (even with future plans for B-trees).
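To illustrate the sorted-directory point, here is a toy sketch (not littlefs code): an iterator over a sorted directory can track its position by the last name it returned rather than by raw index, so a concurrent remove never shifts untouched entries out from under the cursor.

// Toy illustration only, not littlefs code. With a sorted directory, an
// iterator can remember the last name it returned and resume at the first
// entry that sorts after it, so removed entries can't cause untouched
// entries to be skipped or repeated.
#include <stdio.h>
#include <string.h>

static const char *entries[8] = {"a", "b", "c", "d"};
static int count = 4;

// next entry sorting strictly after `last` (NULL means start of directory)
static const char *next_after(const char *last) {
    for (int i = 0; i < count; i++) {
        if (!last || strcmp(entries[i], last) > 0) {
            return entries[i];
        }
    }
    return NULL;
}

static void remove_entry(const char *name) {
    for (int i = 0; i < count; i++) {
        if (strcmp(entries[i], name) == 0) {
            memmove(&entries[i], &entries[i+1],
                    (count - i - 1) * sizeof(entries[0]));
            count--;
            return;
        }
    }
}

int main(void) {
    const char *cursor = next_after(NULL);   // "a"
    printf("%s\n", cursor);
    remove_entry("b");                       // concurrent removal
    while ((cursor = next_after(cursor))) {
        printf("%s\n", cursor);              // "c", then "d": nothing skipped
    }
    return 0;
}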
Thanks for the clarification and the fix. We modified our application to hold a lock while opendir is active so that other filesystem operations can't disturb the state and break it. Do you think once this fix is in we can remove this modification, or do you suspect there might be other issues lurking? To be honest, if we didn't have another bug that completely exploded everything when readdir returned an error, we probably would never have noticed this.
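(As an aside, the workaround described above might look roughly like the sketch below. This is a hedged illustration only: it assumes a pthread-style mutex, while the actual application presumably uses its own RTOS primitive, and list_dir_locked is a made-up helper name.)

#include <pthread.h>
#include "lfs.h"

static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

// hypothetical helper: hold the lock for the whole open/read/close
// sequence so no other task can remove files while iteration is in flight
int list_dir_locked(lfs_t *lfs, const char *path) {
    pthread_mutex_lock(&fs_lock);

    lfs_dir_t dir;
    int err = lfs_dir_open(lfs, &dir, path);
    if (err) {
        pthread_mutex_unlock(&fs_lock);
        return err;
    }

    struct lfs_info info;
    int res;
    while ((res = lfs_dir_read(lfs, &dir, &info)) > 0) {
        // ... process info.name / info.type ...
    }

    lfs_dir_close(lfs, &dir);
    pthread_mutex_unlock(&fs_lock);
    return res; // 0 at end of directory, negative on error
}

// Any task that removes files must take the same lock:
//   pthread_mutex_lock(&fs_lock);
//   lfs_remove(lfs, "dir/file");
//   pthread_mutex_unlock(&fs_lock);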
Does this mean that, in a bugless vacuum etc., littlefs is currently POSIX compliant in this respect and will always return all untouched entries? Or are tweaks to make this happen a future goal?
In a bugless vacuum, yes. This is the current state of things. I think it's also safe to say this won't change without a major API version bump, and it's unlikely to change without on-disk format changes.
Hmm, this is a difficult question for me to answer. I believe the concurrent remove logic is sound, but I also believed the previous logic was sound. At the end of the day you can only trust test results, and the space of possible test permutations for concurrent removes is quite large. The test added in #1068 tests all possible removes with all possible cursor locations in a directory with 10 files. Is that sufficient in addition to the existing dir tests? That's sort of up to you. The problem is there can always be more tests.

It may also be useful to have application-level tests that check whatever logic depends on the behavior of concurrent removes. This would help build confidence that everything underneath your application is working as intended, and would in theory be portable to other filesystems. Not that this replaces the need for littlefs-level tests. There can always be more tests.

It is surprising that this went unnoticed for so long. I suspect it's a mixture of most embedded applications not really doing complex directory operations, users ignoring these kinds of errors, or users assuming concurrent modification => LFS_ERR_CORRUPT is expected. The fact that LFS_ERR_CORRUPT is a false positive here probably didn't help.
I have an embedded application where one part of the application might be iterating a directory while another deletes files from it. The outcome of the iteration isn't meaningful to the application in this case, but sometimes lfs_dir_read returns LFS_ERR_CORRUPT. I would expect it to return some subset of the files in the directory (possibly none at all) but complete without any errors.

POSIX says it is unspecified whether a directory iteration reports an entry that is added to or removed from the directory while the iteration is in progress, but I believe it is otherwise expected to report every entry. I'm not sure LittleFS needs to go that far and guarantee that entries which aren't touched will be returned (it would be nice if it did), but it at least should not claim corruption.
Is there any actual corruption occurring? Can file operations safely be continued as long as the directory is closed after the error is reported? Can LittleFS be easily fixed to not claim corruption?
To put it another way, I would not expect the below tests to fail, but they do on v2.10.1 at least. Iteration shouldn't give an error if a file is removed before reads start, nor if multiple files are removed between reads.
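(The tests referred to here aren't reproduced in this capture. As a rough sketch of the scenario only, not the reporter's original test code, something along the following lines exercises both cases against the public littlefs API; it assumes an already-mounted lfs_t and uses made-up file names.)

#include <assert.h>
#include "lfs.h"

// sketch: assumes `lfs` is already mounted and the filesystem is empty
void concurrent_remove_sketch(lfs_t *lfs) {
    // create a directory with a handful of files
    assert(lfs_mkdir(lfs, "dir") == 0);
    const char *files[] = {"dir/a", "dir/b", "dir/c", "dir/d"};
    for (int i = 0; i < 4; i++) {
        lfs_file_t f;
        assert(lfs_file_open(lfs, &f, files[i],
                             LFS_O_WRONLY | LFS_O_CREAT) == 0);
        assert(lfs_file_close(lfs, &f) == 0);
    }

    lfs_dir_t dir;
    struct lfs_info info;
    assert(lfs_dir_open(lfs, &dir, "dir") == 0);

    // case 1: a file is removed after opendir but before any reads
    assert(lfs_remove(lfs, "dir/a") == 0);
    assert(lfs_dir_read(lfs, &dir, &info) >= 0);

    // case 2: multiple files are removed between reads
    assert(lfs_remove(lfs, "dir/c") == 0);
    assert(lfs_remove(lfs, "dir/d") == 0);

    // expectation: the remaining reads finish cleanly, returning 0 at the
    // end of the directory rather than LFS_ERR_CORRUPT
    int res;
    while ((res = lfs_dir_read(lfs, &dir, &info)) > 0) {
        // each returned info.name should be an entry that wasn't removed
    }
    assert(res == 0);
    assert(lfs_dir_close(lfs, &dir) == 0);
}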