Skip to content

Commit 952f048

Browse files
adam900710gregkh
authored andcommitted
btrfs: protect folio::private when attaching extent buffer folios
commit f3a5367 upstream. [BUG] Since v6.8 there are rare kernel crashes reported by various people, the common factor is bad page status error messages like this: BUG: Bad page state in process kswapd0 pfn:d6e840 page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c pfn:0xd6e840 aops:btree_aops ino:1 flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff) page_type: 0xffffffff() raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0 raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: non-NULL mapping [CAUSE] Commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") changes the sequence when allocating a new extent buffer. Previously we always called grab_extent_buffer() under mapping->i_private_lock, to ensure the safety on modification on folio::private (which is a pointer to extent buffer for regular sectorsize). This can lead to the following race: Thread A is trying to allocate an extent buffer at bytenr X, with 4 4K pages, meanwhile thread B is trying to release the page at X + 4K (the second page of the extent buffer at X). Thread A | Thread B -----------------------------------+------------------------------------- | btree_release_folio() | | This is for the page at X + 4K, | | Not page X. | | alloc_extent_buffer() | |- release_extent_buffer() |- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs) | page at bytenr X (the first | | | | page). | | | | Which returned -EEXIST. | | | | | | | |- filemap_lock_folio() | | | | Returned the first page locked. | | | | | | | |- grab_extent_buffer() | | | | |- atomic_inc_not_zero() | | | | | Returned false | | | | |- folio_detach_private() | | |- folio_detach_private() for X | |- folio_test_private() | | |- folio_test_private() | Returned true | | | Returned true |- folio_put() | |- folio_put() Now there are two puts on the same folio at folio X, leading to refcount underflow of the folio X, and eventually causing the BUG_ON() on the page->mapping. The condition is not that easy to hit: - The release must be triggered for the middle page of an eb If the release is on the same first page of an eb, page lock would kick in and prevent the race. - folio_detach_private() has a very small race window It's only between folio_test_private() and folio_clear_private(). That's exactly when mapping->i_private_lock is used to prevent such race, and commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") screwed that up. At that time, I thought the page lock would kick in as filemap_release_folio() also requires the page to be locked, but forgot the filemap_release_folio() only locks one page, not all pages of an extent buffer. [FIX] Move all the code requiring i_private_lock into attach_eb_folio_to_filemap(), so that everything is done with proper lock protection. Furthermore to prevent future problems, add an extra lockdep_assert_locked() to ensure we're holding the proper lock. To reproducer that is able to hit the race (takes a few minutes with instrumented code inserting delays to alloc_extent_buffer()): #!/bin/sh drop_caches () { while(true); do echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory done } run_tar () { while(true); do for x in `seq 1 80` ; do tar cf /dev/zero /mnt > /dev/null & done wait done } mkfs.btrfs -f -d single -m single /dev/vda mount -o noatime /dev/vda /mnt # create 200,000 files, 1K each ./simoop -n 200000 -E -f 1k /mnt drop_caches & (run_tar) Reported-by: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/ Reported-by: Mikhail Gavrilov <[email protected]> Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/ Reported-by: Toralf Förster <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reported-by: [email protected] Fixes: 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") CC: [email protected] # 6.8+ CC: Chris Mason <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
1 parent 522fb1a commit 952f048

File tree

1 file changed

+31
-29
lines changed

1 file changed

+31
-29
lines changed

fs/btrfs/extent_io.c

Lines changed: 31 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -3662,6 +3662,8 @@ static struct extent_buffer *grab_extent_buffer(
36623662
struct folio *folio = page_folio(page);
36633663
struct extent_buffer *exists;
36643664

3665+
lockdep_assert_held(&page->mapping->i_private_lock);
3666+
36653667
/*
36663668
* For subpage case, we completely rely on radix tree to ensure we
36673669
* don't try to insert two ebs for the same bytenr. So here we always
@@ -3729,13 +3731,14 @@ static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
37293731
* The caller needs to free the existing folios and retry using the same order.
37303732
*/
37313733
static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
3734+
struct btrfs_subpage *prealloc,
37323735
struct extent_buffer **found_eb_ret)
37333736
{
37343737

37353738
struct btrfs_fs_info *fs_info = eb->fs_info;
37363739
struct address_space *mapping = fs_info->btree_inode->i_mapping;
37373740
const unsigned long index = eb->start >> PAGE_SHIFT;
3738-
struct folio *existing_folio;
3741+
struct folio *existing_folio = NULL;
37393742
int ret;
37403743

37413744
ASSERT(found_eb_ret);
@@ -3747,12 +3750,14 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
37473750
ret = filemap_add_folio(mapping, eb->folios[i], index + i,
37483751
GFP_NOFS | __GFP_NOFAIL);
37493752
if (!ret)
3750-
return 0;
3753+
goto finish;
37513754

37523755
existing_folio = filemap_lock_folio(mapping, index + i);
37533756
/* The page cache only exists for a very short time, just retry. */
3754-
if (IS_ERR(existing_folio))
3757+
if (IS_ERR(existing_folio)) {
3758+
existing_folio = NULL;
37553759
goto retry;
3760+
}
37563761

37573762
/* For now, we should only have single-page folios for btree inode. */
37583763
ASSERT(folio_nr_pages(existing_folio) == 1);
@@ -3763,21 +3768,21 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
37633768
return -EAGAIN;
37643769
}
37653770

3766-
if (fs_info->nodesize < PAGE_SIZE) {
3767-
/*
3768-
* We're going to reuse the existing page, can drop our page
3769-
* and subpage structure now.
3770-
*/
3771+
finish:
3772+
spin_lock(&mapping->i_private_lock);
3773+
if (existing_folio && fs_info->nodesize < PAGE_SIZE) {
3774+
/* We're going to reuse the existing page, can drop our folio now. */
37713775
__free_page(folio_page(eb->folios[i], 0));
37723776
eb->folios[i] = existing_folio;
3773-
} else {
3777+
} else if (existing_folio) {
37743778
struct extent_buffer *existing_eb;
37753779

37763780
existing_eb = grab_extent_buffer(fs_info,
37773781
folio_page(existing_folio, 0));
37783782
if (existing_eb) {
37793783
/* The extent buffer still exists, we can use it directly. */
37803784
*found_eb_ret = existing_eb;
3785+
spin_unlock(&mapping->i_private_lock);
37813786
folio_unlock(existing_folio);
37823787
folio_put(existing_folio);
37833788
return 1;
@@ -3786,6 +3791,22 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
37863791
__free_page(folio_page(eb->folios[i], 0));
37873792
eb->folios[i] = existing_folio;
37883793
}
3794+
eb->folio_size = folio_size(eb->folios[i]);
3795+
eb->folio_shift = folio_shift(eb->folios[i]);
3796+
/* Should not fail, as we have preallocated the memory. */
3797+
ret = attach_extent_buffer_folio(eb, eb->folios[i], prealloc);
3798+
ASSERT(!ret);
3799+
/*
3800+
* To inform we have an extra eb under allocation, so that
3801+
* detach_extent_buffer_page() won't release the folio private when the
3802+
* eb hasn't been inserted into radix tree yet.
3803+
*
3804+
* The ref will be decreased when the eb releases the page, in
3805+
* detach_extent_buffer_page(). Thus needs no special handling in the
3806+
* error path.
3807+
*/
3808+
btrfs_folio_inc_eb_refs(fs_info, eb->folios[i]);
3809+
spin_unlock(&mapping->i_private_lock);
37893810
return 0;
37903811
}
37913812

@@ -3797,7 +3818,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
37973818
int attached = 0;
37983819
struct extent_buffer *eb;
37993820
struct extent_buffer *existing_eb = NULL;
3800-
struct address_space *mapping = fs_info->btree_inode->i_mapping;
38013821
struct btrfs_subpage *prealloc = NULL;
38023822
u64 lockdep_owner = owner_root;
38033823
bool page_contig = true;
@@ -3863,7 +3883,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
38633883
for (int i = 0; i < num_folios; i++) {
38643884
struct folio *folio;
38653885

3866-
ret = attach_eb_folio_to_filemap(eb, i, &existing_eb);
3886+
ret = attach_eb_folio_to_filemap(eb, i, prealloc, &existing_eb);
38673887
if (ret > 0) {
38683888
ASSERT(existing_eb);
38693889
goto out;
@@ -3900,24 +3920,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
39003920
* and free the allocated page.
39013921
*/
39023922
folio = eb->folios[i];
3903-
eb->folio_size = folio_size(folio);
3904-
eb->folio_shift = folio_shift(folio);
3905-
spin_lock(&mapping->i_private_lock);
3906-
/* Should not fail, as we have preallocated the memory */
3907-
ret = attach_extent_buffer_folio(eb, folio, prealloc);
3908-
ASSERT(!ret);
3909-
/*
3910-
* To inform we have extra eb under allocation, so that
3911-
* detach_extent_buffer_page() won't release the folio private
3912-
* when the eb hasn't yet been inserted into radix tree.
3913-
*
3914-
* The ref will be decreased when the eb released the page, in
3915-
* detach_extent_buffer_page().
3916-
* Thus needs no special handling in error path.
3917-
*/
3918-
btrfs_folio_inc_eb_refs(fs_info, folio);
3919-
spin_unlock(&mapping->i_private_lock);
3920-
39213923
WARN_ON(btrfs_folio_test_dirty(fs_info, folio, eb->start, eb->len));
39223924

39233925
/*

0 commit comments

Comments
 (0)