Skip to content

Gvrose fips legacy 8 compliant/4.18.0 425.13.1 #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

gvrose8192
Copy link

@gvrose8192 gvrose8192 commented Oct 21, 2024

Latest work on the FIPS 8 compliant kernel Various linked tasks:
VULN-429
VULN-4095
VULN-597
SECO-169
SECO-94

Kernel selftests have passed https://github.com/user-attachments/files/17512330/kernel-selftest.log with no change in results from previous run.

I've been running netfilter tests in a loop overnight - for i in {1..50000}; do sudo valgrind --log-file=valgrind-results$i.log ./run-tests.sh; done
Typical output, unchanged over any number runs.
nftables-test.log

Valgrind out put from any of hundreds of runs is all the same -
valgrind-results.log

With lockdep enabled and running sudo stress --cpu 28 --io 28 --vm 28 --vm-bytes 1G --timeout 3h I ran multiple passes of the nftables tests with valgrind: for i in {1..4}; do sudo valgrind --log-file=valgrind-results$i.log ./run-tests.sh; done

The nftables tests all pass with no difference from the original tests. Valgrind logs here:
valgrind-results1.log
valgrind-results2.log
valgrind-results3.log
valgrind-results4.log

No lockdep splats, any other splats, no OOMs, no panics, nor any other error messages during the run.

After pulling in the missing patch "netfilter: nf_tables: set backend .flush always succeeds" I reran the netfilter tests overnight with lockdep and kmemleak enabled in the kernel and running "sudo stress --cpu 28 --io 28 --vm 28 --vm-bytes 1G --timeout 8h" and found no issues. The logfiles are unchanged from the previous night's runs.

@gvrose8192 gvrose8192 requested a review from PlaidCat October 21, 2024 21:12
@gvrose8192
Copy link
Author

The 2nd commit is crap - the code is right but whacked the commit message metadata. I'll fix that up. Leaving the rest for your review.

@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch from 22f98fe to 3f2fa28 Compare October 21, 2024 21:23
@gvrose8192
Copy link
Author

The 2nd commit is crap - the code is right but whacked the commit message metadata. I'll fix that up. Leaving the rest for your review.

OK, fixed with a force push.

@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch 2 times, most recently from 8f50770 to 9cff1f7 Compare October 23, 2024 00:31
@gvrose8192
Copy link
Author

This PR is now ready for full review.
CVE's addressed by this PR:
CVE-2023-4244
CVE-2023-52581
CVE-2024-26925

Github actions build checks here:
https://github.com/ctrliq/kernel-src-tree/actions/runs/11505520259 Checked the PR for valid commit messages
https://github.com/ctrliq/kernel-src-tree/actions/runs/11505519609 Checked the compile/build for x86_64
https://github.com/ctrliq/kernel-src-tree/actions/runs/11505519601 Checked the compile/build for aarch64 - this is not really valid for fips8, but it demonstrates the code changes are portable.

Kernel selftest log shows no new errors or consistent discrepancies from the base kernel from before this PR. I.E things that failed before still fail, things that passed before, still pass.

kernel-selftest.log

What remains to do:

  1. This will require some close inspection of those commits marked with an 'upstream-diff' tag.
  2. Netfilter testsuite - will run it against the current fips-compliant8 branch and then against the same branch with this PR. Checking for no new errors. If something passes that didn't used to then that's great and I will note it in the PR conversation.

@PlaidCat This PR is ready for review. I'll be configuring and running the netfilter tests in parallel and record the results here.

@gvrose8192
Copy link
Author

Oh - totally forgot about this included patch: 5647beb

So that adds an additional CVE fixed by this PR - CVE-2024-39502

So we have 4 total CVEs addressed by this PR, not 3.

@gvrose8192
Copy link
Author

nft-test-results.log

Sample results from an nftables testsuite run on my dev system running Rocky 9.4. This was a sanity check to make sure I could actually build and install the nftables testsuite available here: https://git.netfilter.org/nftables/

Next step is collect the results from a run with the currently available fips8-compliant kernel and compare to the results when I run the kernel built from this PR.

Copy link
Collaborator

@PlaidCat PlaidCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: previous version invalid do to wrong reference

Copy link
Collaborator

@PlaidCat PlaidCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: previous version invalid do to wrong reference

@gvrose8192
Copy link
Author

"in nf_tables_commit under case NFT_MSG_NEWSETELEM: also uses nft_setelem_remove
https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8341

Same as above in __nf_tables_abort NFT_MSG_NEWSETELEM
https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8547"

Acked - requires more investigation.

@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch from 35ee0b2 to c7185b5 Compare October 26, 2024 18:18
@gvrose8192
Copy link
Author

Closing this pull request - will post an updated PR

@gvrose8192 gvrose8192 closed this Oct 28, 2024
@gvrose8192 gvrose8192 reopened this Oct 28, 2024
@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch 2 times, most recently from e82f39b to 240a26d Compare October 28, 2024 20:29
@gvrose8192
Copy link
Author

All kernel selftests continue to show no new errors or consistent discrepancies between the base version and with this patch series.
I am running the nftables testing run-tests.sh in continuous loop with valgrind. No memory leaks detected so far after hundreds of loops. The logs are too big to store but I ran a single loops manually and got the following results:

test-results.log
valgrind-results.log

I'll resume valgrind checking of the nftables nfct checks for an overnight run, make sure no long term (within a day) damage is found and to increase confidence in the PR.

Copy link
Collaborator

@PlaidCat PlaidCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout 240a26d

it should be jira VULN-835

@gvrose8192
Copy link
Author

For netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout 240a26d

it should be jira VULN-835

Good catch - I wondered about pulling that in with a different jira and meant to ask you but then got distracted by other work. I'll fix that up.

@gvrose8192
Copy link
Author

"in nf_tables_commit under case NFT_MSG_NEWSETELEM: also uses nft_setelem_remove https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8341

Same as above in __nf_tables_abort NFT_MSG_NEWSETELEM https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8547"

Acked - requires more investigation.

OK, yes. Found and fixed - just missed it in an otherwise large commit. Fix incoming with next branch force push.

Copy link
Collaborator

@PlaidCat PlaidCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subsystem-sync netfilter:nf_tables 4.18.0-553
These should be:
subsystem-sync netfilter:nf_tables 4.18.0-534

Copy link
Collaborator

@PlaidCat PlaidCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

netfilter: nf_tables: fix table flag updates 95b04bb

Is the upstream-diff here just contextual information in the fuzz?

github-actions bot pushed a commit that referenced this pull request May 27, 2025

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
…_USERCOPY=y crash

Borislav Petkov reported the following boot crash on x86-32,
with CONFIG_HARDENED_USERCOPY=y:

  |  usercopy: Kernel memory overwrite attempt detected to SLUB object 'task_struct' (offset 2112, size 160)!
  |  ...
  |  kernel BUG at mm/usercopy.c:102!

So the useroffset and usersize arguments are what control the allowed
window of copying in/out of the "task_struct" kmem cache:

        /* create a slab on which task_structs can be allocated */
        task_struct_whitelist(&useroffset, &usersize);
        task_struct_cachep = kmem_cache_create_usercopy("task_struct",
                        arch_task_struct_size, align,
                        SLAB_PANIC|SLAB_ACCOUNT,
                        useroffset, usersize, NULL);

task_struct_whitelist() positions this window based on the location of
the thread_struct within task_struct, and gets the arch-specific details
via arch_thread_struct_whitelist(offset, size):

	static void __init task_struct_whitelist(unsigned long *offset, unsigned long *size)
	{
		/* Fetch thread_struct whitelist for the architecture. */
		arch_thread_struct_whitelist(offset, size);

		/*
		 * Handle zero-sized whitelist or empty thread_struct, otherwise
		 * adjust offset to position of thread_struct in task_struct.
		 */
		if (unlikely(*size == 0))
			*offset = 0;
		else
			*offset += offsetof(struct task_struct, thread);
	}

Commit cb7ca40 ("x86/fpu: Make task_struct::thread constant size")
removed the logic for the window, leaving:

	static inline void
	arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
	{
		*offset = 0;
		*size = 0;
	}

So now there is no window that usercopy hardening will allow to be copied
in/out of task_struct.

But as reported above, there *is* a copy in copy_uabi_to_xstate(). (It
seems there are several, actually.)

	int copy_sigframe_from_user_to_xstate(struct task_struct *tsk,
					      const void __user *ubuf)
	{
		return copy_uabi_to_xstate(x86_task_fpu(tsk)->fpstate, NULL, ubuf, &tsk->thread.pkru);
	}

This appears to be writing into x86_task_fpu(tsk)->fpstate. With or
without CONFIG_X86_DEBUG_FPU, this resolves to:

	((struct fpu *)((void *)(task) + sizeof(*(task))))

i.e. the memory "after task_struct" is cast to "struct fpu", and the
uses the "fpstate" pointer. How that pointer gets set looks to be
variable, but I think the one we care about here is:

        fpu->fpstate = &fpu->__fpstate;

And struct fpu::__fpstate says:

        struct fpstate                  __fpstate;
        /*
         * WARNING: '__fpstate' is dynamically-sized.  Do not put
         * anything after it here.
         */

So we're still dealing with a dynamically sized thing, even if it's not
within the literal struct task_struct -- it's still in the kmem cache,
though.

Looking at the kmem cache size, it has allocated "arch_task_struct_size"
bytes, which is calculated in fpu__init_task_struct_size():

        int task_size = sizeof(struct task_struct);

        task_size += sizeof(struct fpu);

        /*
         * Subtract off the static size of the register state.
         * It potentially has a bunch of padding.
         */
        task_size -= sizeof(union fpregs_state);

        /*
         * Add back the dynamically-calculated register state
         * size.
         */
        task_size += fpu_kernel_cfg.default_size;

        /*
         * We dynamically size 'struct fpu', so we require that
         * 'state' be at the end of 'it:
         */
        CHECK_MEMBER_AT_END_OF(struct fpu, __fpstate);

        arch_task_struct_size = task_size;

So, this is still copying out of the kmem cache for task_struct, and the
window seems unchanged (still fpu regs). This is what the window was
before:

	void fpu_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
	{
		*offset = offsetof(struct thread_struct, fpu.__fpstate.regs);
		*size = fpu_kernel_cfg.default_size;
	}

And the same commit I mentioned above removed it.

I think the misunderstanding is here:

  | The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed,
  | now that the FPU structure is not embedded in the task struct anymore, which
  | reduces text footprint a bit.

Yes, FPU is no longer in task_struct, but it IS in the kmem cache named
"task_struct", since the fpstate is still being allocated there.

Partially revert the earlier mentioned commit, along with a
recalculation of the fpstate regs location.

Fixes: cb7ca40 ("x86/fpu: Make task_struct::thread constant size")
Reported-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]/ # Discussion #1
Link: https://lore.kernel.org/r/202505041418.F47130C4C8@keescook             # Discussion #2
github-actions bot pushed a commit that referenced this pull request May 27, 2025
Move hctx debugfs/sysfs register out of freezing queue in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:

	#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
	#1 (fs_reclaim){+.+.}-{0:0}:
	#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs

And registering/un-registering hctx debugfs/sysfs does not require queue to
be frozen:

- hctx sysfs attributes show() are drained when removing kobject, and
  there isn't store() implementation for hctx sysfs attributes

- debugfs entry read() is drained too when removing debugfs directory,
  and there isn't write() implementation for hctx debugfs too

- so it is safe to register/unregister hctx sysfs/debugfs without
  freezing queue because the cod paths changes nothing, and we just
  need to keep hctx live

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Nilay Shroff <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 27, 2025
…xit()

scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:

	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
	#4 (&q->elevator_lock){+.+.}-{4:4}:
	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
	#2 (fs_reclaim){+.+.}-{0:0}:
	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
	#0 (&q->debugfs_mutex){+.+.}-{4:4}:

Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().

Meantime add disk->rqos_state_mutex for covering wbt state change, which
matches the purpose more than ->elevator_lock.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Nilay Shroff <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 27, 2025
Amir Goldstein <[email protected]> says:

This adds a test for fanotify mount ns notifications inside userns [1].

While working on the test I ended up making lots of cleanups to reduce
build dependency on make headers_install.

These patches got rid of the dependency for my kvm setup for the
affected filesystems tests.

Building with TOOLS_INCLUDES dir was recommended by John Hubbard [2].

NOTE #1: these patches are based on a merge of vfs-6.16.mount
(changes wrappers.h) into v6.15-rc5 (changes mount-notify_test.c),
so if this cleanup is acceptable, we should probably setup a selftests
branch for 6.16, so that it can be used to test the fanotify patches.

NOTE #2: some of the defines in wrappers.h are left for overlayfs and
mount_setattr tests, which were not converted to use TOOLS_INCLUDES.
I did not want to mess with those tests.

* patches from https://lore.kernel.org/[email protected]:
  selftests/fs/mount-notify: add a test variant running inside userns
  selftests/filesystems: create setup_userns() helper
  selftests/filesystems: create get_unique_mnt_id() helper
  selftests/fs/mount-notify: build with tools include dir
  selftests/mount_settattr: remove duplicate syscall definitions
  selftests/pidfd: move syscall definitions into wrappers.h
  selftests/fs/statmount: build with tools include dir
  selftests/filesystems: move wrapper.h out of overlayfs subdir

Link: https://lore.kernel.org/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 28, 2025
ACPICA commit 1c28da2242783579d59767617121035dafba18c3

This was originally done in NetBSD:
NetBSD/src@b69d1ac
and is the correct alternative to the smattering of `memcpy`s I
previously contributed to this repository.

This also sidesteps the newly strict checks added in UBSAN:
llvm/llvm-project@7926744

Before this change we see the following UBSAN stack trace in Fuchsia:

  #0    0x000021afcfdeca5e in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:329 <platform-bus-x86.so>+0x6aca5e
  #1.2  0x000021982bc4af3c in ubsan_get_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:41 <libclang_rt.asan.so>+0x41f3c
  #1.1  0x000021982bc4af3c in maybe_print_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:51 <libclang_rt.asan.so>+0x41f3c
  #1    0x000021982bc4af3c in ~scoped_report() compiler-rt/lib/ubsan/ubsan_diag.cpp:395 <libclang_rt.asan.so>+0x41f3c
  #2    0x000021982bc4bb6f in handletype_mismatch_impl() compiler-rt/lib/ubsan/ubsan_handlers.cpp:137 <libclang_rt.asan.so>+0x42b6f
  #3    0x000021982bc4b723 in __ubsan_handle_type_mismatch_v1 compiler-rt/lib/ubsan/ubsan_handlers.cpp:142 <libclang_rt.asan.so>+0x42723
  #4    0x000021afcfdeca5e in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:329 <platform-bus-x86.so>+0x6aca5e
  #5    0x000021afcfdf2089 in acpi_rs_convert_aml_to_resource(struct acpi_resource*, union aml_resource*, struct acpi_rsconvert_info*) ../../third_party/acpica/source/components/resources/rsmisc.c:355 <platform-bus-x86.so>+0x6b2089
  #6    0x000021afcfded169 in acpi_rs_convert_aml_to_resources(u8*, u32, u32, u8, void**) ../../third_party/acpica/source/components/resources/rslist.c:137 <platform-bus-x86.so>+0x6ad169
  #7    0x000021afcfe2d24a in acpi_ut_walk_aml_resources(struct acpi_walk_state*, u8*, acpi_size, acpi_walk_aml_callback, void**) ../../third_party/acpica/source/components/utilities/utresrc.c:237 <platform-bus-x86.so>+0x6ed24a
  #8    0x000021afcfde66b7 in acpi_rs_create_resource_list(union acpi_operand_object*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rscreate.c:199 <platform-bus-x86.so>+0x6a66b7
  #9    0x000021afcfdf6979 in acpi_rs_get_method_data(acpi_handle, const char*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rsutils.c:770 <platform-bus-x86.so>+0x6b6979
  #10   0x000021afcfdf708f in acpi_walk_resources(acpi_handle, char*, acpi_walk_resource_callback, void*) ../../third_party/acpica/source/components/resources/rsxface.c:731 <platform-bus-x86.so>+0x6b708f
  #11   0x000021afcfa95dcf in acpi::acpi_impl::walk_resources(acpi::acpi_impl*, acpi_handle, const char*, acpi::Acpi::resources_callable) ../../src/devices/board/lib/acpi/acpi-impl.cc:41 <platform-bus-x86.so>+0x355dcf
  #12   0x000021afcfaa8278 in acpi::device_builder::gather_resources(acpi::device_builder*, acpi::Acpi*, fidl::any_arena&, acpi::Manager*, acpi::device_builder::gather_resources_callback) ../../src/devices/board/lib/acpi/device-builder.cc:84 <platform-bus-x86.so>+0x368278
  #13   0x000021afcfbddb87 in acpi::Manager::configure_discovered_devices(acpi::Manager*) ../../src/devices/board/lib/acpi/manager.cc:75 <platform-bus-x86.so>+0x49db87
  #14   0x000021afcf99091d in publish_acpi_devices(acpi::Manager*, zx_device_t*, zx_device_t*) ../../src/devices/board/drivers/x86/acpi-nswalk.cc:95 <platform-bus-x86.so>+0x25091d
  #15   0x000021afcf9c1d4e in x86::X86::do_init(x86::X86*) ../../src/devices/board/drivers/x86/x86.cc:60 <platform-bus-x86.so>+0x281d4e
  #16   0x000021afcf9e33ad in λ(x86::X86::ddk_init::(anon class)*) ../../src/devices/board/drivers/x86/x86.cc:77 <platform-bus-x86.so>+0x2a33ad
  #17   0x000021afcf9e313e in fit::internal::target<(lambda at../../src/devices/board/drivers/x86/x86.cc:76:19), false, false, std::__2::allocator<std::byte>, void>::invoke(void*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:183 <platform-bus-x86.so>+0x2a313e
  #18   0x000021afcfbab4c7 in fit::internal::function_base<16UL, false, void(), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<16UL, false, void (), std::__2::allocator<std::byte> >*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <platform-bus-x86.so>+0x46b4c7
  #19   0x000021afcfbab342 in fit::function_impl<16UL, false, void(), std::__2::allocator<std::byte>>::operator()(const fit::function_impl<16UL, false, void (), std::__2::allocator<std::byte> >*) ../../sdk/lib/fit/include/lib/fit/function.h:315 <platform-bus-x86.so>+0x46b342
  #20   0x000021afcfcd98c3 in async::internal::retained_task::Handler(async_dispatcher_t*, async_task_t*, zx_status_t) ../../sdk/lib/async/task.cc:24 <platform-bus-x86.so>+0x5998c3
  #21   0x00002290f9924616 in λ(const driver_runtime::Dispatcher::post_task::(anon class)*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/dispatcher.cc:789 <libdriver_runtime.so>+0x10a616
  #22   0x00002290f9924323 in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:788:7), true, false, std::__2::allocator<std::byte>, void, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int>::invoke(void*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0x10a323
  #23   0x00002290f9904b76 in fit::internal::function_base<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <libdriver_runtime.so>+0xeab76
  #24   0x00002290f9904831 in fit::callback_impl<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int), std::__2::allocator<std::byte>>::operator()(fit::callback_impl<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/function.h:471 <libdriver_runtime.so>+0xea831
  #25   0x00002290f98d5adc in driver_runtime::callback_request::Call(driver_runtime::callback_request*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/callback_request.h:74 <libdriver_runtime.so>+0xbbadc
  #26   0x00002290f98e1e58 in driver_runtime::Dispatcher::dispatch_callback(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >) ../../src/devices/bin/driver_runtime/dispatcher.cc:1248 <libdriver_runtime.so>+0xc7e58
  #27   0x00002290f98e4159 in driver_runtime::Dispatcher::dispatch_callbacks(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:1308 <libdriver_runtime.so>+0xca159
  #28   0x00002290f9918414 in λ(const driver_runtime::Dispatcher::create_with_adder::(anon class)*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:353 <libdriver_runtime.so>+0xfe414
  #29   0x00002290f991812d in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:351:7), true, false, std::__2::allocator<std::byte>, void, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>>::invoke(void*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0xfe12d
  #30   0x00002290f9906fc7 in fit::internal::function_base<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <libdriver_runtime.so>+0xecfc7
  #31   0x00002290f9906c66 in fit::function_impl<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte>>::operator()(const fit::function_impl<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/function.h:315 <libdriver_runtime.so>+0xecc66
  #32   0x00002290f98e73d9 in driver_runtime::Dispatcher::event_waiter::invoke_callback(driver_runtime::Dispatcher::event_waiter*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.h:543 <libdriver_runtime.so>+0xcd3d9
  #33   0x00002290f98e700d in driver_runtime::Dispatcher::event_waiter::handle_event(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/dispatcher.cc:1442 <libdriver_runtime.so>+0xcd00d
  #34   0x00002290f9918983 in async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event(async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>*, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/async_loop_owned_event_handler.h:59 <libdriver_runtime.so>+0xfe983
  #35   0x00002290f9918b9e in async::wait_method<async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>, &async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event>::call_handler(async_dispatcher_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../sdk/lib/async/include/lib/async/cpp/wait.h:201 <libdriver_runtime.so>+0xfeb9e
  #36   0x00002290f99bf509 in async_loop_dispatch_wait(async_loop_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../sdk/lib/async-loop/loop.c:394 <libdriver_runtime.so>+0x1a5509
  #37   0x00002290f99b9958 in async_loop_run_once(async_loop_t*, zx_time_t) ../../sdk/lib/async-loop/loop.c:343 <libdriver_runtime.so>+0x19f958
  #38   0x00002290f99b9247 in async_loop_run(async_loop_t*, zx_time_t, _Bool) ../../sdk/lib/async-loop/loop.c:301 <libdriver_runtime.so>+0x19f247
  #39   0x00002290f99ba962 in async_loop_run_thread(void*) ../../sdk/lib/async-loop/loop.c:860 <libdriver_runtime.so>+0x1a0962
  #40   0x000041afd176ef30 in start_c11(void*) ../../zircon/third_party/ulib/musl/pthread/pthread_create.c:63 <libc.so>+0x84f30
  #41   0x000041afd18a448d in thread_trampoline(uintptr_t, uintptr_t) ../../zircon/system/ulib/runtime/thread.cc:100 <libc.so>+0x1ba48d

Link: acpica/acpica@1c28da22
Signed-off-by: Rafael J. Wysocki <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Tamir Duberstein <[email protected]>
[ rjw: Pick up the tag from Tamir ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 28, 2025
Lockdep reports a possible circular locking dependency [1] when
cpu_hotplug_lock is acquired inside store_local_boost(), after
policy->rwsem has already been taken by store().

However, the boost update is strictly per-policy and does not
access shared state or iterate over all policies.

Since policy->rwsem is already held, this is enough to serialize
against concurrent topology changes for the current policy.

Remove the cpus_read_lock() to resolve the lockdep warning and
avoid unnecessary locking.

 [1]
 ======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc6-debug-gb01fc4eca73c #1 Not tainted
 ------------------------------------------------------
 power-profiles-/588 is trying to acquire lock:
 ffffffffb3a7d910 (cpu_hotplug_lock){++++}-{0:0}, at: store_local_boost+0x56/0xd0

 but task is already holding lock:
 ffff8b6e5a12c380 (&policy->rwsem){++++}-{4:4}, at: store+0x37/0x90

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (&policy->rwsem){++++}-{4:4}:
        down_write+0x29/0xb0
        cpufreq_online+0x7e8/0xa40
        cpufreq_add_dev+0x82/0xa0
        subsys_interface_register+0x148/0x160
        cpufreq_register_driver+0x15d/0x260
        amd_pstate_register_driver+0x36/0x90
        amd_pstate_init+0x1e7/0x270
        do_one_initcall+0x68/0x2b0
        kernel_init_freeable+0x231/0x270
        kernel_init+0x15/0x130
        ret_from_fork+0x2c/0x50
        ret_from_fork_asm+0x11/0x20

 -> #1 (subsys mutex#3){+.+.}-{4:4}:
        __mutex_lock+0xc2/0x930
        subsys_interface_register+0x7f/0x160
        cpufreq_register_driver+0x15d/0x260
        amd_pstate_register_driver+0x36/0x90
        amd_pstate_init+0x1e7/0x270
        do_one_initcall+0x68/0x2b0
        kernel_init_freeable+0x231/0x270
        kernel_init+0x15/0x130
        ret_from_fork+0x2c/0x50
        ret_from_fork_asm+0x11/0x20

 -> #0 (cpu_hotplug_lock){++++}-{0:0}:
        __lock_acquire+0x10ed/0x1850
        lock_acquire.part.0+0x69/0x1b0
        cpus_read_lock+0x2a/0xc0
        store_local_boost+0x56/0xd0
        store+0x50/0x90
        kernfs_fop_write_iter+0x132/0x200
        vfs_write+0x2b3/0x590
        ksys_write+0x74/0xf0
        do_syscall_64+0xbb/0x1d0
        entry_SYSCALL_64_after_hwframe+0x56/0x5e

Signed-off-by: Seyediman Seyedarab <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Rafael J. Wysocki <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 29, 2025
JIRA: https://issues.redhat.com/browse/RHEL-79791
CVE: CVE-2024-56607

commit 8fac326
Author: Kalle Valo <[email protected]>
Date:   Mon Oct 7 19:59:27 2024 +0300

    wifi: ath12k: fix atomic calls in ath12k_mac_op_set_bitrate_mask()
    
    When I try to manually set bitrates:
    
    iw wlan0 set bitrates legacy-2.4 1
    
    I get sleeping from invalid context error, see below. Fix that by switching to
    use recently introduced ieee80211_iterate_stations_mtx().
    
    Do note that WCN6855 firmware is still crashing, I'm not sure if that firmware
    even supports bitrate WMI commands and should we consider disabling
    ath12k_mac_op_set_bitrate_mask() for WCN6855? But that's for another patch.
    
    BUG: sleeping function called from invalid context at drivers/net/wireless/ath/ath12k/wmi.c:420
    in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 2236, name: iw
    preempt_count: 0, expected: 0
    RCU nest depth: 1, expected: 0
    3 locks held by iw/2236:
     #0: ffffffffabc6f1d8 (cb_lock){++++}-{3:3}, at: genl_rcv+0x14/0x40
     #1: ffff888138410810 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: nl80211_pre_doit+0x54d/0x800 [cfg80211]
     #2: ffffffffab2cfaa0 (rcu_read_lock){....}-{1:2}, at: ieee80211_iterate_stations_atomic+0x2f/0x200 [mac80211]
    CPU: 3 UID: 0 PID: 2236 Comm: iw Not tainted 6.11.0-rc7-wt-ath+ #1772
    Hardware name: Intel(R) Client Systems NUC8i7HVK/NUC8i7HVB, BIOS HNKBLi70.86A.0067.2021.0528.1339 05/28/2021
    Call Trace:
     <TASK>
     dump_stack_lvl+0xa4/0xe0
     dump_stack+0x10/0x20
     __might_resched+0x363/0x5a0
     ? __alloc_skb+0x165/0x340
     __might_sleep+0xad/0x160
     ath12k_wmi_cmd_send+0xb1/0x3d0 [ath12k]
     ? ath12k_wmi_init_wcn7850+0xa40/0xa40 [ath12k]
     ? __netdev_alloc_skb+0x45/0x7b0
     ? __asan_memset+0x39/0x40
     ? ath12k_wmi_alloc_skb+0xf0/0x150 [ath12k]
     ? reacquire_held_locks+0x4d0/0x4d0
     ath12k_wmi_set_peer_param+0x340/0x5b0 [ath12k]
     ath12k_mac_disable_peer_fixed_rate+0xa3/0x110 [ath12k]
     ? ath12k_mac_vdev_stop+0x4f0/0x4f0 [ath12k]
     ieee80211_iterate_stations_atomic+0xd4/0x200 [mac80211]
     ath12k_mac_op_set_bitrate_mask+0x5d2/0x1080 [ath12k]
     ? ath12k_mac_vif_chan+0x320/0x320 [ath12k]
     drv_set_bitrate_mask+0x267/0x470 [mac80211]
     ieee80211_set_bitrate_mask+0x4cc/0x8a0 [mac80211]
     ? __this_cpu_preempt_check+0x13/0x20
     nl80211_set_tx_bitrate_mask+0x2bc/0x530 [cfg80211]
     ? nl80211_parse_tx_bitrate_mask+0x2320/0x2320 [cfg80211]
     ? trace_contention_end+0xef/0x140
     ? rtnl_unlock+0x9/0x10
     ? nl80211_pre_doit+0x557/0x800 [cfg80211]
     genl_family_rcv_msg_doit+0x1f0/0x2e0
     ? genl_family_rcv_msg_attrs_parse.isra.0+0x250/0x250
     ? ns_capable+0x57/0xd0
     genl_family_rcv_msg+0x34c/0x600
     ? genl_family_rcv_msg_dumpit+0x310/0x310
     ? __lock_acquire+0xc62/0x1de0
     ? he_set_mcs_mask.isra.0+0x8d0/0x8d0 [cfg80211]
     ? nl80211_parse_tx_bitrate_mask+0x2320/0x2320 [cfg80211]
     ? cfg80211_external_auth_request+0x690/0x690 [cfg80211]
     genl_rcv_msg+0xa0/0x130
     netlink_rcv_skb+0x14c/0x400
     ? genl_family_rcv_msg+0x600/0x600
     ? netlink_ack+0xd70/0xd70
     ? rwsem_optimistic_spin+0x4f0/0x4f0
     ? genl_rcv+0x14/0x40
     ? down_read_killable+0x580/0x580
     ? netlink_deliver_tap+0x13e/0x350
     ? __this_cpu_preempt_check+0x13/0x20
     genl_rcv+0x23/0x40
     netlink_unicast+0x45e/0x790
     ? netlink_attachskb+0x7f0/0x7f0
     netlink_sendmsg+0x7eb/0xdb0
     ? netlink_unicast+0x790/0x790
     ? __this_cpu_preempt_check+0x13/0x20
     ? selinux_socket_sendmsg+0x31/0x40
     ? netlink_unicast+0x790/0x790
     __sock_sendmsg+0xc9/0x160
     ____sys_sendmsg+0x620/0x990
     ? kernel_sendmsg+0x30/0x30
     ? __copy_msghdr+0x410/0x410
     ? __kasan_check_read+0x11/0x20
     ? mark_lock+0xe6/0x1470
     ___sys_sendmsg+0xe9/0x170
     ? copy_msghdr_from_user+0x120/0x120
     ? __lock_acquire+0xc62/0x1de0
     ? do_fault_around+0x2c6/0x4e0
     ? do_user_addr_fault+0x8c1/0xde0
     ? reacquire_held_locks+0x220/0x4d0
     ? do_user_addr_fault+0x8c1/0xde0
     ? __kasan_check_read+0x11/0x20
     ? __fdget+0x4e/0x1d0
     ? sockfd_lookup_light+0x1a/0x170
     __sys_sendmsg+0xd2/0x180
     ? __sys_sendmsg_sock+0x20/0x20
     ? reacquire_held_locks+0x4d0/0x4d0
     ? debug_smp_processor_id+0x17/0x20
     __x64_sys_sendmsg+0x72/0xb0
     ? lockdep_hardirqs_on+0x7d/0x100
     x64_sys_call+0x894/0x9f0
     do_syscall_64+0x64/0x130
     entry_SYSCALL_64_after_hwframe+0x4b/0x53
    RIP: 0033:0x7f230fe04807
    Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
    RSP: 002b:00007ffe996a7ea8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000556f9f9c3390 RCX: 00007f230fe04807
    RDX: 0000000000000000 RSI: 00007ffe996a7ee0 RDI: 0000000000000003
    RBP: 0000556f9f9c88c0 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000556f965ca190 R11: 0000000000000246 R12: 0000556f9f9c8780
    R13: 00007ffe996a7ee0 R14: 0000556f9f9c87d0 R15: 0000556f9f9c88c0
     </TASK>
    
    Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
    
    Signed-off-by: Kalle Valo <[email protected]>
    Link: https://patch.msgid.link/[email protected]
    Signed-off-by: Jeff Johnson <[email protected]>

Signed-off-by: Jose Ignacio Tornos Martinez <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 30, 2025
JIRA: https://issues.redhat.com/browse/RHEL-75959
Upstream Status: v6.14-rc1

commit e4b6b66
Author:     Aaro Koskinen <[email protected]>
AuthorDate: Thu Jan  2 20:19:51 2025 +0200
Commit:     Helge Deller <[email protected]>
CommitDate: Thu Jan  9 00:37:54 2025 +0100

    When using touchscreen and framebuffer, Nokia 770 crashes easily with:

        BUG: scheduling while atomic: irq/144-ads7846/82/0x00010000
        Modules linked in: usb_f_ecm g_ether usb_f_rndis u_ether libcomposite configfs omap_udc ohci_omap ohci_hcd
        CPU: 0 UID: 0 PID: 82 Comm: irq/144-ads7846 Not tainted 6.12.7-770 #2
        Hardware name: Nokia 770
        Call trace:
         unwind_backtrace from show_stack+0x10/0x14
         show_stack from dump_stack_lvl+0x54/0x5c
         dump_stack_lvl from __schedule_bug+0x50/0x70
         __schedule_bug from __schedule+0x4d4/0x5bc
         __schedule from schedule+0x34/0xa0
         schedule from schedule_preempt_disabled+0xc/0x10
         schedule_preempt_disabled from __mutex_lock.constprop.0+0x218/0x3b4
         __mutex_lock.constprop.0 from clk_prepare_lock+0x38/0xe4
         clk_prepare_lock from clk_set_rate+0x18/0x154
         clk_set_rate from sossi_read_data+0x4c/0x168
         sossi_read_data from hwa742_read_reg+0x5c/0x8c
         hwa742_read_reg from send_frame_handler+0xfc/0x300
         send_frame_handler from process_pending_requests+0x74/0xd0
         process_pending_requests from lcd_dma_irq_handler+0x50/0x74
         lcd_dma_irq_handler from __handle_irq_event_percpu+0x44/0x130
         __handle_irq_event_percpu from handle_irq_event+0x28/0x68
         handle_irq_event from handle_level_irq+0x9c/0x170
         handle_level_irq from generic_handle_domain_irq+0x2c/0x3c
         generic_handle_domain_irq from omap1_handle_irq+0x40/0x8c
         omap1_handle_irq from generic_handle_arch_irq+0x28/0x3c
         generic_handle_arch_irq from call_with_stack+0x1c/0x24
         call_with_stack from __irq_svc+0x94/0xa8
        Exception stack(0xc5255da0 to 0xc5255de8)
        5da0: 00000001 c22fc620 00000000 00000000 c08384a8 c106fc00 00000000 c240c248
        5dc0: c113a600 c3f6ec30 00000001 00000000 c22fc620 c5255df0 c22fc620 c0279a94
        5de0: 60000013 ffffffff
         __irq_svc from clk_prepare_lock+0x4c/0xe4
         clk_prepare_lock from clk_get_rate+0x10/0x74
         clk_get_rate from uwire_setup_transfer+0x40/0x180
         uwire_setup_transfer from spi_bitbang_transfer_one+0x2c/0x9c
         spi_bitbang_transfer_one from spi_transfer_one_message+0x2d0/0x664
         spi_transfer_one_message from __spi_pump_transfer_message+0x29c/0x498
         __spi_pump_transfer_message from __spi_sync+0x1f8/0x2e8
         __spi_sync from spi_sync+0x24/0x40
         spi_sync from ads7846_halfd_read_state+0x5c/0x1c0
         ads7846_halfd_read_state from ads7846_irq+0x58/0x348
         ads7846_irq from irq_thread_fn+0x1c/0x78
         irq_thread_fn from irq_thread+0x120/0x228
         irq_thread from kthread+0xc8/0xe8
         kthread from ret_from_fork+0x14/0x28

    As a quick fix, switch to a threaded IRQ which provides a stable system.

    Signed-off-by: Aaro Koskinen <[email protected]>
    Reviewed-by: Linus Walleij <[email protected]>
    Signed-off-by: Helge Deller <[email protected]>

Signed-off-by: Jocelyn Falempe <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 30, 2025
Intel TDX protects guest VM's from malicious host and certain physical
attacks.  TDX introduces a new operation mode, Secure Arbitration Mode
(SEAM) to isolate and protect guest VM's.  A TDX guest VM runs in SEAM and,
unlike VMX, direct control and interaction with the guest by the host VMM
is not possible.  Instead, Intel TDX Module, which also runs in SEAM,
provides a SEAMCALL API.

The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
VMLAUNCH/VMRESUME instructions.  When a guest VM-exit requires host VMM
interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).

Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER;
tdh_vp_enter() needs to be noinstr because VM entry in KVM is noinstr
as well, which is for two reasons:
* marking the area as CT_STATE_GUEST via guest_state_enter_irqoff() and
  guest_state_exit_irqoff()
* IRET must be avoided between VM-exit and NMI handling, in order to
  avoid prematurely releasing the NMI inhibit.

TDH.VP.ENTER is different from other SEAMCALLs in several ways: it
uses more arguments, and after it returns some host state may need to be
restored.  Therefore tdh_vp_enter() uses __seamcall_saved_ret() instead of
__seamcall_ret(); since it is the only caller of __seamcall_saved_ret(),
it can be made noinstr also.

TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
Additionally, XMM registers are not required for the existing Guest
Hypervisor Communication Interface and are handled by existing KVM code
should they be modified by the guest.

There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
Input #1 : Initial entry or following a previous async. TD Exit
Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
Output #1 : On Error (No TD Entry)
Output #2 : Async. Exits with a VMX Architectural Exit Reason
Output #3 : Async. Exits with a non-VMX TD Exit Status
Output #4 : Async. Exits with Cross-TD Exit Details
Output #5 : On TDCALL(TDG.VP.VMCALL)

Currently, to keep things simple, the wrapper function does not attempt
to support different formats, and just passes all the GPRs that could be
used.  The GPR values are held by KVM in the area set aside for guest
GPRs.  KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
or process results of tdh_vp_enter().

Therefore changing tdh_vp_enter() to use more complex argument formats
would also alter the way KVM code interacts with tdh_vp_enter().

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Adrian Hunter <[email protected]>
Message-ID: <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 30, 2025
[ Upstream commit 1b9366c ]

If waiting for gpu reset done in KFD release_work, thers is WARNING:
possible circular locking dependency detected

  #2  kfd_create_process
        kfd_process_mutex
          flush kfd release work

  #1  kfd release work
        wait for amdgpu reset work

  #0  amdgpu_device_gpu_reset
        kgd2kfd_pre_reset
          kfd_process_mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock((work_completion)(&p->release_work));
                  lock((wq_completion)kfd_process_wq);
                  lock((work_completion)(&p->release_work));
   lock((wq_completion)amdgpu-reset-dev);

To fix this, KFD create process move flush release work outside
kfd_process_mutex.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 30, 2025
[ Upstream commit 88f7f56 ]

When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().

An example from v5.4, similar problem also exists in upstream:

    crash> bt 2091206
    PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
     #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
     #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
     #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
     #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
     #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
     #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
     #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
     #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
     #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
     #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
    #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
    #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
    #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
    #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
    #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
    #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
    #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
    #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
    #18 [ffff800084a2fe70] kthread at ffff800040118de4

After commit 2def284 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.

Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().

Signed-off-by: Jinliang Zheng <[email protected]>
Reviewed-by: Tianxiang Peng <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 31, 2025
…mage

WARNING: CPU: 1 PID: 9426 at fs/inode.c:417 drop_nlink+0xac/0xd0
home/cc/linux/fs/inode.c:417
Modules linked in:
CPU: 1 UID: 0 PID: 9426 Comm: syz-executor568 Not tainted
6.14.0-12627-g94d471a4f428 #2 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:drop_nlink+0xac/0xd0 home/cc/linux/fs/inode.c:417
Code: 48 8b 5d 28 be 08 00 00 00 48 8d bb 70 07 00 00 e8 f9 67 e6 ff
f0 48 ff 83 70 07 00 00 5b 5d e9 9a 12 82 ff e8 95 12 82 ff 90
&lt;0f&gt; 0b 90 c7 45 48 ff ff ff ff 5b 5d e9 83 12 82 ff e8 fe 5f e6
ff
RSP: 0018:ffffc900026b7c28 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8239710f
RDX: ffff888041345a00 RSI: ffffffff8239717b RDI: 0000000000000005
RBP: ffff888054509ad0 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff9ab36f08 R12: ffff88804bb40000
R13: ffff8880545091e0 R14: 0000000000008000 R15: ffff8880545091e0
FS:  000055555d0c5880(0000) GS:ffff8880eb3e3000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f915c55b178 CR3: 0000000050d20000 CR4: 0000000000352ef0
Call Trace:
 <task>
 f2fs_i_links_write home/cc/linux/fs/f2fs/f2fs.h:3194 [inline]
 f2fs_drop_nlink+0xd1/0x3c0 home/cc/linux/fs/f2fs/dir.c:845
 f2fs_delete_entry+0x542/0x1450 home/cc/linux/fs/f2fs/dir.c:909
 f2fs_unlink+0x45c/0x890 home/cc/linux/fs/f2fs/namei.c:581
 vfs_unlink+0x2fb/0x9b0 home/cc/linux/fs/namei.c:4544
 do_unlinkat+0x4c5/0x6a0 home/cc/linux/fs/namei.c:4608
 __do_sys_unlink home/cc/linux/fs/namei.c:4654 [inline]
 __se_sys_unlink home/cc/linux/fs/namei.c:4652 [inline]
 __x64_sys_unlink+0xc5/0x110 home/cc/linux/fs/namei.c:4652
 do_syscall_x64 home/cc/linux/arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc7/0x250 home/cc/linux/arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb3d092324b
Code: 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66
2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 57 00 00 00 0f 05
&lt;48&gt; 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01
48
RSP: 002b:00007ffdc232d938 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb3d092324b
RDX: 00007ffdc232d960 RSI: 00007ffdc232d960 RDI: 00007ffdc232d9f0
RBP: 00007ffdc232d9f0 R08: 0000000000000001 R09: 00007ffdc232d7c0
R10: 00000000fffffffd R11: 0000000000000206 R12: 00007ffdc232eaf0
R13: 000055555d0cebb0 R14: 00007ffdc232d958 R15: 0000000000000001
 </task>

Cc: [email protected]
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 1, 2025
Running a modified trace-cmd record --nosplice where it does a mmap of the
ring buffer when '--nosplice' is set, caused the following lockdep splat:

 ======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
 ------------------------------------------------------
 trace-cmd/1113 is trying to acquire lock:
 ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70

 but task is already holding lock:
 ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0xcf/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #4 (&mm->mmap_lock){++++}-{4:4}:
        __might_fault+0xa5/0x110
        _copy_to_user+0x22/0x80
        _perf_ioctl+0x61b/0x1b70
        perf_ioctl+0x62/0x90
        __x64_sys_ioctl+0x134/0x190
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #3 (&cpuctx_mutex){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0x325/0x7c0
        perf_event_init+0x52a/0x5b0
        start_kernel+0x263/0x3e0
        x86_64_start_reservations+0x24/0x30
        x86_64_start_kernel+0x95/0xa0
        common_startup_64+0x13e/0x141

 -> #2 (pmus_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0xb7/0x7c0
        cpuhp_invoke_callback+0x2c0/0x1030
        __cpuhp_invoke_callback_range+0xbf/0x1f0
        _cpu_up+0x2e7/0x690
        cpu_up+0x117/0x170
        cpuhp_bringup_mask+0xd5/0x120
        bringup_nonboot_cpus+0x13d/0x170
        smp_init+0x2b/0xf0
        kernel_init_freeable+0x441/0x6d0
        kernel_init+0x1e/0x160
        ret_from_fork+0x34/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #1 (cpu_hotplug_lock){++++}-{0:0}:
        cpus_read_lock+0x2a/0xd0
        ring_buffer_resize+0x610/0x14e0
        __tracing_resize_ring_buffer.part.0+0x42/0x120
        tracing_set_tracer+0x7bd/0xa80
        tracing_set_trace_write+0x132/0x1e0
        vfs_write+0x21c/0xe80
        ksys_write+0xf9/0x1c0
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #0 (&buffer->mutex){+.+.}-{4:4}:
        __lock_acquire+0x1405/0x2210
        lock_acquire+0x174/0x310
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0x11c/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 other info that might help us debug this:

 Chain exists of:
   &buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&cpu_buffer->mapping_lock);
                                lock(&mm->mmap_lock);
                                lock(&cpu_buffer->mapping_lock);
   lock(&buffer->mutex);

  *** DEADLOCK ***

 2 locks held by trace-cmd/1113:
  #0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
  #1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 stack backtrace:
 CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6e/0xa0
  print_circular_bug.cold+0x178/0x1be
  check_noncircular+0x146/0x160
  __lock_acquire+0x1405/0x2210
  lock_acquire+0x174/0x310
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x169/0x18c0
  __mutex_lock+0x192/0x18c0
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? function_trace_call+0x296/0x370
  ? __pfx___mutex_lock+0x10/0x10
  ? __pfx_function_trace_call+0x10/0x10
  ? __pfx___mutex_lock+0x10/0x10
  ? _raw_spin_unlock+0x2d/0x50
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x5/0x18c0
  ring_buffer_map+0x11c/0xe70
  ? do_raw_spin_lock+0x12d/0x270
  ? find_held_lock+0x2b/0x80
  ? _raw_spin_unlock+0x2d/0x50
  ? rcu_is_watching+0x15/0xb0
  ? _raw_spin_unlock+0x2d/0x50
  ? trace_preempt_on+0xd0/0x110
  tracing_buffers_mmap+0x1c4/0x3b0
  __mmap_region+0xd8d/0x1f70
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx___mmap_region+0x10/0x10
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? bpf_lsm_mmap_addr+0x4/0x10
  ? security_mmap_addr+0x46/0xd0
  ? lock_is_held_type+0xd9/0x130
  do_mmap+0x9d7/0x1010
  ? 0xffffffffc0370095
  ? __pfx_do_mmap+0x10/0x10
  vm_mmap_pgoff+0x20b/0x390
  ? __pfx_vm_mmap_pgoff+0x10/0x10
  ? 0xffffffffc0370095
  ksys_mmap_pgoff+0x2e9/0x440
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7fb0963a7de2
 Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
 RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
 RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
 RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
  </TASK>

The issue is that cpus_read_lock() is taken within buffer->mutex. The
memory mapped pages are taken with the mmap_lock held. The buffer->mutex
is taken within the cpu_buffer->mapping_lock. There's quite a chain with
all these locks, where the deadlock can be fixed by moving the
cpus_read_lock() outside the taking of the buffer->mutex.

Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Vincent Donnefort <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: 117c392 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 3, 2025
Restore KVM's handling of a NULL kvm_x86_ops.mem_enc_ioctl, as the hook is
NULL on SVM when CONFIG_KVM_AMD_SEV=n, and TDX will soon follow suit.

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/kvm-x86-ops.h:130 kvm_x86_vendor_init+0x178b/0x18e0
  Modules linked in:
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.15.0-rc2-dc1aead1a985-sink-vm #2 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_x86_vendor_init+0x178b/0x18e0
  Call Trace:
   <TASK>
   svm_init+0x2e/0x60
   do_one_initcall+0x56/0x290
   kernel_init_freeable+0x192/0x1e0
   kernel_init+0x16/0x130
   ret_from_fork+0x30/0x50
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

Opportunistically drop the superfluous curly braces.

Link: https://lore.kernel.org/all/[email protected]
Fixes: b2aaf38 ("KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 3, 2025
Despite the fact that several lockdep-related checks are skipped when
calling trylock* versions of the locking primitives, for example
mutex_trylock, each time the mutex is acquired, a held_lock is still
placed onto the lockdep stack by __lock_acquire() which is called
regardless of whether the trylock* or regular locking API was used.

This means that if the caller successfully acquires more than
MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
lockdep will still complain that the maximum depth of the held lock stack
has been reached and disable itself.

For example, the following error currently occurs in the ARM version
of KVM, once the code tries to lock all vCPUs of a VM configured with more
than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
systems, where having more than 48 CPUs is common, and it's also common to
run VMs that have vCPU counts approaching that number:

[  328.171264] BUG: MAX_LOCK_DEPTH too low!
[  328.175227] turning off the locking correctness validator.
[  328.180726] Please attach the output of /proc/lock_stat to the bug report
[  328.187531] depth: 48  max: 48!
[  328.190678] 48 locks held by qemu-kvm/11664:
[  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

Luckily, in all instances that require locking all vCPUs, the
'kvm->lock' is taken a priori, and that fact makes it possible to use
the little known feature of lockdep, called a 'nest_lock', to avoid this
warning and subsequent lockdep self-disablement.

The action of 'nested lock' being provided to lockdep's lock_acquire(),
causes the lockdep to detect that the top of the held lock stack contains
a lock of the same class and then increment its reference counter instead
of pushing a new held_lock item onto that stack.

See __lock_acquire for more information.

Signed-off-by: Maxim Levitsky <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 3, 2025
Use kvm_trylock_all_vcpus instead of a custom implementation when locking
all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.

This fixes the following false lockdep warning:

[  328.171264] BUG: MAX_LOCK_DEPTH too low!
[  328.175227] turning off the locking correctness validator.
[  328.180726] Please attach the output of /proc/lock_stat to the bug report
[  328.187531] depth: 48  max: 48!
[  328.190678] 48 locks held by qemu-kvm/11664:
[  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-83595

commit 9730763
Author: Nilay Shroff <[email protected]>
Date:   Wed Mar 19 16:23:46 2025 +0530

    block: correct locking order for protecting blk-wbt parameters

    The commit '245618f8e45f ("block: protect wbt_lat_usec using q->
    elevator_lock")' introduced q->elevator_lock to protect updates
    to blk-wbt parameters when writing to the sysfs attribute wbt_
    lat_usec and the cgroup attribute io.cost.qos.  However, both
    these attributes also acquire q->rq_qos_mutex, leading to the
    following lockdep warning:

    ======================================================
    WARNING: possible circular locking dependency detected
    6.14.0-rc5+ #138 Not tainted
    ------------------------------------------------------
    bash/5902 is trying to acquire lock:
    c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238

    but task is already holding lock:
    c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->elevator_lock){+.+.}-{4:4}:
            __mutex_lock+0xf0/0xa58
            ioc_qos_write+0x16c/0x85c
            cgroup_file_write+0xc4/0x32c
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
            __lock_acquire+0x1b6c/0x2ae0
            lock_acquire+0x140/0x430
            __mutex_lock+0xf0/0xa58
            wbt_init+0x164/0x238
            queue_wb_lat_store+0x1dc/0x20c
            queue_attr_store+0x12c/0x164
            sysfs_kf_write+0x6c/0xb0
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    other info that might help us debug this:

        Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
        lock(&q->elevator_lock);
                                    lock(&q->rq_qos_mutex);
                                    lock(&q->elevator_lock);
        lock(&q->rq_qos_mutex);

        *** DEADLOCK ***

    6 locks held by bash/5902:
        #0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
        #1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c
        #2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c
        #3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    stack backtrace:
    CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138
    Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
    Call Trace:
    [c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable)
    [c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604
    [c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c
    [c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0
    [c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430
    [c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58
    [c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238
    [c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c
    [c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164
    [c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0
    [c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c
    [c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584
    [c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140
    [c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360
    [c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

    >From the above log it's apparent that method which writes to sysfs attr
    wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_
    qos_mutex. However the another method which writes to io.cost.qos,
    acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So
    this could potentially cause the deadlock.

    A closer look at ioc_qos_write shows that correcting the lock order is
    non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev
    and released in blkg_conf_exit. The function blkg_conf_open_bdev is
    responsible for parsing user input and finding the corresponding block
    device (bdev) from the user provided major:minor number.

    Since we do not know the bdev until blkg_conf_open_bdev completes, we
    cannot simply move q->elevator_lock acquisition before blkg_conf_open_
    bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_
    frozen and blkg_conf_exit_frozen which are just wrappers around blkg_
    conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_
    open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally
    freezes the queue, acquires q->elevator_lock and ensures the correct
    locking order is followed between q->elevator_lock and q->rq_qos_mutex.
    Similarly another helper blkg_conf_exit_frozen in addition to unfreezing
    the queue ensures that we release the locks in correct order.

    By using these helpers, now we maintain the same locking order in all
    code paths where we update blk-wbt parameters.

    Fixes: 245618f ("block: protect wbt_lat_usec using q->elevator_lock")
    Reported-by: kernel test robot <[email protected]>
    Closes: https://lore.kernel.org/oe-lkp/[email protected]
    Signed-off-by: Nilay Shroff <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-92762
Upstream Status: kernel/git/torvalds/linux.git

commit 88f7f56
Author: Jinliang Zheng <[email protected]>
Date:   Thu Feb 20 19:20:14 2025 +0800

    dm: fix unconditional IO throttle caused by REQ_PREFLUSH

    When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
    generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
    which causes the flush_bio to be throttled by wbt_wait().

    An example from v5.4, similar problem also exists in upstream:

        crash> bt 2091206
        PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
         #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
         #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
         #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
         #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
         #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
         #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
         #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
         #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
         #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
         #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
        #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
        #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
        #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
        #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
        #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
        #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
        #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
        #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
        #18 [ffff800084a2fe70] kthread at ffff800040118de4

    After commit 2def284 ("xfs: don't allow log IO to be throttled"),
    the metadata submitted by xlog_write_iclog() should not be throttled.
    But due to the existence of the dm layer, throttling flush_bio indirectly
    causes the metadata bio to be throttled.

    Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
    wbt_should_throttle() return false to avoid wbt_wait().

    Signed-off-by: Jinliang Zheng <[email protected]>
    Reviewed-by: Tianxiang Peng <[email protected]>
    Reviewed-by: Hao Peng <[email protected]>
    Signed-off-by: Mikulas Patocka <[email protected]>

Signed-off-by: Benjamin Marzinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-92996

commit 053f3ff
Author: Dave Marquardt <[email protected]>
Date:   Wed Apr 2 10:44:03 2025 -0500

    net: ibmveth: make veth_pool_store stop hanging

    v2:
    - Created a single error handling unlock and exit in veth_pool_store
    - Greatly expanded commit message with previous explanatory-only text

    Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
    ibmveth_close and ibmveth_open, preventing multiple calls in a row to
    napi_disable.

    Background: Two (or more) threads could call veth_pool_store through
    writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
    with a little shell script. This causes a hang.

    I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
    kernel. I ran this test again and saw:

        Setting pool0/active to 0
        Setting pool1/active to 1
        [   73.911067][ T4365] ibmveth 30000002 eth0: close starting
        Setting pool1/active to 1
        Setting pool1/active to 0
        [   73.911367][ T4366] ibmveth 30000002 eth0: close starting
        [   73.916056][ T4365] ibmveth 30000002 eth0: close complete
        [   73.916064][ T4365] ibmveth 30000002 eth0: open starting
        [  110.808564][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  230.808495][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  243.683786][  T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
        [  243.683827][  T123]       Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
        [  243.683833][  T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [  243.683838][  T123] task:stress.sh       state:D stack:28096 pid:4365  tgid:4365  ppid:4364   task_flags:0x400040 flags:0x00042000
        [  243.683852][  T123] Call Trace:
        [  243.683857][  T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
        [  243.683868][  T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
        [  243.683878][  T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
        [  243.683888][  T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
        [  243.683896][  T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
        [  243.683904][  T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
        [  243.683913][  T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
        [  243.683921][  T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
        [  243.683928][  T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
        [  243.683936][  T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
        [  243.683944][  T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
        [  243.683951][  T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
        [  243.683958][  T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
        [  243.683966][  T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
        [  243.683973][  T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
        ...
        [  243.684087][  T123] Showing all locks held in the system:
        [  243.684095][  T123] 1 lock held by khungtaskd/123:
        [  243.684099][  T123]  #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
        [  243.684114][  T123] 4 locks held by stress.sh/4365:
        [  243.684119][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684132][  T123]  #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684143][  T123]  #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684155][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
        [  243.684166][  T123] 5 locks held by stress.sh/4366:
        [  243.684170][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684183][  T123]  #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684194][  T123]  #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684205][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
        [  243.684216][  T123]  #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0

    From the ibmveth debug, two threads are calling veth_pool_store, which
    calls ibmveth_close and ibmveth_open. Here's the sequence:

      T4365             T4366
      ----------------- ----------------- ---------
      veth_pool_store   veth_pool_store
                        ibmveth_close
      ibmveth_close
      napi_disable
                        napi_disable
      ibmveth_open
      napi_enable                         <- HANG

    ibmveth_close calls napi_disable at the top and ibmveth_open calls
    napi_enable at the top.

    https://docs.kernel.org/networking/napi.html]] says

      The control APIs are not idempotent. Control API calls are safe
      against concurrent use of datapath APIs but an incorrect sequence of
      control API calls may result in crashes, deadlocks, or race
      conditions. For example, calling napi_disable() multiple times in a
      row will deadlock.

    In the normal open and close paths, rtnl_mutex is acquired to prevent
    other callers. This is missing from veth_pool_store. Use rtnl_mutex in
    veth_pool_store fixes these hangs.

    Signed-off-by: Dave Marquardt <[email protected]>
    Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically")
    Reviewed-by: Nick Child <[email protected]>
    Reviewed-by: Simon Horman <[email protected]>
    Link: https://patch.msgid.link/[email protected]
    Signed-off-by: Jakub Kicinski <[email protected]>

Signed-off-by: Mamatha Inamdar <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/merge_requests/956

Description: Updates for ibmveth pool store

JIRA: https://issues.redhat.com/browse/RHEL-92996

Build Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=67710112

Tested: Verified Brew build test kernel RPMs and confirmed issue is resovled

Signed-off-by: Mamatha Inamdar <[email protected]>

commit 053f3ff
Author: Dave Marquardt <[email protected]>
Date:   Wed Apr 2 10:44:03 2025 -0500

    net: ibmveth: make veth_pool_store stop hanging

    v2:
    - Created a single error handling unlock and exit in veth_pool_store
    - Greatly expanded commit message with previous explanatory-only text

    Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
    ibmveth_close and ibmveth_open, preventing multiple calls in a row to
    napi_disable.

    Background: Two (or more) threads could call veth_pool_store through
    writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
    with a little shell script. This causes a hang.

    I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
    kernel. I ran this test again and saw:

        Setting pool0/active to 0
        Setting pool1/active to 1
        [   73.911067][ T4365] ibmveth 30000002 eth0: close starting
        Setting pool1/active to 1
        Setting pool1/active to 0
        [   73.911367][ T4366] ibmveth 30000002 eth0: close starting
        [   73.916056][ T4365] ibmveth 30000002 eth0: close complete
        [   73.916064][ T4365] ibmveth 30000002 eth0: open starting
        [  110.808564][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  230.808495][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  243.683786][  T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
        [  243.683827][  T123]       Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
        [  243.683833][  T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [  243.683838][  T123] task:stress.sh       state:D stack:28096 pid:4365  tgid:4365  ppid:4364   task_flags:0x400040 flags:0x00042000
        [  243.683852][  T123] Call Trace:
        [  243.683857][  T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
        [  243.683868][  T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
        [  243.683878][  T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
        [  243.683888][  T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
        [  243.683896][  T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
        [  243.683904][  T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
        [  243.683913][  T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
        [  243.683921][  T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
        [  243.683928][  T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
        [  243.683936][  T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
        [  243.683944][  T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
        [  243.683951][  T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
        [  243.683958][  T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
        [  243.683966][  T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
        [  243.683973][  T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
        ...
        [  243.684087][  T123] Showing all locks held in the system:
        [  243.684095][  T123] 1 lock held by khungtaskd/123:
        [  243.684099][  T123]  #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
        [  243.684114][  T123] 4 locks held by stress.sh/4365:
        [  243.684119][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684132][  T123]  #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684143][  T123]  #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684155][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
        [  243.684166][  T123] 5 locks held by stress.sh/4366:
        [  243.684170][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684183][  T123]  #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684194][  T123]  #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684205][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
        [  243.684216][  T123]  #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0

    From the ibmveth debug, two threads are calling veth_pool_store, which
    calls ibmveth_close and ibmveth_open. Here's the sequence:

      T4365             T4366
      ----------------- ----------------- ---------
      veth_pool_store   veth_pool_store
                        ibmveth_close
      ibmveth_close
      napi_disable
                        napi_disable
      ibmveth_open
      napi_enable                         <- HANG

    ibmveth_close calls napi_disable at the top and ibmveth_open calls
    napi_enable at the top.

    https://docs.kernel.org/networking/napi.html]] says

      The control APIs are not idempotent. Control API calls are safe
      against concurrent use of datapath APIs but an incorrect sequence of
      control API calls may result in crashes, deadlocks, or race
      conditions. For example, calling napi_disable() multiple times in a
      row will deadlock.

    In the normal open and close paths, rtnl_mutex is acquired to prevent
    other callers. This is missing from veth_pool_store. Use rtnl_mutex in
    veth_pool_store fixes these hangs.

    Signed-off-by: Dave Marquardt <[email protected]>
    Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically")
    Reviewed-by: Nick Child <[email protected]>
    Reviewed-by: Simon Horman <[email protected]>
    Link: https://patch.msgid.link/[email protected]
    Signed-off-by: Jakub Kicinski <[email protected]>

Signed-off-by: Mamatha Inamdar <[email protected]>

Approved-by: Steve Best <[email protected]>
Approved-by: Michal Schmidt <[email protected]>
Approved-by: CKI KWF Bot <[email protected]>

Merged-by: Julio Faracco <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-73484

commit e40b801
Author: D. Wythe <[email protected]>
Date:   Thu Feb 16 14:37:36 2023 +0800

    net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link()

    There is a certain chance to trigger the following panic:

    PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
     #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
     #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
     #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
     #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
     #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
     #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
     #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
        [exception RIP: ib_alloc_mr+19]
        RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
        RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
        RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
        RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
        R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
        R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
     #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
     #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
     #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]

    The reason here is that when the server tries to create a second link,
    smc_llc_srv_add_link() has no protection and may add a new link to
    link group. This breaks the security environment protected by
    llc_conf_mutex.

    Fixes: 2d2209f ("net/smc: first part of add link processing as SMC server")
    Signed-off-by: D. Wythe <[email protected]>
    Reviewed-by: Larysa Zaremba <[email protected]>
    Reviewed-by: Wenjia Zhang <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

Signed-off-by: Mete Durlu <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-92761
Upstream Status: kernel/git/torvalds/linux.git

commit 88f7f56
Author: Jinliang Zheng <[email protected]>
Date:   Thu Feb 20 19:20:14 2025 +0800

    dm: fix unconditional IO throttle caused by REQ_PREFLUSH

    When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
    generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
    which causes the flush_bio to be throttled by wbt_wait().

    An example from v5.4, similar problem also exists in upstream:

        crash> bt 2091206
        PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
         #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
         #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
         #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
         #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
         #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
         #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
         #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
         #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
         #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
         #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
        #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
        #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
        #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
        #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
        #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
        #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
        #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
        #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
        #18 [ffff800084a2fe70] kthread at ffff800040118de4

    After commit 2def284 ("xfs: don't allow log IO to be throttled"),
    the metadata submitted by xlog_write_iclog() should not be throttled.
    But due to the existence of the dm layer, throttling flush_bio indirectly
    causes the metadata bio to be throttled.

    Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
    wbt_should_throttle() return false to avoid wbt_wait().

    Signed-off-by: Jinliang Zheng <[email protected]>
    Reviewed-by: Tianxiang Peng <[email protected]>
    Reviewed-by: Hao Peng <[email protected]>
    Signed-off-by: Mikulas Patocka <[email protected]>

Signed-off-by: Benjamin Marzinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-77936

upstream
========
commit 2adbf53
Author: Athira Rajeev <[email protected]>
Date: Mon Dec 23 19:28:13 2024 +0530

description
===========
When kernel is built without debuginfo, running 'perf record' with
--off-cpu results in segfault as below:

   ./perf record --off-cpu -e dummy sleep 1
   libbpf: kernel BTF is missing at '/sys/kernel/btf/vmlinux', was CONFIG_DEBUG_INFO_BTF enabled?
   libbpf: failed to find '.BTF' ELF section in /lib/modules/6.13.0-rc3+/build/vmlinux
   libbpf: failed to find valid kernel BTF
   Segmentation fault (core dumped)

The backtrace pointed to:

   #0  0x00000000100fb17c in btf.type_cnt ()
   #1  0x00000000100fc1a8 in btf_find_by_name_kind ()
   #2  0x00000000100fc38c in btf.find_by_name_kind ()
   #3  0x00000000102ee3ac in off_cpu_prepare ()
   #4  0x000000001002f78c in cmd_record ()
   #5  0x00000000100aee78 in run_builtin ()
   #6  0x00000000100af3e4 in handle_internal_command ()
   #7  0x000000001001004c in main ()

Code sequence is:

   static void check_sched_switch_args(void)
   {
        struct btf *btf = btf__load_vmlinux_btf();
        const struct btf_type *t1, *t2, *t3;
        u32 type_id;

        type_id = btf__find_by_name_kind(btf, "btf_trace_sched_switch",
                                         BTF_KIND_TYPEDEF);

btf__load_vmlinux_btf() fails when CONFIG_DEBUG_INFO_BTF is not enabled.

Here bpf__find_by_name_kind() calls btf__type_cnt() with NULL btf value
and results in segfault.

To fix this, add a check to see if btf is not NULL before invoking
bpf__find_by_name_kind().

    Reviewed-by: Namhyung Kim <[email protected]>
    Signed-off-by: Athira Rajeev <[email protected]>
    Cc: Adrian Hunter <[email protected]>
    Cc: Disha Goel <[email protected]>
    Cc: Hari Bathini <[email protected]>
    Cc: Ian Rogers <[email protected]>
    Cc: Jiri Olsa <[email protected]>
    Cc: Kajol Jain <[email protected]>
    Cc: Madhavan Srinivasan <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

Signed-off-by: Michael Petlan <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-77936

upstream
========
commit c7b87ce
Author: Howard Chu <[email protected]>
Date: Tue Jan 21 18:55:19 2025 -0800

description
===========
libtraceevent parses and returns an array of argument fields, sometimes
larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr",
idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6
elements max, creating an out-of-bounds access. This runtime error is
found by UBsan. The error message:

  $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1
  builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]'
    #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966
    #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110
    #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436
    #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897
    #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335
    #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502
    #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351
    #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404
    #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448
    #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556
    #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360
    #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6)

     0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1)                                      = 1

Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint")
    Signed-off-by: Howard Chu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Namhyung Kim <[email protected]>

Signed-off-by: Michael Petlan <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-77936

upstream
========
commit 888751e
Author: Thomas Richter <[email protected]>
Date: Fri Jan 31 12:24:00 2025 +0100

description
===========
perf test 11 hwmon fails on s390 with this error

 # ./perf test -Fv 11
 --- start ---
 ---- end ----
 11.1: Basic parsing test             : Ok
 --- start ---
 Testing 'temp_test_hwmon_event1'
 Using CPUID IBM,3931,704,A01,3.7,002f
 temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/
 FAILED tests/hwmon_pmu.c:189 Unexpected config for
    'temp_test_hwmon_event1', 292470092988416 != 655361
 ---- end ----
 11.2: Parsing without PMU name       : FAILED!
 --- start ---
 Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/'
 FAILED tests/hwmon_pmu.c:189 Unexpected config for
    'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/',
    292470092988416 != 655361
 ---- end ----
 11.3: Parsing with PMU name          : FAILED!
 #

The root cause is in member test_event::config which is initialized
to 0xA0001 or 655361. During event parsing a long list event parsing
functions are called and end up with this gdb call stack:

 #0  hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8,
	term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623
 #1  hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8,
	terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662
 #2  0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0,
	attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false,
	apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519
 #3  0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8,
	head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8)
	at util/pmu.c:1545
 #4  0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8,
	list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090,
	auto_merge_stats=true, alternate_hw_config=10)
	at util/parse-events.c:1508
 #5  0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8,
	event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10,
	const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0)
	at util/parse-events.c:1592
 #6  0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8,
	scanner=0x16878c0) at util/parse-events.y:293
 #7  0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8
	"temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8)
	at util/parse-events.c:1867
 #8  0x000000000126a1e8 in __parse_events (evlist=0x168b580,
	str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0,
	err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true,
	fake_tp=false) at util/parse-events.c:2136
 #9  0x00000000011e36aa in parse_events (evlist=0x168b580,
	str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8)
	at /root/linux/tools/perf/util/parse-events.h:41
 #10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false)
	at tests/hwmon_pmu.c:164
 #11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false)
	at tests/hwmon_pmu.c:219
 #12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368
	<suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23

where the attr::config is set to value 292470092988416 or 0x10a0000000000
in line 625 of file ./util/hwmon_pmu.c:

   attr->config = key.type_and_num;

However member key::type_and_num is defined as union and bit field:

   union hwmon_pmu_event_key {
        long type_and_num;
        struct {
                int num :16;
                enum hwmon_type type :8;
        };
   };

s390 is big endian and Intel is little endian architecture.
The events for the hwmon dummy pmu have num = 1 or num = 2 and
type is set to HWMON_TYPE_TEMP (which is 10).
On s390 this assignes member key::type_and_num the value of
0x10a0000000000 (which is 292470092988416) as shown in above
trace output.

Fix this and export the structure/union hwmon_pmu_event_key
so the test shares the same implementation as the event parsing
functions for union and bit fields. This should avoid
endianess issues on all platforms.

Output after:
 # ./perf test -F 11
 11.1: Basic parsing test         : Ok
 11.2: Parsing without PMU name   : Ok
 11.3: Parsing with PMU name      : Ok
 #

Fixes: 531ee0f ("perf test: Add hwmon "PMU" test")
    Signed-off-by: Thomas Richter <[email protected]>
    Reviewed-by: Ian Rogers <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Namhyung Kim <[email protected]>

Signed-off-by: Michael Petlan <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-78701

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 93ae6e6
Author: Lu Baolu <[email protected]>
Date:   Wed Mar 19 10:21:01 2025 +0800

    iommu/vt-d: Fix possible circular locking dependency

    We have recently seen report of lockdep circular lock dependency warnings
    on platforms like Skylake and Kabylake:

     ======================================================
     WARNING: possible circular locking dependency detected
     6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
     ------------------------------------------------------
     swapper/0/1 is trying to acquire lock:
     ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3},
       at: iommu_probe_device+0x1d/0x70

     but task is already holding lock:
     ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3},
       at: intel_iommu_init+0xe75/0x11f0

     which lock already depends on the new lock.

     the existing dependency chain (in reverse order) is:

     -> #6 (&device->physical_node_lock){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            intel_iommu_init+0xe75/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #5 (dmar_global_lock){++++}-{3:3}:
            down_read+0x43/0x1d0
            enable_drhd_fault_handling+0x21/0x110
            cpuhp_invoke_callback+0x4c6/0x870
            cpuhp_issue_call+0xbf/0x1f0
            __cpuhp_setup_state_cpuslocked+0x111/0x320
            __cpuhp_setup_state+0xb0/0x220
            irq_remap_enable_fault_handling+0x3f/0xa0
            apic_intr_mode_init+0x5c/0x110
            x86_late_time_init+0x24/0x40
            start_kernel+0x895/0xbd0
            x86_64_start_reservations+0x18/0x30
            x86_64_start_kernel+0xbf/0x110
            common_startup_64+0x13e/0x141

     -> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            __cpuhp_setup_state_cpuslocked+0x67/0x320
            __cpuhp_setup_state+0xb0/0x220
            page_alloc_init_cpuhp+0x2d/0x60
            mm_core_init+0x18/0x2c0
            start_kernel+0x576/0xbd0
            x86_64_start_reservations+0x18/0x30
            x86_64_start_kernel+0xbf/0x110
            common_startup_64+0x13e/0x141

     -> #3 (cpu_hotplug_lock){++++}-{0:0}:
            __cpuhp_state_add_instance+0x4f/0x220
            iova_domain_init_rcaches+0x214/0x280
            iommu_setup_dma_ops+0x1a4/0x710
            iommu_device_register+0x17d/0x260
            intel_iommu_init+0xda4/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            iommu_setup_dma_ops+0x16b/0x710
            iommu_device_register+0x17d/0x260
            intel_iommu_init+0xda4/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #1 (&group->mutex){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            __iommu_probe_device+0x24c/0x4e0
            probe_iommu_group+0x2b/0x50
            bus_for_each_dev+0x7d/0xe0
            iommu_device_register+0xe1/0x260
            intel_iommu_init+0xda4/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
            __lock_acquire+0x1637/0x2810
            lock_acquire+0xc9/0x300
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            iommu_probe_device+0x1d/0x70
            intel_iommu_init+0xe90/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     other info that might help us debug this:

     Chain exists of:
       iommu_probe_device_lock --> dmar_global_lock -->
         &device->physical_node_lock

      Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
       lock(&device->physical_node_lock);
                                    lock(dmar_global_lock);
                                    lock(&device->physical_node_lock);
       lock(iommu_probe_device_lock);

      *** DEADLOCK ***

    This driver uses a global lock to protect the list of enumerated DMA
    remapping units. It is necessary due to the driver's support for dynamic
    addition and removal of remapping units at runtime.

    Two distinct code paths require iteration over this remapping unit list:

    - Device registration and probing: the driver iterates the list to
      register each remapping unit with the upper layer IOMMU framework
      and subsequently probe the devices managed by that unit.
    - Global configuration: Upper layer components may also iterate the list
      to apply configuration changes.

    The lock acquisition order between these two code paths was reversed. This
    caused lockdep warnings, indicating a risk of deadlock. Fix this warning
    by releasing the global lock before invoking upper layer interfaces for
    device registration.

    Fixes: b150654 ("iommu/vt-d: Fix suspicious RCU usage")
    Closes: https://lore.kernel.org/linux-iommu/SJ1PR11MB612953431F94F18C954C4A9CB9D32@SJ1PR11MB6129.namprd11.prod.outlook.com/
    Tested-by: Chaitanya Kumar Borah <[email protected]>
    Cc: [email protected]
    Signed-off-by: Lu Baolu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Joerg Roedel <[email protected]>

Signed-off-by: Eder Zulian <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
Add a compile-time check that `*$ptr` is of the type of `$type->$($f)*`.
Rename those placeholders for clarity.

Given the incorrect usage:

> diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs
> index 8d978c8..6a7089149878 100644
> --- a/rust/kernel/rbtree.rs
> +++ b/rust/kernel/rbtree.rs
> @@ -329,7 +329,7 @@ fn raw_entry(&mut self, key: &K) -> RawEntry<'_, K, V> {
>          while !(*child_field_of_parent).is_null() {
>              let curr = *child_field_of_parent;
>              // SAFETY: All links fields we create are in a `Node<K, V>`.
> -            let node = unsafe { container_of!(curr, Node<K, V>, links) };
> +            let node = unsafe { container_of!(curr, Node<K, V>, key) };
>
>              // SAFETY: `node` is a non-null node so it is valid by the type invariants.
>              match key.cmp(unsafe { &(*node).key }) {

this patch produces the compilation error:

> error[E0308]: mismatched types
>    --> rust/kernel/lib.rs:220:45
>     |
> 220 |         $crate::assert_same_type(field_ptr, (&raw const (*container_ptr).$($fields)*).cast_mut());
>     |         ------------------------ ---------  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*mut rb_node`, found `*mut K`
>     |         |                        |
>     |         |                        expected all arguments to be this `*mut bindings::rb_node` type because they need to match the type of this parameter
>     |         arguments to this function are incorrect
>     |
>    ::: rust/kernel/rbtree.rs:270:6
>     |
> 270 | impl<K, V> RBTree<K, V>
>     |      - found this type parameter
> ...
> 332 |             let node = unsafe { container_of!(curr, Node<K, V>, key) };
>     |                                 ------------------------------------ in this macro invocation
>     |
>     = note: expected raw pointer `*mut bindings::rb_node`
>                found raw pointer `*mut K`
> note: function defined here
>    --> rust/kernel/lib.rs:227:8
>     |
> 227 | pub fn assert_same_type<T>(_: T, _: T) {}
>     |        ^^^^^^^^^^^^^^^^ -  ----  ---- this parameter needs to match the `*mut bindings::rb_node` type of parameter #1
>     |                         |  |
>     |                         |  parameter #2 needs to match the `*mut bindings::rb_node` type of this parameter
>     |                         parameter #1 and parameter #2 both reference this parameter `T`
>     = note: this error originates in the macro `container_of` (in Nightly builds, run with -Z macro-backtrace for more info)

[ We decided to go with a variation of v1 [1] that became v4, since it
  seems like the obvious approach, the error messages seem good enough
  and the debug performance should be fine, given the kernel is always
  built with -O2.

  In the future, we may want to make the helper non-hidden, with
  proper documentation, for others to use.

  [1] https://lore.kernel.org/rust-for-linux/CANiq72kQWNfSV0KK6qs6oJt+aGdgY=hXg=wJcmK3zYcokY1LNw@mail.gmail.com/

    - Miguel ]

Suggested-by: Alice Ryhl <[email protected]>
Link: https://lore.kernel.org/all/CAH5fLgh6gmqGBhPMi2SKn7mCmMWfOSiS0WP5wBuGPYh9ZTAiww@mail.gmail.com/
Signed-off-by: Tamir Duberstein <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Added intra-doc link. - Miguel ]
Signed-off-by: Miguel Ojeda <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 14, 2025
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.16, take #2

- Rework of system register accessors for system registers that are
  directly writen to memory, so that sanitisation of the in-memory
  value happens at the correct time (after the read, or before the
  write). For convenience, RMW-style accessors are also provided.

- Multiple fixes for the so-called "arch-timer-edge-cases' selftest,
  which was always broken.
github-actions bot pushed a commit that referenced this pull request Jun 17, 2025
JIRA: https://issues.redhat.com/browse/RHEL-80338

commit 32ed120
Author: Mark Rutland <[email protected]>
Date: Wed, 11 Dec 2024 14:07:03 +0000

    Aishwarya reports that warnings are sometimes seen when running the
    ftrace kselftests, e.g.

    | WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
    | Modules linked in:
    | CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
    | Hardware name: linux,dummy-virt (DT)
    | pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    | pc : arch_stack_walk+0x4a0/0x4c0
    | lr : arch_stack_walk+0x248/0x4c0
    | sp : ffff800083643d20
    | x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
    | x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
    | x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
    | x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
    | x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
    | x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
    | x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
    | x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
    | x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
    | x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
    | Call trace:
    |  arch_stack_walk+0x4a0/0x4c0 (P)
    |  arch_stack_walk+0x248/0x4c0 (L)
    |  profile_pc+0x44/0x80
    |  profile_tick+0x50/0x80 (F)
    |  tick_nohz_handler+0xcc/0x160 (F)
    |  __hrtimer_run_queues+0x2ac/0x340 (F)
    |  hrtimer_interrupt+0xf4/0x268 (F)
    |  arch_timer_handler_virt+0x34/0x60 (F)
    |  handle_percpu_devid_irq+0x88/0x220 (F)
    |  generic_handle_domain_irq+0x34/0x60 (F)
    |  gic_handle_irq+0x54/0x140 (F)
    |  call_on_irq_stack+0x24/0x58 (F)
    |  do_interrupt_handler+0x88/0x98
    |  el1_interrupt+0x34/0x68 (F)
    |  el1h_64_irq_handler+0x18/0x28
    |  el1h_64_irq+0x6c/0x70
    |  queued_spin_lock_slowpath+0x78/0x460 (P)

    The warning in question is:

      WARN_ON_ONCE(state->common.pc == orig_pc))

    ... in kunwind_recover_return_address(), which is triggered when
    return_to_handler() is encountered in the trace, but
    ftrace_graph_ret_addr() cannot find a corresponding original return
    address on the fgraph return stack.

    This happens because the stacktrace code encounters an exception
    boundary where the LR was not live at the time of the exception, but the
    LR happens to contain return_to_handler(); either because the task
    recently returned there, or due to unfortunate usage of the LR at a
    scratch register. In such cases attempts to recover the return address
    via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
    above and aborting the unwind (hence the stacktrace terminating after
    reporting the PC at the time of the exception).

    Handling unreliable LR values in these cases is likely to require some
    larger rework, so for the moment avoid this problem by restoring the old
    behaviour of skipping the LR at exception boundaries, which the
    stacktrace code did prior to commit:

      c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

    This commit is effectively a partial revert, keeping the structures and
    logic to explicitly identify exception boundaries while still skipping
    reporting of the LR. The logic to explicitly identify exception
    boundaries is still useful for general robustness and as a building
    block for future support for RELIABLE_STACKTRACE.

    Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
    Signed-off-by: Mark Rutland <[email protected]>
    Reported-by: Aishwarya TCV <[email protected]>
    Cc: Will Deacon <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Catalin Marinas <[email protected]>

Signed-off-by: Mark Salter <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 17, 2025
JIRA: https://issues.redhat.com/browse/RHEL-80338

commit 65ac33b
Author: Mark Rutland <[email protected]>
Date: Wed, 11 Dec 2024 14:07:04 +0000

    The arm64 stacktrace code has a few error conditions where a
    WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
    error is returned to the caller. The conditions shouldn't be triggered
    when unwinding the current task, but it is possible to trigger these
    when unwinding another task which is not blocked, as the stack of that
    task is concurrently modified. Kent reports that these warnings can be
    triggered while running filesystem tests on bcachefs, which calls the
    stacktrace code directly.

    To produce a meaningful stacktrace of another task, the task in question
    should be blocked, but the stacktrace code is expected to be robust to
    cases where it is not blocked. Note that this is purely about not
    unuduly scaring the user and/or crashing the kernel; stacktraces in such
    cases are meaningless and may leak kernel secrets from the stack of the
    task being unwound.

    Ideally we'd pin the task in a blocked state during the unwind, as we do
    for /proc/${PID}/wchan since commit:

      42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

    ... but a bunch of places don't do that, notably /proc/${PID}/stack,
    where we don't pin the task in a blocked state, but do restrict the
    output to privileged users since commit:

      f8a00ce ("proc: restrict kernel stack dumps to root")

    ... and so it's possible to trigger these warnings accidentally, e.g. by
    reading /proc/*/stack (as root):

    | for n in $(seq 1 10); do
    |     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
    | done
    | ------------[ cut here ]------------
    | WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
    | Modules linked in:
    | CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
    | Hardware name: linux,dummy-virt (DT)
    | pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
    | pc : arch_stack_walk+0x1c8/0x370
    | lr : arch_stack_walk+0x1b0/0x370
    | sp : ffff800080773890
    | x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
    | x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
    | x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
    | x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
    | x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    | x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
    | x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
    | x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
    | x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
    | x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
    | Call trace:
    |  arch_stack_walk+0x1c8/0x370 (P)
    |  stack_trace_save_tsk+0x8c/0x108
    |  proc_pid_stack+0xb0/0x134
    |  proc_single_show+0x60/0x120
    |  seq_read_iter+0x104/0x438
    |  seq_read+0xf8/0x140
    |  vfs_read+0xc4/0x31c
    |  ksys_read+0x70/0x108
    |  __arm64_sys_read+0x1c/0x28
    |  invoke_syscall+0x48/0x104
    |  el0_svc_common.constprop.0+0x40/0xe0
    |  do_el0_svc+0x1c/0x28
    |  el0_svc+0x30/0xcc
    |  el0t_64_sync_handler+0x10c/0x138
    |  el0t_64_sync+0x198/0x19c
    | ---[ end trace 0000000000000000 ]---

    Fix this by only warning when unwinding the current task. When unwinding
    another task the error conditions will be handled by returning an error
    without producing a warning.

    The two warnings in kunwind_next_frame_record_meta() were added recently
    as part of commit:

      c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

    The warning when recovering the fgraph return address has changed form
    many times, but was originally introduced back in commit:

      9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

    Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
    Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
    Signed-off-by: Mark Rutland <[email protected]>
    Reported-by: Kent Overstreet <[email protected]>
    Cc: Kees Cook <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Will Deacon <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Catalin Marinas <[email protected]>

Signed-off-by: Mark Salter <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants