kernel: workq: introduce work timeout: #88345

bjarki-andreasen · 2025-04-09T08:46:16Z

Introduce work timeout, an optional workqueue configuration that monitors for work items which take longer than expected, whether because of long-running or deadlocked handlers.

This is a far more permissive alternative to #87522. It allows blocking work items, as long as they don't take longer than the specified timeout. The feature is configured per workqueue. If a workqueue is blocked, an ERR log is printed and the workqueue thread is aborted.

Example output from test suite

[00:00:01.010,000] <wrn> os: queue sysworkq blocked by work 0x805a0e0 with handler 0x80494a6

If the aborted thread is the essential system workqueue thread, the kernel panics.
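To make the behaviour described above concrete, here is a minimal host-side model in plain C. It is a sketch under stated assumptions, not the code in kernel/work.c: every name (`model_workq`, `model_work_begin`, `model_timer_tick`, and so on) is hypothetical, time is passed in explicitly, and treating a timeout of 0 as "monitoring disabled" is an assumption of the model.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical host-side model; none of these names are Zephyr APIs. */
enum wq_state { WQ_RUNNING, WQ_ABORTED, WQ_PANIC };

struct model_workq {
	const char *name;
	bool essential;           /* aborting an essential thread is fatal */
	uint32_t work_timeout_ms; /* assumption: 0 disables monitoring */
	bool busy;                /* a handler is currently executing */
	uint64_t deadline_ms;     /* only meaningful while busy */
	enum wq_state state;
};

/* The queue thread picks up a work item: arm the timeout. */
static void model_work_begin(struct model_workq *q, uint64_t now_ms)
{
	q->busy = true;
	q->deadline_ms = now_ms + q->work_timeout_ms;
}

/* The handler returned in time: disarm the timeout. */
static void model_work_end(struct model_workq *q)
{
	q->busy = false;
}

/*
 * Runs from timer context, not from the queue thread itself. On expiry
 * the queue thread is marked aborted; for an essential thread the model
 * reports a panic instead.
 */
static enum wq_state model_timer_tick(struct model_workq *q, uint64_t now_ms)
{
	if (q->state == WQ_RUNNING && q->busy && q->work_timeout_ms != 0U &&
	    now_ms >= q->deadline_ms) {
		q->state = q->essential ? WQ_PANIC : WQ_ABORTED;
	}
	return q->state;
}
```

A handler that returns before its deadline leaves the queue in `WQ_RUNNING`; only a handler still executing when the deadline passes changes the state.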

kernel/Kconfig

bjarki-andreasen · 2025-04-09T09:21:39Z

Added an option to exclude the timeout entirely, so the feature has zero overhead if not used.
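A rough illustration of what "zero overhead if not used" can look like in C: the timeout field and any code reading it are guarded by a compile-time switch. `MODEL_WORK_TIMEOUT` and the struct below are hypothetical stand-ins, not the Kconfig symbol or the struct layout from this PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical switch standing in for a Kconfig symbol; define it to enable. */
/* #define MODEL_WORK_TIMEOUT 1 */

struct model_workq_config {
	const char *name;
	bool no_yield;
#ifdef MODEL_WORK_TIMEOUT
	uint32_t work_timeout_ms; /* only present when the feature is built in */
#endif
};

/* Returns the configured timeout, or 0 when the feature is compiled out. */
static inline uint32_t model_timeout_ms(const struct model_workq_config *cfg)
{
#ifdef MODEL_WORK_TIMEOUT
	return cfg->work_timeout_ms;
#else
	(void)cfg;
	return 0U;
#endif
}
```

With the switch undefined, the struct carries no extra field and the accessor collapses to a constant, so neither RAM nor code is spent on the feature.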

andyross

Nitpicks, but this seems unobjectionable

andyross · 2025-04-09T18:20:51Z

kernel/Kconfig

@@ -600,6 +603,13 @@ config SYSTEM_WORKQUEUE_NO_YIELD
 	  cooperative and a sequence of work items is expected to complete
 	  without yielding.

+config SYSTEM_WORKQUEUE_WORK_TIMEOUT_MS
+	int "Select system work queue work timeout in milliseconds"
+	default 10000 if DEBUG


CONFIG_DEBUG is a control over compiler optimizations, it's the wrong tunable to use here. You probably want CONFIG_ASSERT to gate the default.

Changed to ASSERT and decreased the time to 5000ms (still far longer than I think is reasonable but hopefully users will adjust it lower)

andyross · 2025-04-09T18:23:46Z

kernel/work.c

+bool k_sys_work_queue_is_blocked(void)
+{
+	return flag_test(&k_sys_work_q.flags, K_WORK_QUEUE_BLOCKED_BIT);
+}


Not really seeing the point of this as an application API? I mean, you're not supposed to be blocked at all. One doesn't normally write is_my_code_broken() predicates, nor call them in working code.

This is for use in tests, can move it to an internal header :)

If we use k_oops() on block this API can be removed entirely

API removed, thread is aborted if blocked, so a user who cares could join or check if the work queue thread is running :)

andyross · 2025-04-09T18:25:20Z

kernel/work.c

+	if (name != NULL) {
+		LOG_WRN("queue %s blocked by work %p with handler %p", name, work, handler);
+	} else {
+		LOG_WRN("queue %p blocked by work %p with handler %p", queue, work, handler);


Seems like the k_oops() from the last PR might be a better choice, though maybe configurable. A mere warning message isn't likely loud enough for something we almost all agree is a should-never-happen condition.

Alternatively make the callback be settable by the app and have the default blow up, but let apps do what they want?

I will add k_oops() (or panic?), this will also make the code a tiny bit simpler since there is no unblock scenario :)

Changed strategy a bit: since the timeout handler runs from a _timeout, k_oops() makes no sense because it would not run from the work queue's own thread. Instead, abort the work queue thread and let the kernel handle it (aborting an essential thread would result in k_panic()).

cfriedt · 2025-04-10T10:15:34Z

Some more nitpicks - sorry - otherwise looking good

cfriedt · 2025-04-10T11:59:13Z

Also, would be really good to add a test for this!

andyross

I'm out of nitpicks, this looks very reasonable

cfriedt

Lingering nits, but I guess I had better not block

cfriedt · 2025-04-10T14:00:00Z

kernel/work.c

+	if (name != NULL) {
+		LOG_ERR("queue %s blocked by work %p with handler %p", name, work, handler);
+	} else {
+		LOG_ERR("queue %p blocked by work %p with handler %p", queue, work, handler);
+	}


Still not crazy about this, as it needlessly duplicates a nearly identical string in the string table.

How would you solve it? The string is about 30 bytes and stored in ROM. The added operations of conditionally copying the thread name or a queue pointer to a RAM buffer, to then pass it to LOG_ERR, could easily take up the same or more ROM (while also adding complexity of the worst kind, involving null-terminated strings).

(LOG_ERR is probably adding code as well given the use of VARGS)

How would you solve it?

Using the code suggestion here:
#88345 (comment)

@bjarki-andreasen - can you address this comment as well?
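For readers following the trade-off in this thread, the sketch below shows the buffer-based alternative described in the reply above: format either the name or the pointer into a small RAM buffer, then use a single format string. It is not the code suggestion linked in the thread, whose content is not shown here. The helper names are hypothetical, plain `snprintf` stands in for the logging macro, and the handler pointer is left out because printing a function pointer with `%p` is not portable in host C.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes the queue name, or its address when unnamed. */
static const char *model_queue_ident(char *buf, size_t len, const char *name,
				     const void *queue)
{
	if (name != NULL) {
		snprintf(buf, len, "%s", name);
	} else {
		snprintf(buf, len, "%p", (void *)queue);
	}
	return buf;
}

/* One format string for both cases, at the cost of a stack buffer and a copy. */
static int model_format_blocked(char *out, size_t len, const char *name,
				const void *queue, const void *work)
{
	char ident[32];

	model_queue_ident(ident, sizeof(ident), name, queue);
	return snprintf(out, len, "queue %s blocked by work %p", ident,
			(void *)work);
}
```

The duplicated literal disappears, but the extra buffer, the extra call, and the string handling are exactly the costs the reply weighs against roughly 30 bytes of ROM.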

cfriedt · 2025-06-03T10:25:07Z

Oh wow - I almost forgot about this one.

@bjarki-andreasen - I'll stamp if you can address the latest comments.

Bring in the following changes from the rust module: dd73abc242e zephyr: work: Allow struct to have a additional fields 898662c0889 zephyr-sys: Handle deflected GPIO constants 174ded53bd6 ci-manifest.yml: Add cmsis_6 This should fix CI for Rust. In addition, the allowing struct to have addition fields should unblock zephyrproject-rtos#88345. Signed-off-by: David Brown <[email protected]>

cfriedt

I'm kind of wondering why we don't use k_timeout_t here

Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers. Signed-off-by: Bjarki Arge Andreasen <[email protected]>

Add workqueue work timeout test to work_queue test suite. Signed-off-by: Bjarki Arge Andreasen <[email protected]>

sonarqubecloud · 2025-06-05T11:13:44Z

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

bjarki-andreasen · 2025-06-05T12:36:01Z

It's finally all green!

andyross

Refresh +1

bjarki-andreasen mentioned this pull request Apr 9, 2025

System workqueue: Prevent blocking API calls #87522

Closed

pdgendt reviewed Apr 9, 2025

bjarki-andreasen force-pushed the workq-work-timeout branch from 7a199dc to ceb645c on April 9, 2025 09:20

bjarki-andreasen force-pushed the workq-work-timeout branch from ceb645c to 1e9f16f on April 9, 2025 13:36

bjarki-andreasen marked this pull request as ready for review April 9, 2025 13:46

github-actions bot added the area: Kernel label Apr 9, 2025

github-actions bot requested review from andyross, ceolin, cfriedt, dcpleung, nashif, npitre, peter-mitsis and TaiJuWu April 9, 2025 13:47

github-actions bot assigned andyross and peter-mitsis Apr 9, 2025

andyross reviewed Apr 9, 2025

bjarki-andreasen force-pushed the workq-work-timeout branch 2 times, most recently from f76fe37 to b575ee5 on April 10, 2025 07:32

cfriedt reviewed Apr 10, 2025

cfriedt reviewed Apr 10, 2025

bjarki-andreasen force-pushed the workq-work-timeout branch from b575ee5 to 417e7ab on April 10, 2025 13:19

andyross previously approved these changes Apr 10, 2025

cfriedt previously approved these changes Apr 10, 2025

bjarki-andreasen dismissed stale reviews from cfriedt and andyross via 90c81b4 April 11, 2025 07:14

bjarki-andreasen force-pushed the workq-work-timeout branch from 417e7ab to 90c81b4 on April 11, 2025 07:14

bjarki-andreasen dismissed teburd’s stale review via e95765c June 3, 2025 11:16

bjarki-andreasen force-pushed the workq-work-timeout branch 4 times, most recently from d5e44ef to 60a0a34 on June 4, 2025 09:18

bjarki-andreasen requested a review from nashif June 4, 2025 09:18

d3zd3z mentioned this pull request Jun 4, 2025

Rust CI fixes #91074

Merged

cfriedt reviewed Jun 4, 2025

bjarki-andreasen force-pushed the workq-work-timeout branch from 60a0a34 to 5568211 on June 5, 2025 10:11

github-actions bot removed the manifest, manifest-zephyr-lang-rust, and DNM (manifest) labels Jun 5, 2025

bjarki-andreasen removed the Architecture Review label Jun 5, 2025

bjarki-andreasen added 2 commits June 5, 2025 12:54

tests: kernel: workq: work_queue: add work timeout test

495c50a

Add workqueue work timeout test to work_queue test suite. Signed-off-by: Bjarki Arge Andreasen <[email protected]>

bjarki-andreasen force-pushed the workq-work-timeout branch from 5568211 to 495c50a on June 5, 2025 10:56

bjarki-andreasen requested review from cfriedt, pdgendt, teburd and andyross June 5, 2025 12:36

andyross approved these changes Jun 5, 2025

cfriedt approved these changes Jun 5, 2025

kartben merged commit 73bf428 into zephyrproject-rtos:main Jun 6, 2025
30 checks passed

github-project-automation bot moved this from Todo to Done in Architecture Review Jun 6, 2025
