Skip to content

kernel: workq: introduce work timeout: #88345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

bjarki-andreasen
Copy link
Collaborator

@bjarki-andreasen bjarki-andreasen commented Apr 9, 2025

Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers.

This is a far more permissive alternative to #87522. It allows blocking items, as long as they don't take longer than specified. Feature works on a per workqueue basis. If the workqueue is blocked, an ERR log will be printed and the work queue thread will be aborted.

Example output from test suite

[00:00:01.010,000] <wrn> os: queue sysworkq blocked by work 0x805a0e0 with handler 0x80494a6

if the aborted thread is the essential system workqueue thread, kernel explodes.

@bjarki-andreasen
Copy link
Collaborator Author

added option to exclude the timeout entirely so the feature has zero overhead if not used.

Copy link
Collaborator

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicks, but this seems unobjectionable

kernel/Kconfig Outdated
@@ -600,6 +603,13 @@ config SYSTEM_WORKQUEUE_NO_YIELD
cooperative and a sequence of work items is expected to complete
without yielding.

config SYSTEM_WORKQUEUE_WORK_TIMEOUT_MS
int "Select system work queue work timeout in milliseconds"
default 10000 if DEBUG
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONFIG_DEBUG is a control over compiler optimizations, it's the wrong tunable to use here. You probably want CONFIG_ASSERT to gate the default.

Copy link
Collaborator Author

@bjarki-andreasen bjarki-andreasen Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to ASSERT and decreased the time to 5000ms (still far longer than I think is reasonable but hopefully users will adjust it lower)

kernel/work.c Outdated
bool k_sys_work_queue_is_blocked(void)
{
return flag_test(&k_sys_work_q.flags, K_WORK_QUEUE_BLOCKED_BIT);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really seeing the point of this as an application API? I mean, you're not supposed to be blocked at all. One doesn't normally write is_my_code_broken() predicates, nor call them in working code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for use in tests, can move it to an internal header :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use k_oops() on block this API can be removed entirely

Copy link
Collaborator Author

@bjarki-andreasen bjarki-andreasen Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API removed, thread is aborted if blocked, so a user who cares could join or check if the work queue thread is running :)

kernel/work.c Outdated
if (name != NULL) {
LOG_WRN("queue %s blocked by work %p with handler %p", name, work, handler);
} else {
LOG_WRN("queue %p blocked by work %p with handler %p", queue, work, handler);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the k_oops() from the last PR might be a better choice, though maybe configurable. A mere warning message isn't likely loud enough for something we almost all agree is a should-never-happen condition.

Alternatively make the callback be settable by the app and have the default blow up, but let apps do what they want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add k_oops() (or panic?), this will also make the code a tiny but simpler since there is no unblock scenario :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed strategy a bit, since the timeout is run from a _timeout, k_oops() makes no sense since it is not run from the same thread as the work queue, so instead, abort the work queue and let the kernel handle it (essential thread would result in k_panic())

@bjarki-andreasen bjarki-andreasen force-pushed the workq-work-timeout branch 2 times, most recently from f76fe37 to b575ee5 Compare April 10, 2025 07:32
@cfriedt
Copy link
Member

cfriedt commented Apr 10, 2025

Some more nitpicks - sorry - otherwise looking good

@cfriedt
Copy link
Member

cfriedt commented Apr 10, 2025

Also, would be really good to add a test for this!

andyross
andyross previously approved these changes Apr 10, 2025
Copy link
Collaborator

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm out of nitpicks, this looks very reasonable

cfriedt
cfriedt previously approved these changes Apr 10, 2025
Copy link
Member

@cfriedt cfriedt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lingering nits, but I guess I had better not block

kernel/work.c Outdated
Comment on lines 620 to 623
if (name != NULL) {
LOG_ERR("queue %s blocked by work %p with handler %p", name, work, handler);
} else {
LOG_ERR("queue %p blocked by work %p with handler %p", queue, work, handler);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not crazy about this, as it needlessly duplicates a nearly identical string in the string table.

Copy link
Collaborator Author

@bjarki-andreasen bjarki-andreasen Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you solve it? the string is about 30 bytes, and stored in ROM, the added operations of conditionally copying the thread name or a queue pointer, to a RAM buffer, to then pass to the LOG_ERR could easily take up the same or more ROM (while also adding complexity, the worst kind, involving null terminated strings)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(LOG_ERR is probably adding code as well given the use of VARGS)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you solve it?

Using the code suggestion here:
#88345 (comment)

Copy link
Member

@cfriedt cfriedt Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjarki-andreasen - can you address this comment as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@cfriedt
Copy link
Member

cfriedt commented Jun 3, 2025

Oh wow - I almost forgot about this one.

@bjarki-andreasen - I'll stamp if you can address the latest comments.

@bjarki-andreasen bjarki-andreasen force-pushed the workq-work-timeout branch 4 times, most recently from d5e44ef to 60a0a34 Compare June 4, 2025 09:18
@bjarki-andreasen bjarki-andreasen requested a review from nashif June 4, 2025 09:18
d3zd3z added a commit to d3zd3z/zephyr that referenced this pull request Jun 4, 2025
Bring in the following changes from the rust module:

dd73abc242e zephyr: work: Allow struct to have a additional fields
898662c0889 zephyr-sys: Handle deflected GPIO constants
174ded53bd6 ci-manifest.yml: Add cmsis_6

This should fix CI for Rust.  In addition, the allowing struct to have
addition fields should unblock zephyrproject-rtos#88345.

Signed-off-by: David Brown <[email protected]>
@d3zd3z d3zd3z mentioned this pull request Jun 4, 2025
Copy link
Member

@cfriedt cfriedt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I"m kind of wondering why we don't use k_timeout_t here

nashif pushed a commit that referenced this pull request Jun 4, 2025
Bring in the following changes from the rust module:

dd73abc242e zephyr: work: Allow struct to have a additional fields
898662c0889 zephyr-sys: Handle deflected GPIO constants
174ded53bd6 ci-manifest.yml: Add cmsis_6

This should fix CI for Rust.  In addition, the allowing struct to have
addition fields should unblock #88345.

Signed-off-by: David Brown <[email protected]>
@github-actions github-actions bot removed manifest manifest-zephyr-lang-rust DNM (manifest) This PR should not be merged (controlled by action-manifest) labels Jun 5, 2025
@bjarki-andreasen bjarki-andreasen removed the Architecture Review Discussion in the Architecture WG required label Jun 5, 2025
Introduce work timeout, which is an optional workqueue configuration
which enables monitoring for work items which take longer than
expected. This could be due to long running or deadlocked handlers.

Signed-off-by: Bjarki Arge Andreasen <[email protected]>
Add workqueue work timeout test to work_queue test suite.

Signed-off-by: Bjarki Arge Andreasen <[email protected]>
Copy link

sonarqubecloud bot commented Jun 5, 2025

@bjarki-andreasen
Copy link
Collaborator Author

Its finally all green!

Copy link
Collaborator

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refresh +1

@kartben kartben merged commit 73bf428 into zephyrproject-rtos:main Jun 6, 2025
30 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in Architecture Review Jun 6, 2025
Chenhongren pushed a commit to Chenhongren/zephyr that referenced this pull request Jun 20, 2025
Bring in the following changes from the rust module:

dd73abc242e zephyr: work: Allow struct to have a additional fields
898662c0889 zephyr-sys: Handle deflected GPIO constants
174ded53bd6 ci-manifest.yml: Add cmsis_6

This should fix CI for Rust.  In addition, the allowing struct to have
addition fields should unblock zephyrproject-rtos#88345.

(cherry picked from commit bd4d3f8)

Original-Signed-off-by: David Brown <[email protected]>
GitOrigin-RevId: bd4d3f8
Cr-Build-Id: 8712950611002603889
Cr-Build-Url: https://cr-buildbucket.appspot.com/build/8712950611002603889
Copybot-Job-Name: zephyr-main-copybot-downstream
Change-Id: I4b33e14a3213625dd49ac96ade22216b4935a1ad
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/6622611
Commit-Queue: Dawid Niedźwiecki <[email protected]>
Tested-by: ChromeOS Copybot <[email protected]>
Reviewed-by: Dawid Niedźwiecki <[email protected]>
Tested-by: Dawid Niedźwiecki <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

10 participants