-
Notifications
You must be signed in to change notification settings - Fork 7.5k
kernel: workq: introduce work timeout: #88345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kernel: workq: introduce work timeout: #88345
Conversation
7a199dc
to
ceb645c
Compare
added option to exclude the timeout entirely so the feature has zero overhead if not used. |
ceb645c
to
1e9f16f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicks, but this seems unobjectionable
kernel/Kconfig
Outdated
@@ -600,6 +603,13 @@ config SYSTEM_WORKQUEUE_NO_YIELD | |||
cooperative and a sequence of work items is expected to complete | |||
without yielding. | |||
|
|||
config SYSTEM_WORKQUEUE_WORK_TIMEOUT_MS | |||
int "Select system work queue work timeout in milliseconds" | |||
default 10000 if DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CONFIG_DEBUG is a control over compiler optimizations, it's the wrong tunable to use here. You probably want CONFIG_ASSERT to gate the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to ASSERT and decreased the time to 5000ms (still far longer than I think is reasonable but hopefully users will adjust it lower)
kernel/work.c
Outdated
bool k_sys_work_queue_is_blocked(void) | ||
{ | ||
return flag_test(&k_sys_work_q.flags, K_WORK_QUEUE_BLOCKED_BIT); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really seeing the point of this as an application API? I mean, you're not supposed to be blocked at all. One doesn't normally write is_my_code_broken()
predicates, nor call them in working code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for use in tests, can move it to an internal header :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use k_oops() on block this API can be removed entirely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API removed, thread is aborted if blocked, so a user who cares could join or check if the work queue thread is running :)
kernel/work.c
Outdated
if (name != NULL) { | ||
LOG_WRN("queue %s blocked by work %p with handler %p", name, work, handler); | ||
} else { | ||
LOG_WRN("queue %p blocked by work %p with handler %p", queue, work, handler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the k_oops() from the last PR might be a better choice, though maybe configurable. A mere warning message isn't likely loud enough for something we almost all agree is a should-never-happen condition.
Alternatively make the callback be settable by the app and have the default blow up, but let apps do what they want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add k_oops() (or panic?), this will also make the code a tiny but simpler since there is no unblock scenario :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed strategy a bit, since the timeout is run from a _timeout, k_oops() makes no sense since it is not run from the same thread as the work queue, so instead, abort the work queue and let the kernel handle it (essential thread would result in k_panic())
f76fe37
to
b575ee5
Compare
Some more nitpicks - sorry - otherwise looking good |
Also, would be really good to add a test for this! |
b575ee5
to
417e7ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm out of nitpicks, this looks very reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lingering nits, but I guess I had better not block
kernel/work.c
Outdated
if (name != NULL) { | ||
LOG_ERR("queue %s blocked by work %p with handler %p", name, work, handler); | ||
} else { | ||
LOG_ERR("queue %p blocked by work %p with handler %p", queue, work, handler); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not crazy about this, as it needlessly duplicates a nearly identical string in the string table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you solve it? the string is about 30 bytes, and stored in ROM, the added operations of conditionally copying the thread name or a queue pointer, to a RAM buffer, to then pass to the LOG_ERR could easily take up the same or more ROM (while also adding complexity, the worst kind, involving null terminated strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(LOG_ERR is probably adding code as well given the use of VARGS)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you solve it?
Using the code suggestion here:
#88345 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bjarki-andreasen - can you address this comment as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
417e7ab
to
90c81b4
Compare
Oh wow - I almost forgot about this one. @bjarki-andreasen - I'll stamp if you can address the latest comments. |
d5e44ef
to
60a0a34
Compare
Bring in the following changes from the rust module: dd73abc242e zephyr: work: Allow struct to have a additional fields 898662c0889 zephyr-sys: Handle deflected GPIO constants 174ded53bd6 ci-manifest.yml: Add cmsis_6 This should fix CI for Rust. In addition, the allowing struct to have addition fields should unblock zephyrproject-rtos#88345. Signed-off-by: David Brown <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I"m kind of wondering why we don't use k_timeout_t
here
Bring in the following changes from the rust module: dd73abc242e zephyr: work: Allow struct to have a additional fields 898662c0889 zephyr-sys: Handle deflected GPIO constants 174ded53bd6 ci-manifest.yml: Add cmsis_6 This should fix CI for Rust. In addition, the allowing struct to have addition fields should unblock #88345. Signed-off-by: David Brown <[email protected]>
60a0a34
to
5568211
Compare
Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers. Signed-off-by: Bjarki Arge Andreasen <[email protected]>
Add workqueue work timeout test to work_queue test suite. Signed-off-by: Bjarki Arge Andreasen <[email protected]>
5568211
to
495c50a
Compare
|
Its finally all green! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refresh +1
Bring in the following changes from the rust module: dd73abc242e zephyr: work: Allow struct to have a additional fields 898662c0889 zephyr-sys: Handle deflected GPIO constants 174ded53bd6 ci-manifest.yml: Add cmsis_6 This should fix CI for Rust. In addition, the allowing struct to have addition fields should unblock zephyrproject-rtos#88345. (cherry picked from commit bd4d3f8) Original-Signed-off-by: David Brown <[email protected]> GitOrigin-RevId: bd4d3f8 Cr-Build-Id: 8712950611002603889 Cr-Build-Url: https://cr-buildbucket.appspot.com/build/8712950611002603889 Copybot-Job-Name: zephyr-main-copybot-downstream Change-Id: I4b33e14a3213625dd49ac96ade22216b4935a1ad Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/6622611 Commit-Queue: Dawid Niedźwiecki <[email protected]> Tested-by: ChromeOS Copybot <[email protected]> Reviewed-by: Dawid Niedźwiecki <[email protected]> Tested-by: Dawid Niedźwiecki <[email protected]>
Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers.
This is a far more permissive alternative to #87522. It allows blocking items, as long as they don't take longer than specified. Feature works on a per workqueue basis. If the workqueue is blocked, an ERR log will be printed and the work queue thread will be aborted.
Example output from test suite
if the aborted thread is the essential system workqueue thread, kernel explodes.