RST-13741 Adding service client timeouts #48


Open · ayrton04 wants to merge 9 commits into locus-noetic-devel from RST-13741-service-client-timeout

Conversation

ayrton04 (Member)

https://locusrobotics.atlassian.net/browse/RST-13741

Will add tests and get a bit more of the doxygen in place.

@ayrton04 (Member Author)

Screencast.from.29-07-25.20.44.44.webm

@ayrton04 (Member Author)

The failing tests are not due to this PR. I reverted my branch to locus-noetic-devel, and I get the same failures. I'll fix them anyway.

@ayrton04 force-pushed the RST-13741-service-client-timeout branch from 945a697 to a9a8271 on July 30, 2025 09:45
this,
current_call,
timeout_sec_));
}
@carlos-m159 · Jul 30, 2025

Do you really need this thread? Couldn't we use the caller thread to time out?

As in:


  if (immediate)
  {
    processNextCall();
  }

  auto status = boost::cv_status::no_timeout;
  {
    boost::mutex::scoped_lock lock(info->finished_mutex_);

    while (!info->finished_)
    {
      if (timeout_sec_ >= 0)
      {
        status = info->finished_condition_.wait_for(
            lock, boost::chrono::duration<double>(timeout_sec_));
        if (status == boost::cv_status::timeout)
        {
          if (info->finished_)
          {
            // Finished just as the wait expired; don't treat it as a timeout
            status = boost::cv_status::no_timeout;
            break;
          }
          // It should be safe to call cancelCall() from this thread.
          lock.unlock();
          cancelCall(info);
          break;
        }
      }
      else
      {
        info->finished_condition_.wait(lock);
      }
    }
  }

  info->call_finished_ = true;

  if (status == boost::cv_status::timeout)
  {
    // The call was cancelled above; report the timeout to the caller
    ROS_WARN_STREAM("Service call to " << service_name_ << " timed out after "
                                       << timeout_sec_ << " seconds");
    return false;
  }

  if (info->exception_string_.length() > 0)
  {
    ROS_ERROR("Service call failed: service [%s] responded with an error: %s",
              service_name_.c_str(), info->exception_string_.c_str());
  }

  return info->success_;

ayrton04 (Member Author)

That will block us from making another call on that thread, but maybe that's OK?

ayrton04 (Member Author)

Oh, never mind, you mean just do this in the call method. Yeah, that makes sense.

-    if (current_call_->success_)
+    // If this message was cancelled, the resp_ object will no longer be pointing at a valid response object
+    // (we reset it to null)
+    if (current_call_ && current_call_->success_ && current_call_->resp_)
@carlos-m159 · Jul 30, 2025

Isn't there a small chance that success_ might be set to true when a timeout happens, if the receiver thread is executing onResponseOKAndLength()? We might need to set success_ to false; otherwise we receive the output true on the caller side and access a default-constructed response.

Also, I think we might need to hold call_queue_mutex_ on the caller thread side, since there is a chance things get written into current_call_. Or we just return false directly if the service timed out.
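
A minimal sketch of the "set success_ to false" option, combined with returning false directly (hypothetical placement inside call()'s timeout branch; names follow the snippets in this thread):

  // Hypothetical: clear success_ under call_queue_mutex_ so a racing
  // onResponseOKAndLength() on the receiver thread cannot make the caller
  // observe success together with a default-constructed response.
  if (status == boost::cv_status::timeout)
  {
    boost::mutex::scoped_lock lock(call_queue_mutex_);
    info->success_ = false;
    return false;
  }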

ayrton04 (Member Author)

I see what you're saying. I tried changing the return in call() to

return info->success_ && status != boost::cv_status::timeout;

That way, even if onResponseOKAndLength() fires after we time out, we'll still return false here.

Sorry, help me understand the other point with the call_queue_mutex_? Is it something that we've added, or is it a flaw with how the code is anyway?

ayrton04 (Member Author)

Another thing to consider: let's say the call takes forever and times out for the caller, so our application calls it again. While we're waiting for the second service call, the first call returns. Since resp_ is valid again, we might get a service response from the previous call.

Do we think that's a problem?

@carlos-m159 · Jul 30, 2025

> the other point with the call_queue_mutex_? Is it something that we've added, or is it a flaw with how the code is anyway?

I am not sure if I am reading this correctly but, assuming it does not get stuck:

  • Caller thread enqueues request, and waits for a signal through the condition variable finished_condition_.

  • Once we get a reply, the receiver thread writes the response: it sets success_ while holding call_queue_mutex_:

{
  boost::mutex::scoped_lock lock(call_queue_mutex_);
  if (ok != 0) {
    current_call_->success_ = true;
  } else {
    current_call_->success_ = false;
  }
}
  • Calls onResponse(), which reads and writes current_call_ while holding call_queue_mutex_:
{
    boost::mutex::scoped_lock queue_lock(call_queue_mutex_);

    // If this message was cancelled, the resp_ object will no longer be pointing at a valid response object
    // (we reset it to null)
    if (current_call_ && current_call_->success_ && current_call_->resp_)
    {
      *current_call_->resp_ = SerializedMessage(buffer, size);
    }
    else
    {
      current_call_->exception_string_ = std::string(reinterpret_cast<char*>(buffer.get()), size);
    }
}
  • And lastly, calls callFinished() which wakes the caller thread:
{
    boost::mutex::scoped_lock queue_lock(call_queue_mutex_);
    boost::mutex::scoped_lock finished_lock(current_call_->finished_mutex_);

    ROS_DEBUG_NAMED("superdebug", "Client to service [%s] call finished with success=[%s]", service_name_.c_str(), current_call_->success_ ? "true" : "false");

    current_call_->finished_ = true;
    current_call_->finished_condition_.notify_all();
    current_call_->call_finished_ = true;

    saved_call = current_call_;
    current_call_ = CallInfoPtr();
...
}

After that block, the receiver thread no longer has access to the CallInfoPtr that the caller thread reads/accesses. But with the timeout, we can no longer ensure that this handoff will happen.
So all the accesses to the info after we wake up from the timeout are currently not protected, as the receiver might still be doing work on it. So cancelCall() isn't actually safe to call. 🤔

Does this make sense to you?
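
For illustration, a minimal sketch of what a lock-protected cancelCall() could look like (hypothetical body; member names and lock order follow the snippets above):

void ServiceServerLink::cancelCall(const CallInfoPtr& info)
{
  // Hypothetical: serialise cancelCall() against the receiver thread by
  // taking the locks in the same order callFinished() does, so the receiver
  // cannot be mid-write on this CallInfo while we tear it down.
  boost::mutex::scoped_lock queue_lock(call_queue_mutex_);
  boost::mutex::scoped_lock finished_lock(info->finished_mutex_);

  info->resp_ = nullptr;   // a late response now has nowhere to land
  info->success_ = false;
  info->finished_ = true;
  info->finished_condition_.notify_all();
}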

@carlos-m159 · Jul 30, 2025

> Another thing to consider: let's say the call takes forever and times out for the caller, so our application calls it again. While we're waiting for the second service call, the first call returns. Since resp_ is valid again, we might get a service response from the previous call.
> Do we think that's a problem?

I don't think resp_ will be valid. Per request we have:

  CallInfoPtr info(boost::make_shared<CallInfo>());
  info->req_ = req;
  info->resp_ = &resp;
  info->success_ = false;
  info->finished_ = false;
  info->call_finished_ = false;
  info->caller_thread_id_ = boost::this_thread::get_id();

So the receiver thread will hold the old CallInfoPtr (a copy of the pointer to the object we let go of). We will probably still wake up when it signals the condition variable, but it will basically have success_ = false?

@ayrton04 (Member Author) · Jul 31, 2025

I am also setting resp_ explicitly to nullptr in cancelCall and then checking for that value before I deserialise the message into it.

@ayrton04 force-pushed the RST-13741-service-client-timeout branch from 0354ba0 to 3480213 on July 30, 2025 17:01
@ayrton04 (Member Author)

[image]

Failing tests are all totally unrelated things that have been broken (seemingly) for ages.

@ayrton04 (Member Author)

OK, after chatting with Carlos a bunch offline, I made things more robust by handling the case where we time out, but the server has crashed or restarted. Only the server sending data back to us could progress us to the next service call locally, so even though we weren't blocking, we were just building up our call queue endlessly in that scenario.
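
For illustration, a rough sketch of the kind of cleanup that addresses this (hypothetical code; it assumes call_queue_ were a std::list<CallInfoPtr> guarded by call_queue_mutex_, and the actual change may differ):

{
  // Hypothetical: on timeout, drop the call from the local queue so it cannot
  // grow without bound when the server has crashed and will never reply.
  boost::mutex::scoped_lock lock(call_queue_mutex_);
  call_queue_.remove(info);
  if (current_call_ == info)
  {
    // The server will never answer this call; unblock the queue so the next
    // queued call can proceed.
    current_call_.reset();
  }
}
processNextCall();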

I am going to just let this run all day. I have one server and seven clients. Each client runs a loop of service calls, most of which have timeouts and one of which doesn't. I occasionally kill the server (this will cause spam in the test nodes) and bring it back.
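
For illustration, one such client loop might look like this (hypothetical test node, not the PR's actual test code; the timeout itself would be whatever the client was configured with via this PR's new option):

#include <ros/ros.h>
#include <std_srvs/Trigger.h>

// Hypothetical test client: hammers a service in a loop and logs failures,
// which with this PR now include calls that hit the configured timeout.
int main(int argc, char** argv)
{
  ros::init(argc, argv, "timeout_test_client");
  ros::NodeHandle nh;
  ros::ServiceClient client = nh.serviceClient<std_srvs::Trigger>("test_service");

  ros::Rate rate(2.0);
  while (ros::ok())
  {
    std_srvs::Trigger srv;
    if (!client.call(srv))
    {
      ROS_WARN("Service call failed or timed out");
    }
    rate.sleep();
  }

  return 0;
}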

Screencast.from.31-07-25.14.03.32.webm

@ayrton04 (Member Author)

As before, failing tests are unrelated.

@ayrton04 (Member Author)

@carlos-m159 what's your opinion on this one? Worth merging, or too dangerous / not enough benefit?

@ayrton04 (Member Author)

(Obviously, I'd get more reviews first)
