RST-13741 Adding service client timeouts #48
base: locus-noetic-devel
Conversation
Screencast.from.29-07-25.20.44.44.webm

The failing tests are not due to this PR. I reverted my branch to
Force-pushed from 945a697 to a9a8271.
    this,
    current_call,
    timeout_sec_));
}
Do you really need this thread? Couldn't we use the caller thread to time out?
As in:
if (immediate)
{
  processNextCall();
}

auto status = boost::cv_status::no_timeout;
{
  boost::mutex::scoped_lock lock(info->finished_mutex_);
  while (!info->finished_)
  {
    if (timeout_sec_ >= 0)
    {
      status = info->finished_condition_.wait_for(
          lock, boost::chrono::duration<double>(timeout_sec_));
      if (status == boost::cv_status::timeout)
      {
        bool finished = info->finished_;
        lock.unlock();
        if (info && !finished)
        {
          // It should be safe to call cancel from this thread.
          cancelCall(info);
        }
        // The lock has been released, so stop waiting here.
        break;
      }
    }
    else
    {
      info->finished_condition_.wait(lock);
    }
  }
}

info->call_finished_ = true;

if (status == boost::cv_status::timeout)
{
  // The call was cancelled above; report the timeout and fail the call.
  ROS_WARN_STREAM("Service call to " << service_name_ << " timed out after "
                  << timeout_sec_ << " seconds");
  return false;
}

if (info->exception_string_.length() > 0)
{
  ROS_ERROR("Service call failed: service [%s] responded with an error: %s",
            service_name_.c_str(), info->exception_string_.c_str());
}

return info->success_;
That will block us from making another call on that thread, but maybe that's OK?
Oh, never mind, you mean just do this in the call() method. Yeah, that makes sense.
-      if (current_call_->success_)
+      // If this message was cancelled, the resp_ object will no longer be pointing at a valid response object
+      // (we reset it to null)
+      if (current_call_ && current_call_->success_ && current_call_->resp_)
Isn't there a small chance that success_ might be set to true when a timeout happens, if the receiver thread is executing onResponseOkAndLength()? So we might need to set success_ to false, otherwise we return true on the caller side and access a default-constructed response.
Also, I think we might need to take call_queue_mutex_ on the caller thread side, since there is a chance things get written into current_call_. Or we just return false directly if the service timed out.
I see what you're saying. I tried changing the return in call() to
return info->success_ && status != boost::cv_status::timeout;
That way, even if onResponseOkAndLength() fires after we time out, we'll still return false here.
Sorry, help me understand the other point with the call_queue_mutex_? Is it something that we've added, or is it a flaw in how the code is anyway?
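(For reference, a minimal sketch of how the tail of call() could look with that guard in place. The names status, timeout_sec_, and info come from the caller-thread wait sketched earlier in this thread; the exact shape in the PR may differ.)

// Sketch only: after waiting on info->finished_condition_ with an optional timeout.
if (status == boost::cv_status::timeout)
{
  ROS_WARN_STREAM("Service call to " << service_name_ << " timed out after "
                  << timeout_sec_ << " seconds");
}
if (info->exception_string_.length() > 0)
{
  ROS_ERROR("Service call failed: service [%s] responded with an error: %s",
            service_name_.c_str(), info->exception_string_.c_str());
}
// Never report success for a call we gave up on, even if a late
// onResponseOkAndLength() flipped success_ to true after the timeout.
return info->success_ && status != boost::cv_status::timeout;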
Another thing to consider: let's say the call takes forever and times out for the caller, so our application calls it again. While we're waiting for the second service call, the first call returns. Since resp_ is valid again, we might get a service response from the previous call.
Do we think that's a problem?
the other point with the call_queue_mutex_? Is it something that we've added, or is it a flaw in how the code is anyway?

I am not sure if I am reading this correctly, but assuming it does not get stuck:
- The caller thread enqueues the request and waits for a signal through the condition variable finished_condition_.
- Once we get a reply, the receiver thread writes the response: it sets success_ while holding call_queue_mutex_:
{
boost::mutex::scoped_lock lock(call_queue_mutex_);
if ( ok != 0 ) {
current_call_->success_ = true;
} else {
current_call_->success_ = false;
}
}
- It then calls onResponse(), which reads and writes current_call_ while holding call_queue_mutex_:
{
boost::mutex::scoped_lock queue_lock(call_queue_mutex_);
// If this message was cancelled, the resp_ object will no longer be pointing at a valid response object
// (we reset it to null)
if (current_call_ && current_call_->success_ && current_call_->resp_)
{
*current_call_->resp_ = SerializedMessage(buffer, size);
}
else
{
current_call_->exception_string_ = std::string(reinterpret_cast<char*>(buffer.get()), size);
}
}
- And lastly, it calls callFinished(), which wakes the caller thread:
{
boost::mutex::scoped_lock queue_lock(call_queue_mutex_);
boost::mutex::scoped_lock finished_lock(current_call_->finished_mutex_);
ROS_DEBUG_NAMED("superdebug", "Client to service [%s] call finished with success=[%s]", service_name_.c_str(), current_call_->success_ ? "true" : "false");
current_call_->finished_ = true;
current_call_->finished_condition_.notify_all();
current_call_->call_finished_ = true;
saved_call = current_call_;
current_call_ = CallInfoPtr();
...
}
After that block, the receiver thread no longer has access to the CallInfoPtr that the caller thread reads/accesses. But with the timeout, we can no longer ensure that this will have happened.
So all the accesses to info after we wake up from the timeout are currently not protected, as the receiver might still be doing work on it. So cancelCall() isn't actually safe to call. 🤔
Does this make sense to you?
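(To make that concern concrete, here is one possible way the caller thread could cancel a timed-out call safely. This is only a sketch, assuming the wait loop has already released info->finished_mutex_; it is not necessarily what this PR ends up doing.)

// Hypothetical timeout path inside call(): serialise the cancel against the
// receiver thread by taking the same call_queue_mutex_ it holds in
// onResponse()/callFinished().
{
  boost::mutex::scoped_lock queue_lock(call_queue_mutex_);

  // With call_queue_mutex_ held, the receiver thread cannot be mid-way
  // through writing into current_call_, so it is safe to detach it here.
  if (current_call_ == info)
  {
    // Make sure a late response is never deserialised into the caller's
    // response object, which may be destroyed as soon as call() returns.
    current_call_->resp_ = nullptr;
  }

  // Same lock order as callFinished(): call_queue_mutex_, then finished_mutex_,
  // so this should not introduce a new deadlock path.
  boost::mutex::scoped_lock finished_lock(info->finished_mutex_);
  info->finished_ = true;
  info->call_finished_ = true;
}
return false;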
Another thing to consider: let's say the call takes forever and times out for the caller, so our application calls it again. While we're waiting for the second service call, the first call returns. Since resp_ is valid again, we might get a service response from the previous call.
Do we think that's a problem?
I don't think resp_ will be valid. Per request we have:
CallInfoPtr info(boost::make_shared<CallInfo>());
info->req_ = req;
info->resp_ = &resp;
info->success_ = false;
info->finished_ = false;
info->call_finished_ = false;
info->caller_thread_id_ = boost::this_thread::get_id();
So, the receiver thread will have the old CallInfoPtr (a pointer to something we have let go of, since it copies it?). We will, however, probably wake up once it signals the condition variable, but it will basically have success_ = false?
I am also setting resp_ explicitly to nullptr in cancelCall and then checking for that value before I deserialise the message into it.
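(Roughly, and as a sketch only: the reset side of that could look like the stock roscpp cancelCall() with one extra line. The exact PR implementation may differ, and whether this needs to happen under call_queue_mutex_ is part of the discussion above.)

// Sketch: cancelCall() based on the stock roscpp version, with the resp_
// reset described above so a late onResponse() sees a null resp_ and skips
// deserialising into the caller's response object.
void ServiceServerLink::cancelCall(const CallInfoPtr& info)
{
  CallInfoPtr local = info;
  {
    boost::mutex::scoped_lock lock(local->finished_mutex_);
    local->resp_ = nullptr;  // extra line: drop the pointer to the caller's response
    local->finished_ = true;
    local->finished_condition_.notify_all();
  }
}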
Force-pushed from 0354ba0 to 3480213.
OK, after chatting with Carlos a bunch offline, I made things more robust by handling the case where we time out but the server has crashed or restarted. Only the server sending data back to us could progress us to the next service call locally, so even though we weren't blocking, we were just building up our call queue endlessly in that scenario.

I am going to just let this run all day. I have one server and seven clients. Each client is running on a loop of service calls, most of which have timeouts, and one which doesn't. I occasionally kill the server (this will cause spam in the test nodes) and bring it back.

Screencast.from.31-07-25.14.03.32.webm
As before, failing tests are unrelated.
@carlos-m159 what's your opinion on this one? Worth merging, or too dangerous/not enough benefit?
(Obviously, I'd get more reviews first.)
https://locusrobotics.atlassian.net/browse/RST-13741
Will add tests and get a bit more of the doxygen in place.