RST-13741 Adding service client timeouts #48


Open · ayrton04 wants to merge 9 commits into locus-noetic-devel from RST-13741-service-client-timeout

Conversation

ayrton04 (Member)

https://locusrobotics.atlassian.net/browse/RST-13741

Will add tests and get a bit more of the doxygen in place.

@ayrton04 (Member Author)

Screencast.from.29-07-25.20.44.44.webm

@ayrton04 (Member Author)

The failing tests are not due to this PR. I reverted my branch to locus-noetic-devel, and I get the same failures. I'll fix them anyway.

@ayrton04 force-pushed the RST-13741-service-client-timeout branch from 945a697 to a9a8271 on July 30, 2025 09:45
this,
current_call,
timeout_sec_));
}
@carlos-m159 · Jul 30, 2025

Do you really need this thread? Couldn't we use the caller thread to time out?

As in:


  if (immediate)
  {
    processNextCall();
  }

  auto status = boost::cv_status::no_timeout;
  {
    boost::mutex::scoped_lock lock(info->finished_mutex_);

    while (!info->finished_)
    {
      if (timeout_sec_ >= 0)
      {
        status = info->finished_condition_.wait_for(
            lock, boost::chrono::duration<double>(timeout_sec_));
        if (status == boost::cv_status::timeout)
        {
          if (info->finished_)
          {
            // Finished just as the wait expired; don't treat it as a timeout
            status = boost::cv_status::no_timeout;
            break;
          }
          // It should be safe to call cancelCall() from this thread.
          lock.unlock();
          cancelCall(info);
          break;
        }
      }
      else
      {
        info->finished_condition_.wait(lock);
      }
    }
  }

  info->call_finished_ = true;

  if (status == boost::cv_status::timeout)
  {
    // The call was cancelled above; report the timeout to the caller
    ROS_WARN_STREAM("Service call to " << service_name_ << " timed out after "
                                       << timeout_sec_ << " seconds");
    return false;
  }

  if (info->exception_string_.length() > 0)
  {
    ROS_ERROR("Service call failed: service [%s] responded with an error: %s",
              service_name_.c_str(), info->exception_string_.c_str());
  }

  return info->success_;

ayrton04 (Member Author)

That will block us from making another call on that thread, but maybe that's OK?

ayrton04 (Member Author)

Oh, never mind, you mean just do this in the call method. Yeah, that makes sense.

-    if (current_call_->success_)
+    // If this message was cancelled, the resp_ object will no longer be pointing at a valid response object
+    // (we reset it to null)
+    if (current_call_ && current_call_->success_ && current_call_->resp_)
@carlos-m159 · Jul 30, 2025

Isn't there a small chance that success_ might be set to true when a timeout happens, if the receiver thread is executing onResponseOKAndLength()? We might need to set success_ to false; otherwise we receive the output true on the caller side and access a default-constructed response.

Also, I think we might need to hold call_queue_mutex_ on the caller thread side, since there is a chance things get written into current_call_. Or we just return false directly if the service timed out.
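
A minimal sketch of the "set success_ to false" option, combined with returning false directly (hypothetical placement inside call()'s timeout branch; names follow the snippets in this thread):

  // Hypothetical: clear success_ under call_queue_mutex_ so a racing
  // onResponseOKAndLength() on the receiver thread cannot make the caller
  // observe success together with a default-constructed response.
  if (status == boost::cv_status::timeout)
  {
    boost::mutex::scoped_lock lock(call_queue_mutex_);
    info->success_ = false;
    return false;
  }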

ayrton04 (Member Author)

I see what you're saying. I tried changing the return in call() to

return info->success_ && status != boost::cv_status::timeout;

That way, even if onResponseOKAndLength() fires after we time out, we'll still return false here.

Sorry, help me understand the other point with the call_queue_mutex_? Is it something that we've added, or is it a flaw with how the code is anyway?

ayrton04 (Member Author)

Another thing to consider: let's say the call takes forever and times out for the caller, so our application calls it again. While we're waiting for the second service call, the first call returns. Since resp_ is valid again, we might get a service response from the previous call.

Do we think that's a problem?

@carlos-m159 · Jul 30, 2025

> the other point with the call_queue_mutex_? Is it something that we've added, or is it a flaw with how the code is anyway?

I am not sure if I am reading this correctly but, assuming it does not get stuck:

  • Caller thread enqueues request, and waits for a signal through the condition variable finished_condition_.

  • Once we get a reply, the receiver thread writes the response: it sets success_ while holding call_queue_mutex_:

{
  boost::mutex::scoped_lock lock(call_queue_mutex_);
  if (ok != 0) {
    current_call_->success_ = true;
  } else {
    current_call_->success_ = false;
  }
}
  • Calls onResponse(), which reads and writes current_call_ while holding call_queue_mutex_:
{
    boost::mutex::scoped_lock queue_lock(call_queue_mutex_);

    // If this message was cancelled, the resp_ object will no longer be pointing at a valid response object
    // (we reset it to null)
    if (current_call_ && current_call_->success_ && current_call_->resp_)
    {
      *current_call_->resp_ = SerializedMessage(buffer, size);
    }
    else
    {
      current_call_->exception_string_ = std::string(reinterpret_cast<char*>(buffer.get()), size);
    }
}
  • And lastly, calls callFinished() which wakes the caller thread:
{
    boost::mutex::scoped_lock queue_lock(call_queue_mutex_);
    boost::mutex::scoped_lock finished_lock(current_call_->finished_mutex_);

    ROS_DEBUG_NAMED("superdebug", "Client to service [%s] call finished with success=[%s]", service_name_.c_str(), current_call_->success_ ? "true" : "false");

    current_call_->finished_ = true;
    current_call_->finished_condition_.notify_all();
    current_call_->call_finished_ = true;

    saved_call = current_call_;
    current_call_ = CallInfoPtr();
...
}

After that block, the receiver thread no longer has access to the CallInfoPtr that the caller thread reads/accesses. But with the timeout, we can no longer ensure that this handoff will happen.
So all the accesses to the info after we wake up from the timeout are currently not protected, as the receiver might still be doing work on it. So cancelCall() isn't actually safe to call. 🤔

Does this make sense to you?
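
For illustration, a minimal sketch of what a lock-protected cancelCall() could look like (hypothetical body; member names and lock order follow the snippets above):

void ServiceServerLink::cancelCall(const CallInfoPtr& info)
{
  // Hypothetical: serialise cancelCall() against the receiver thread by
  // taking the locks in the same order callFinished() does, so the receiver
  // cannot be mid-write on this CallInfo while we tear it down.
  boost::mutex::scoped_lock queue_lock(call_queue_mutex_);
  boost::mutex::scoped_lock finished_lock(info->finished_mutex_);

  info->resp_ = nullptr;   // a late response now has nowhere to land
  info->success_ = false;
  info->finished_ = true;
  info->finished_condition_.notify_all();
}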

@carlos-m159 · Jul 30, 2025

> Another thing to consider: let's say the call takes forever and times out for the caller, so our application calls it again. While we're waiting for the second service call, the first call returns. Since resp_ is valid again, we might get a service response from the previous call.
> Do we think that's a problem?

I don't think resp_ will be valid. Per request we have:

  CallInfoPtr info(boost::make_shared<CallInfo>());
  info->req_ = req;
  info->resp_ = &resp;
  info->success_ = false;
  info->finished_ = false;
  info->call_finished_ = false;
  info->caller_thread_id_ = boost::this_thread::get_id();

So the receiver thread will hold the old CallInfoPtr (a copy of the pointer to the object we let go of). We will probably still wake up when it signals the condition variable, but it will basically have success_ = false?

@ayrton04 (Member Author) · Jul 31, 2025

I am also setting resp_ explicitly to nullptr in cancelCall and then checking for that value before I deserialise the message into it.

@ayrton04 force-pushed the RST-13741-service-client-timeout branch from 0354ba0 to 3480213 on July 30, 2025 17:01
@ayrton04 (Member Author)

[image]

Failing tests are all totally unrelated things that have been broken (seemingly) for ages.

@ayrton04 (Member Author)

OK, after chatting with Carlos a bunch offline, I made things more robust by handling the case where we time out, but the server has crashed or restarted. Only the server sending data back to us could progress us to the next service call locally, so even though we weren't blocking, we were just building up our call queue endlessly in that scenario.
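
For illustration, a rough sketch of the kind of cleanup that addresses this (hypothetical code; it assumes call_queue_ were a std::list<CallInfoPtr> guarded by call_queue_mutex_, and the actual change may differ):

{
  // Hypothetical: on timeout, drop the call from the local queue so it cannot
  // grow without bound when the server has crashed and will never reply.
  boost::mutex::scoped_lock lock(call_queue_mutex_);
  call_queue_.remove(info);
  if (current_call_ == info)
  {
    // The server will never answer this call; unblock the queue so the next
    // queued call can proceed.
    current_call_.reset();
  }
}
processNextCall();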

I am going to just let this run all day. I have one server and seven clients. Each client runs a loop of service calls, most of which have timeouts and one of which doesn't. I occasionally kill the server (this will cause spam in the test nodes) and bring it back.
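
For illustration, one such client loop might look like this (hypothetical test node, not the PR's actual test code; the timeout itself would be whatever the client was configured with via this PR's new option):

#include <ros/ros.h>
#include <std_srvs/Trigger.h>

// Hypothetical test client: hammers a service in a loop and logs failures,
// which with this PR now include calls that hit the configured timeout.
int main(int argc, char** argv)
{
  ros::init(argc, argv, "timeout_test_client");
  ros::NodeHandle nh;
  ros::ServiceClient client = nh.serviceClient<std_srvs::Trigger>("test_service");

  ros::Rate rate(2.0);
  while (ros::ok())
  {
    std_srvs::Trigger srv;
    if (!client.call(srv))
    {
      ROS_WARN("Service call failed or timed out");
    }
    rate.sleep();
  }

  return 0;
}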

Screencast.from.31-07-25.14.03.32.webm

@ayrton04 (Member Author)

As before, failing tests are unrelated.

@ayrton04 (Member Author)

@carlos-m159 what's your opinion on this one? Worth merging, or too dangerous / not enough benefit?

@ayrton04 (Member Author)

(Obviously, I'd get more reviews first)
