Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QUESTION] Correct way to handle TimeoutError due to EAGAIN from Redis::subscriber? #327

Closed
steve-lorimer opened this issue Feb 22, 2022 · 11 comments

Comments

@steve-lorimer
Copy link

Thanks for a great package!

Describe the problem

Attempting to obtain a Subscriber from a Redis connection fails with a TimeoutError exception being thrown, with the description being "Resource temporarily unavailable" (ie: errno EAGAIN)

terminate called after throwing an instance of 'sw::redis::TimeoutError'
  what():  Failed to get reply: Resource temporarily unavailable

Thread 2 "sg_redis" received signal SIGABRT, Aborted.

Stacktrace shows this is from Connection::recv as a result of Connection::_auth during call to Redis::subscriber

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7bda859 in __GI_abort () at abort.c:79
#2  0x0000000000425819 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x000000000066582a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x0000000000665895 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00000000006659e8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x724160 <typeinfo for sw::redis::TimeoutError>, dest=0x633a20 <sw::redis::TimeoutError::~TimeoutError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x0000000000420d97 in sw::redis::throw_error(redisContext const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .cold] ()
#7  0x000000000062b3cc in sw::redis::Connection::recv(bool) () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#8  0x000000000062b5de in sw::redis::Connection::_auth() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#9  0x000000000062b8d9 in sw::redis::Connection::_set_options() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#10 0x000000000062badd in sw::redis::Connection::Connection(sw::redis::ConnectionOptions const&) () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#11 0x0000000000631034 in sw::redis::ConnectionPool::create() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#12 0x00000000006349f4 in sw::redis::Redis::subscriber() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52

What is the correct way to handle this?

At the moment I'm just attempting to create a subscriber in a loop, and if a TimeoutError exception is thrown, I just try again, up to some arbitrarily picked max number of retries... it seems to work... but it also seems inelegant / a bit of a hack.

sw::redis::Subscriber subscriber(sw::redis::Redis& redis)
{
  for (int i = 0; i < kMaxSubscribeRetries; ++i)
  {
    try
    {
      sw::redis::Subscriber sub{redis.subscriber()};
      sub.subscribe("foo.bar");
      return sub;
    }
    catch (const sw::redis::TimeoutError&)
    {
      LOG_WARNING("timeout creating redis subscriber; sleeping for a bit and trying again");
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }
  throw std::runtime_error("failed to create redis subscriber");
}

What is the recommended way for obtaining a Subscriber please?

Environment:

  • OS: ubuntu
  • Compiler: gcc-11.2.0
  • hiredis version: 1.0.0
  • redis-plus-plus version: 1.3.3
@steve-lorimer
Copy link
Author

Hmm, actually the retry-loop didn't help... it still can't connect even after trying multiple times in a loop.

What's really weird is that the binary, when running on a different server, is able to connect fine; and redis-cli is able to connect from the problematic server... something to do with the combination of the server and the binary using redis-plus-plus is causing issues.

@wingunder
Copy link
Contributor

Hi @skebanga,

What is the recommended way for obtaining a Subscriber please?

Please see this post: #307 (comment)
Also read the chapter on Publish/Subscribe in the top level README.md and refer to the example in that chapter.

something to do with the combination of the server and the binary using redis-plus-plus is causing issues.

Try to see if the same libs are being used on all your servers. Run the following command on all your servers and compare the results carefully:

$ ldd your_binary

In order to be absolutely sure that your libs and binaries are all the same on all servers, you could compare their checksums:

$ ldd your_binary |awk -F'(' '{print $1}' |awk -F'=>' '{print $2}' |grep -v '^\s*$' |xargs sha256sum
$ sha256sum your_binary

You probably checked the connection (to make sure you don't have a network issue or a firewall problem) with redis-cli.
If not, please try the following from all servers as well:

$ redis-cli -h <redis_server> -p <redis_port> ping

Regards

@steve-lorimer
Copy link
Author

steve-lorimer commented Feb 22, 2022

Thanks for the comment!

I am statically linking the redis libs, so that's not an issue

# ldd test_redis
        linux-vdso.so.1 (0x00007ffeb956a000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2e50cf5000)
        libgcc_s.so.1 => /usr/local/gcc-11.2.0/lib64/libgcc_s.so.1 (0x00007f2e50cda000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2e50cb7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2e50ac5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2e50e4e000)

I have also managed to boil the issue down to the following, which shows that it is failing on issuing the AUTH command to redis:

#include <sw/redis++/redis++.h>
#include <iostream>
#include <chrono>

int main()
{
  sw::redis::ConnectionOptions opts("tcp://10.10.6.106:6379");
  
  opts.user = "default";
  opts.password = "XXXXXXXX";

  opts.socket_timeout = std::chrono::milliseconds(100);
  
  sw::redis::Redis r{opts}; 
  sw::redis::Subscriber sub{r.subscriber()};

  return 0;
}

It is working with redis_cli, but not with the above test app...

running both through strace I can see the following:

redis_cli:

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT}], 1, -1)   = 1 ([{fd=3, revents=POLLOUT}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPIDLE, [15], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPINTVL, [5], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPCNT, [3], 4) = 0
write(3, "*2\r\n$4\r\nAUTH\r\n$50\r\n4XXXXXXXX"..., 71) = 71
read(3, "+OK\r\n", 16384)               = 5

test_redis:

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT}], 1, -1)   = 1 ([{fd=3, revents=POLLOUT}])
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_SOCKET, SO_SNDTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
sendto(3, "*2\r\n$4\r\nAUTH\r\n$50\r\n4XXXXXXXX"..., 71, 0, NULL, 0) = 71
recvfrom(3, 0x7ffe9ddbbbd0, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

So major differences I can see are:

  • redis-cli only issues a single connect call, whereas redis-plus-plus calls connect twice (presumably because the first returns EINPROGRESS?)
  • redis-cli sets some additional socket options which redis-plus-plus doesn't
  • redis-cli uses write/read on the socker, while redis-plus-plus uses sendto/recvfrom

I'm a bit lost at this point tbh! :(

@steve-lorimer
Copy link
Author

I ran my test program on a server which works, and got the following strace output:

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT}], 1, -1)   = 1 ([{fd=3, revents=POLLOUT}])
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_SOCKET, SO_SNDTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
sendto(3, "*2\r\n$4\r\nAUTH\r\n$50\r\n4Vg2dTZVyuh2b"..., 71, 0, NULL, 0) = 71
recvfrom(3, "+OK\r\n", 16384, 0, NULL, NULL) = 5

So literally the only difference between the two is that the failing server's recvfrom is:

recvfrom(3, 0x7ffe9ca7aa40, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

whereas the working server's recvfrom is:

recvfrom(3, "+OK\r\n", 16384, 0, NULL, NULL) = 5

@steve-lorimer
Copy link
Author

Well - some progress it seems!

The culprit is opts.socket_timeout, and increasing it to a higher value means the recvfrom call succeeds.

I'll admit I'm no redis expert... but would it be fair to say it seems a bit weird that 100ms is too low?

@sewenew
Copy link
Owner

sewenew commented Feb 22, 2022

@skebanga Yes, I think the problem is that socket_timeout is too small for your redis server.

redis-cli by default might not set a timeout, so it does not have the problem. If you set socket_timeout to 0 when using redis-plus-plus, it should work fine too.

You’d better check if your redis server in a healthy status, or the network between your application and redis is not fast enough.

Regards

@steve-lorimer
Copy link
Author

@sewenew thanks for the comment! The only issue with setting a socket_timeout to 0 is that when I enter the consume loop it will never return, and so then I can never check my std::stop_token and break the loop when I want the thread to stop.

  while (!token.stop_requested()) 
  {
    try
    {
      sub.consume(); // without socket_timeout this call never returns if there are no published messages
    }
    catch (const sw::redis::TimeoutError&)
    {
      // by using socket_timeout we can catch the TimeoutError and check the std::stop_token
    }
    catch (const sw::redis::Error& e)
    {
      LOG_ERROR("redis error: ", e.what());
      break;
    }
  }

Ideally there would be a thread-safe way to cause sub.consume() to return, thereby allowing the thread to exit

@Mephalich
Copy link

It looks like same issue I have posted several days ago here (#323).
For now we have a workaround: just publish a fake message to one of the subscribed channels to rotate the loop.

@sewenew it appears some general solution to this will be highly appreciated. 👍

Best regards

@steve-lorimer
Copy link
Author

just publish a fake message to one of the subscribed channels to rotate the loop.

😭

@sewenew
Copy link
Owner

sewenew commented Feb 23, 2022

@skebanga

The only issue with setting a socket_timeout to 0 is that when I enter the consume loop it will never return, and so then I can never check my std::stop_token and break the loop when I want the thread to stop.

You don't need to set it to 0. Your problem is that your network latency is too big, and if the socket_timeout is smaller than it, it cannot do authentication and subscription before consuming messages.

In order to make it work you need to set socket_timeout larger than the network latency.

Ideally there would be a thread-safe way to cause sub.consume() to return, thereby allowing the thread to exit

As @Mephalich mentioned you can send a termination message to the channel. Once consume is called, it blocks until there's a message or timeout reaches.

Regards

@steve-lorimer
Copy link
Author

@sewenew thanks for the comments! Yes, understand the need to set timeout higher than network latency in this case...

However, I think in an ideal world it would be nice to be able to set zero timeout, as in I don't care how long the authentication and subscription takes, just so long as it completes... but also be able to effect a break out of the consume loop when I want to shut down.

Forcing the user to have to set a timeout just so they can exit the loop is awkward...

Anyway, I'll close the issue, because my question has been answered (even if I don't like the answer!) :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants