
Connection doesn't propagate information about being closed to Cluster #345

Open

Description

@Lorak-mmk

Discovered when investigating https://github.com/scylladb/scylla-dtest/issues/4364

When a node goes down it closes its client connections (probably not always? if it dies unexpectedly it has no way to), and the connection objects in the driver notice it. The logs look like this:

18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180404560) 127.0.10.1:9042> closed by server
18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180404560) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185696976) 127.0.10.1:9042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185696976) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185158224) 127.0.10.1:19042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185158224) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180402832) 127.0.10.1:19042> closed by server
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180402832) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042

The problem is that the information about those connections closing is not propagated anywhere: the driver still thinks it has a fully functioning connection pool, and if the dead node was the one the driver had its control connection open to, then the driver also still thinks it has a working control connection and keeps waiting for events.
The driver will notice that those connections are dead only when it tries to use them: to send a heartbeat, run a CQL query, refresh the schema, etc.
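Because heartbeats are one of those periodic "uses", the window during which the driver believes a closed pool is still healthy can at least be shortened by tightening the idle heartbeat settings. A minimal sketch, assuming the standard `Cluster` constructor parameters; the address and values are just illustrative, not a recommendation:

    from cassandra.cluster import Cluster

    # Heartbeats only run on idle connections, but they are one of the
    # operations that make the driver discover a dead connection, so a
    # shorter interval narrows the blind spot described above.
    cluster = Cluster(
        contact_points=["127.0.10.1"],
        idle_heartbeat_interval=5,  # seconds between heartbeats on idle connections
        idle_heartbeat_timeout=5,   # how long to wait for a heartbeat response
    )
    session = cluster.connect()

This is only a mitigation: between heartbeats the driver still routes to (and waits for events from) connections that are already closed.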

This is a problem in the following scenario (this is what https://github.com/scylladb/scylla-dtest/issues/4364 does; a reproduction sketch follows the list):

  • the cluster consists of 2 nodes (but the issue scales to any number of nodes, I think)
  • the driver has its control connection to node 1
  • node 1 is restarted - the driver doesn't notice it
  • node 2 is stopped
  • now the driver has no working pools and no control connection (but doesn't know it)
  • when a query is executed it fails: for node 2 because it is down, and for node 1 because the driver only then notices that the connection is closed
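A rough driver-side reproduction of the list above. It assumes a two-node cluster at 127.0.10.1 / 127.0.10.2 (placeholder addresses) and some external way to restart/stop nodes - in the dtest this is done through ccm, shown here only as comments:

    import time
    from cassandra.cluster import Cluster, NoHostAvailable

    cluster = Cluster(contact_points=["127.0.10.1", "127.0.10.2"])
    session = cluster.connect()
    # Assume the control connection ended up on node 1 (127.0.10.1).

    # Restart node 1 externally (e.g. `ccm node1 stop && ccm node1 start`).
    # The server closes the driver's connections, but nothing above the
    # Connection objects is told about it.
    time.sleep(30)

    # Stop node 2 externally (e.g. `ccm node2 stop`).
    # The driver now has no working pools and no control connection,
    # but still believes everything is fine.
    time.sleep(30)

    # Only here, when a query actually tries to use the connections,
    # does the failure surface.
    try:
        session.execute("SELECT release_version FROM system.local")
    except NoHostAvailable as e:
        print("driver only discovers the dead connections now:", e.errors)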

What the driver should do is propagate the information from the single connection upwards and reopen connections / mark the host as down; a rough sketch of that plumbing is below.
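Very roughly, this is the kind of propagation that seems to be missing. This is a sketch only: the class shapes and the hooks (owner_pool, on_server_close, mark_host_down, schedule_reconnect) are made-up placeholders standing in for whatever the driver's Connection / pool / Cluster layers actually expose, not real driver internals:

    # Hypothetical sketch of propagating a server-side close upwards.

    class Connection:
        def __init__(self, owner_pool):
            self.owner_pool = owner_pool  # hypothetical back-reference to the pool

        def handle_close(self):
            # ... existing socket teardown ("Closed socket to ...") ...
            # Missing piece: tell the owner instead of closing silently.
            if self.owner_pool is not None:
                self.owner_pool.on_server_close(self)


    class HostConnectionPool:
        def __init__(self, host, cluster):
            self.host = host
            self.cluster = cluster
            self.connections = []

        def on_server_close(self, connection):
            # Drop the dead connection, then either replace it or, if the
            # whole pool is gone, mark the host as down so the cluster stops
            # routing to it and re-establishes the control connection if it
            # lived on this host.
            self.connections.remove(connection)
            if not self.connections:
                self.cluster.mark_host_down(self.host)
            else:
                self.cluster.schedule_reconnect(self.host)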

Labels

bug (Something isn't working), triage
