Description
The bug was discovered during a run of a local SCT test with the AWS backend: https://github.com/scylladb/scylla-cluster-tests#run-test-locally-with-aws-backend.
That test could be used to reproduce the issue, but I wouldn't recommend it, since it runs for at least 15 minutes before crashing.
I prepared a much simpler (but analogous) reproducer which can be run locally using docker.
How to reproduce
Set up a 2-node Scylla cluster
Use a compose.yml file with the following contents:
version: '3'
services:
  some-scylla:
    image: scylladb/scylla
    container_name: some-scylla
    ports:
      - 127.0.0.1:9042:9042
  some-scylla2:
    image: scylladb/scylla
    container_name: some-scylla2
    ports:
      - 127.0.0.2:9042:9042
    command: --seeds=some-scylla
Simply run docker compose up -d and wait until the cluster is up and running. Notice that we only map port 9042 (the non-shard-aware CQL port).
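If you want to confirm that both nodes are reachable before breaking anything, a quick check like the sketch below can help (this is my own addition, not part of the original setup; it assumes the python driver used later in this report is already installed):

# Optional sanity check: connect over the regular CQL port and list the
# peers each contact point knows about.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1', '127.0.0.2'])
session = cluster.connect()
for row in session.execute("SELECT peer FROM system.peers"):
    print("peer:", row.peer)
cluster.shutdown()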
Drop packets with destination port 19042 (CQL shard-aware port)
Before my recent PR to SCT, the AWS cluster security groups wouldn't expose port 19042. This resulted in connect timeouts (SCT sets connect_timeout to 100s) in the python driver. We can simulate this behaviour locally with:
sudo iptables -A INPUT -p tcp --destination-port 19042 -j DROP
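To double-check that the rule actually drops the traffic (rather than rejecting it), a small probe like the one below should time out instead of getting connection refused (again, my own helper, not part of the reproducer):

# Probe the shard-aware port: with the DROP rule in place the connect
# attempt should hang until the timeout rather than be refused.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(3)
try:
    s.connect(("127.0.0.1", 19042))
    print("connected - the DROP rule does not seem to be active")
except socket.timeout:
    print("connect timed out - packets are being dropped as intended")
except ConnectionRefusedError:
    print("connection refused - port unreachable, but packets are not being dropped")
finally:
    s.close()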
Run a simple scenario with the python driver
Contents of the script (reproducer.py):
#!/usr/bin/env python
import logging
import sys
root = logging.getLogger()
root.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)
import cassandra.cluster
# Set connect_timeout to 10 seconds so both of the workers are occupied
# for such duration when trying to connect to shard-aware ports.
cluster = cassandra.cluster.Cluster(['127.0.0.1', '127.0.0.2'], connect_timeout=10)
session = cluster.connect()
# Execute the query with a timeout < connect_timeout.
session.execute("ALTER KEYSPACE system_auth WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }", timeout=3)
Run the script:
chmod +x ./reproducer.py && ./reproducer.py
(cleanup) Remember to remove the iptables entry:
sudo iptables -D INPUT -p tcp --destination-port 19042 -j DROP
Issue description
The script defines a cluster with two contact points (defined in the compose.yml file). It connects to the cluster (with connect_timeout=10s), and then tries to execute a query with a timeout smaller than connect_timeout.
What we get is an unhandled exception:
Traceback (most recent call last):
File "/home/mikolajuzarski/python_playground/reproduce_hang/reproducer.py", line 23, in <module>
session.execute("ALTER KEYSPACE system_auth WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }", timeout=3)
File "cassandra/cluster.py", line 2708, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 5048, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=172.23.0.2:9042
However, the error message is misleading, because it suggests that something bad happened during query execution.
What really happens is that during session.connect() the driver tries to open connections to specific shards of the nodes using port 19042. It delegates these tasks to a ThreadPoolExecutor, which by default uses 2 workers.
Since all packets with destination port 19042 are silently dropped, both workers are occupied for 10 seconds (connect_timeout). Meanwhile, we asynchronously try to execute the query, but there are no workers available to handle it - the query times out and the program crashes with an unhandled exception.
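The starvation itself can be illustrated without Scylla or the driver at all. The sketch below is my own illustration using only the standard library: two long "connect" tasks occupy both workers of a 2-worker pool, so a query task submitted afterwards cannot even start before its 3-second deadline:

# Minimal sketch of the starvation: 2 workers blocked by long "connects",
# so a third task cannot run before its deadline.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fake_connect():
    # Stands in for a connection attempt that hangs until connect_timeout.
    time.sleep(10)

def fake_query():
    return "row"

with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(fake_connect)
    executor.submit(fake_connect)
    query_future = executor.submit(fake_query)
    try:
        print(query_future.result(timeout=3))
    except TimeoutError:
        # Analogous to the OperationTimedOut seen in the reproducer.
        print("query timed out: no free worker within 3s")
    # Note: leaving the with-block waits for the blocked "connect" tasks.

As a side note, the Cluster constructor accepts an executor_threads parameter; raising it would presumably avoid the starvation in this particular reproducer, but it wouldn't make the error message any less misleading.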