Description
The bug was discovered during a run of a local SCT test with the AWS backend: https://github.com/scylladb/scylla-cluster-tests#run-test-locally-with-aws-backend.
That test could be used to reproduce the issue, but I wouldn't recommend it, since it runs for at least 15 minutes before crashing.
I prepared a much simpler (but analogous) reproducer which can be run locally using docker.
How to reproduce
Set up a 2-node Scylla cluster
Use a compose.yml file with the following contents:
version: '3'
services:
  some-scylla:
    image: scylladb/scylla
    container_name: some-scylla
    ports:
      - 127.0.0.1:9042:9042
  some-scylla2:
    image: scylladb/scylla
    container_name: some-scylla2
    ports:
      - 127.0.0.2:9042:9042
    command: --seeds=some-scylla
Simply run docker compose up -d and wait until the cluster is up and running. Notice that we only map port 9042 (the non-shard-aware CQL port).
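If you want to confirm that both nodes are reachable before breaking anything, a quick check like the sketch below can help (this is my own addition, not part of the original setup; it assumes the python driver used later in this report is already installed):

# Optional sanity check: connect over the regular CQL port and list the
# peers each contact point knows about.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1', '127.0.0.2'])
session = cluster.connect()
for row in session.execute("SELECT peer FROM system.peers"):
    print("peer:", row.peer)
cluster.shutdown()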
Drop packets with destination port 19042 (CQL shard-aware port)
Before my recent PR to SCT, the AWS cluster security groups wouldn't expose port 19042. This resulted in connect timeouts (SCT sets connect_timeout to 100s) in the python driver. We can simulate this behaviour locally with:
sudo iptables -A INPUT -p tcp --destination-port 19042 -j DROP
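To double-check that the rule actually drops the traffic (rather than rejecting it), a small probe like the one below should time out instead of getting connection refused (again, my own helper, not part of the reproducer):

# Probe the shard-aware port: with the DROP rule in place the connect
# attempt should hang until the timeout rather than be refused.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(3)
try:
    s.connect(("127.0.0.1", 19042))
    print("connected - the DROP rule does not seem to be active")
except socket.timeout:
    print("connect timed out - packets are being dropped as intended")
except ConnectionRefusedError:
    print("connection refused - port unreachable, but packets are not being dropped")
finally:
    s.close()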
Run a simple scenario with the python driver
Contents of the script (reproducer.py):
#!/usr/bin/env python
import logging
import sys
root = logging.getLogger()
root.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)
import cassandra.cluster
# Set connect_timeout to 10 seconds so both of the workers are occupied
# for such duration when trying to connect to shard-aware ports.
cluster = cassandra.cluster.Cluster(['127.0.0.1', '127.0.0.2'], connect_timeout=10)
session = cluster.connect()
# Execute the query with a timeout < connect_timeout.
session.execute("ALTER KEYSPACE system_auth WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }", timeout=3)
Run the script:
chmod +x ./reproducer.py && ./reproducer.py
(cleanup) Remember to remove the iptables entry:
sudo iptables -D INPUT -p tcp --destination-port 19042 -j DROP
Issue description
The script defines a cluster with two contact points (defined in the compose.yml file). It connects to the cluster (with connect_timeout=10s), and then tries to execute a query with a timeout smaller than connect_timeout.
What we get is an unhandled exception:
Traceback (most recent call last):
File "/home/mikolajuzarski/python_playground/reproduce_hang/reproducer.py", line 23, in <module>
session.execute("ALTER KEYSPACE system_auth WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }", timeout=3)
File "cassandra/cluster.py", line 2708, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 5048, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=172.23.0.2:9042
However, the error message is misleading, because it suggests that something bad happened during query execution.
What really happens is that during session.connect() the driver tries to open connections to specific shards of the nodes using port 19042. It delegates these tasks to a ThreadPoolExecutor, which by default uses 2 workers.
Since all packets with destination port 19042 are silently dropped, both workers are occupied for 10 seconds (connect_timeout). Meanwhile, we asynchronously try to execute the query, but there are no workers available to handle it - the query times out and the program crashes with an unhandled exception.
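The starvation itself can be illustrated without Scylla or the driver at all. The sketch below is my own illustration using only the standard library: two long "connect" tasks occupy both workers of a 2-worker pool, so a query task submitted afterwards cannot even start before its 3-second deadline:

# Minimal sketch of the starvation: 2 workers blocked by long "connects",
# so a third task cannot run before its deadline.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fake_connect():
    # Stands in for a connection attempt that hangs until connect_timeout.
    time.sleep(10)

def fake_query():
    return "row"

with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(fake_connect)
    executor.submit(fake_connect)
    query_future = executor.submit(fake_query)
    try:
        print(query_future.result(timeout=3))
    except TimeoutError:
        # Analogous to the OperationTimedOut seen in the reproducer.
        print("query timed out: no free worker within 3s")
    # Note: leaving the with-block waits for the blocked "connect" tasks.

As a side note, the Cluster constructor accepts an executor_threads parameter; raising it would presumably avoid the starvation in this particular reproducer, but it wouldn't make the error message any less misleading.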