Commit 0db57a7

Add connection storms documentation
The sole goal of `ShardConnectionBackoffPolicy` is to fight connection storms, so this commit adds a connection storms section to `docs/faq.rst`.
1 parent 04e5eda commit 0db57a7

docs/faq.rst: 55 additions & 0 deletions

@@ -81,3 +81,58 @@ be specialized per statement by setting :attr:`.Statement.retry_policy`.

Retries are presently attempted on the same coordinator, but this may change in the future.

Please see :class:`.policies.RetryPolicy` for further details.


How to fight connection storms?
--------------------------------

Scylla employs a shard-per-core architecture to efficiently utilize CPU, memory, and cache.
Data and load are sharded not only across hosts but also across the individual shards within each host.

Every ScyllaDB driver routes queries to the specific shard responsible for the requested data,
which eliminates the need for routing logic on the server side.
To support this, the driver maintains a separate connection to every shard in the cluster.

The downside of this approach is the total number of connections that each host in the cluster must handle.
Typically, the number of connections on a given node can be calculated as:
number of clients × number of shards.

For example, a node with 64 shards and 1,000 clients would handle 64,000 connections.
In production clusters, the number of clients can reach hundreds of thousands, and nodes may need to handle up to 2 million connections.
Once established, these connections consume very few resources.

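The estimate above is plain arithmetic; a quick sketch (the shard and client counts are the illustrative figures from this section, not measurements):

.. code-block:: python

    # Every client opens one connection per shard on each node,
    # so the per-node total is simply clients * shards.
    def connections_per_node(clients: int, shards: int) -> int:
        return clients * shards

    print(connections_per_node(1_000, 64))   # 64,000 -- the example above
    print(connections_per_node(30_000, 64))  # 1,920,000 -- close to the 2 million ceiling mentioned
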
However, when nodes or clients are restarted, all these connections must be re-established.
The process of establishing a new connection involves several steps:

1. Establishing a TCP/TLS connection
2. Protocol negotiation
3. Authentication
4. Running discovery queries: ``SELECT * FROM system.local`` and ``SELECT * FROM system.peers``

If any of these steps fails, the connection is dropped and a new attempt is made to connect to the same shard, restarting the process.

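The retry behaviour just described can be pictured with a simplified sketch. This is illustrative only, not the driver's actual implementation; ``open_connection`` is a hypothetical callable standing in for steps 1–4:

.. code-block:: python

    def connect_to_shard(open_connection, max_attempts=5):
        # `open_connection` stands in for the TCP/TLS handshake, protocol
        # negotiation, authentication, and discovery queries; it raises
        # OSError if any step fails.
        for _ in range(max_attempts):
            try:
                return open_connection()
            except OSError:
                # The connection is dropped and, with no backoff in place,
                # the same shard is immediately retried -- many clients
                # doing this at once is what produces a connection storm.
                continue
        raise ConnectionError("giving up on this shard")
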
When a large number of clients attempt to open connections to the same node simultaneously, it can lead to:

1. CPU consumption spikes
2. Latency spikes
3. Service unavailability
4. Clients failing to create and initialize connections

To avoid these problems, it is important to limit the rate at which clients establish connections to nodes.

For this purpose, we introduced the ``ShardConnectionBackoffPolicy``; specifically, the ``LimitedConcurrencyShardConnectionBackoffPolicy``.
This policy does two things:

1. Introduces a backoff delay between each connection the driver creates to a host
2. Limits the number of pending connection attempts

For example, with a configuration allowing only ``1`` pending connection and a backoff of ``0.1`` seconds:

.. code-block:: python

    cluster = Cluster(
        shard_connection_backoff_policy=LimitedConcurrencyShardConnectionBackoffPolicy(
            backoff_policy=ConstantShardConnectionBackoffSchedule(0.1),
            max_concurrent=1,
        )
    )
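As a rough back-of-the-envelope check of what these settings imply (assuming, for simplicity, that connection setup time is negligible and attempts are strictly sequential per client):

.. code-block:: python

    # With max_concurrent=1 and a constant 0.1 s backoff, one client fills
    # its pool to a 64-shard node at roughly one connection per backoff
    # interval, instead of opening all 64 connections in a single burst.
    shards = 64
    backoff_s = 0.1
    approx_seconds_to_full_pool = shards * backoff_s
    print(round(approx_seconds_to_full_pool, 1))  # ~6.4 s per client, per node
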
