be specialized per statement by setting :attr:`.Statement.retry_policy`.
Retries are presently attempted on the same coordinator, but this may change in the future.

Please see :class:`.policies.RetryPolicy` for further details.

How to fight connection storms?
-------------------------------

Scylla employs a shard-per-core architecture to efficiently utilize CPU, memory, and cache.
Data and load are effectively sharded not only across hosts but also across individual shards.

Every ScyllaDB driver routes queries to a specific shard responsible for the requested data.
This eliminates the need for routing logic on the server side.
To support this, the driver maintains a separate connection to every shard in the cluster.
+
95
+ The downside of this approach is the total number of connections that each host in the cluster must handle.
96
+ Typically, the number of connections on a given node can be calculated as :
97
+ number of clients × number of shards.
98
+
99
+ For example, a node with 64 shards and 1 ,000 clients would handle 64 ,000 connections.
100
+ In production clusters, the number of clients can reach hundreds of thousands, and nodes may need to handle up to 2 million connections.
101
+ Once established, these connections consume very few resources.
102
+
103
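The arithmetic above can be sketched directly (the figures are illustrative, taken from the example):

.. code-block:: python

    # Illustrative back-of-the-envelope: each client opens one connection
    # per shard, so a node sees clients * shards connections in total.
    shards_per_node = 64
    clients = 1_000
    connections_per_node = clients * shards_per_node
    print(connections_per_node)  # 64000
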
However, when nodes or clients are restarted, all these connections must be re-established.
The process of establishing a new connection involves several steps:

1. Establishing a TCP/TLS connection
2. Protocol negotiation
3. Authentication
4. Running discovery queries: ``SELECT * FROM system.local`` and ``SELECT * FROM system.peers``

If any of these steps fails, the connection is dropped, and a new attempt is made to connect to the same shard, restarting the process.

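The drop-and-retry behaviour described above can be sketched as a loop (a hypothetical illustration; the names are not driver API):

.. code-block:: python

    # Hypothetical sketch of the per-shard connection sequence described
    # above. Each step (TCP/TLS, negotiation, auth, discovery) may raise;
    # any failure drops the connection and restarts the whole sequence.
    def connect_to_shard(shard_id, steps, max_attempts=5):
        for _ in range(max_attempts):
            try:
                for step in steps:
                    step(shard_id)
                return True  # all steps succeeded
            except ConnectionError:
                continue  # drop and retry the same shard from scratch
        return False
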
When a large number of clients attempt to open connections to the same node simultaneously, it can lead to:

1. CPU consumption spikes
2. Latency spikes
3. Service unavailability
4. Clients failing to create and initialize connections

To avoid these problems, it is important to limit the rate at which clients establish connections to nodes.

For this purpose, we introduced the ``ShardConnectionBackoffPolicy``, specifically the ``LimitedConcurrencyShardConnectionBackoffPolicy``.
This policy does two things:

1. Introduces a backoff delay between each connection the driver creates to a host
2. Limits the number of pending connection attempts

For example, with a configuration allowing only ``1`` pending connection and a backoff of ``0.1`` seconds:

.. code-block:: python

    cluster = Cluster(
        shard_connection_backoff_policy=LimitedConcurrencyShardConnectionBackoffPolicy(
            backoff_policy=ConstantShardConnectionBackoffSchedule(0.1),
            max_concurrent=1,
        )
    )
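As a rough consequence of such a configuration (illustrative arithmetic only, not a driver API): with one pending connection at a time and a constant ``0.1`` second backoff, filling the pool for a 64-shard node takes on the order of shards × backoff seconds per client:

.. code-block:: python

    # Illustrative estimate of pool ramp-up time under the example settings.
    shards_per_node = 64
    backoff_seconds = 0.1
    ramp_up_seconds = shards_per_node * backoff_seconds
    print(ramp_up_seconds)  # 6.4
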