Add shard connection backoff policy #473


Open: dkropachev wants to merge 8 commits into master from dk/add-connection-pool-delay

Conversation

dkropachev
Collaborator

@dkropachev dkropachev commented May 30, 2025

Introduce ShardConnectionBackoffPolicy and its implementations:

  • NoDelayShardConnectionBackoffPolicy: no delay or concurrency limit, ensures at most one pending connection per host+shard.
  • LimitedConcurrencyShardConnectionBackoffPolicy: limits pending concurrent connections to max_concurrent per host with backoff between shard connections.

The idea of this PR is to shift responsibility for scheduling HostConnection._open_connection_to_missing_shard from HostConnection to ShardConnectionBackoffPolicy, which gives ShardConnectionBackoffPolicy control over the process of opening connections.

This feature enables finer control over the process of creating per-shard connections, helping to prevent connection storms.
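
For illustration, a hedged usage sketch; the Cluster keyword shard_connection_backoff_policy and the policy's constructor keywords are assumptions inferred from code quoted later in this PR, not a confirmed API:

    from cassandra.cluster import Cluster
    from cassandra.policies import ConstantReconnectionPolicy
    from cassandra.policies import LimitedConcurrencyShardConnectionBackoffPolicy

    # Allow at most one pending shard connection per host, with 0.1 s
    # between attempts, taken from the reconnection policy's schedule.
    policy = LimitedConcurrencyShardConnectionBackoffPolicy(
        backoff_policy=ConstantReconnectionPolicy(delay=0.1),  # assumed keyword
        max_concurrent=1,
    )

    cluster = Cluster(
        contact_points=["127.0.0.1"],
        shard_connection_backoff_policy=policy,  # assumed keyword name
    )
    session = cluster.connect()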

Fixes: #483

Solutions tested and rejected

Naive delay

Description

The policy would introduce a delay instead of executing a connection creation request right away.
It would remember when the last connection creation was scheduled, and when scheduling the next request it would ensure that the time between the old and the new request's execution is at least the delay it is configured with.
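
A minimal sketch of this rejected approach (all names are hypothetical, for illustration only):

    import threading
    import time


    class NaiveDelayPolicy:
        # Rejected idea: space out only the *start* times of connection
        # creation requests; it does not track how many are still pending.
        def __init__(self, delay: float):
            self.delay = delay
            self.lock = threading.Lock()
            self.next_slot = time.monotonic()

        def next_delay(self) -> float:
            # Reserve the next execution slot at least `delay` after the
            # previously reserved one; return how long the caller must wait.
            with self.lock:
                now = time.monotonic()
                self.next_slot = max(self.next_slot + self.delay, now)
                return self.next_slot - now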

Results

It worked fine when the cluster operated normally.

However, during testing with artificial delays, it became clear that this approach breaks down when the time to establish a connection exceeds the configured delay.
In such cases, connections begin to pile up: the greater the connection initialization time relative to the delay, the faster they accumulate.

This becomes especially problematic during connection storms.
As the cluster becomes overloaded and connection initialization slows down, the delay-based throttling loses its effectiveness. In other words, the more the cluster suffers, the less effective the policy becomes.

Solution

The solution was to give the policy direct control over the connection initialization process.
This allows the policy to track how many connections are currently pending and apply delays after connections are created, rather than before.
That change ensures the policy remains effective even under heavy load.

This behavior is exactly what has been implemented in this PR.
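
A conceptual sketch of that design (simplified and hypothetical; the PR's actual implementation is the _ScopeBucket code quoted in the review below):

    import threading
    from typing import Callable, Iterator, List


    class PostCreationBackoff:
        # Workers pull queued connection requests; the backoff delay is
        # applied *after* each attempt finishes, so slow connections
        # cannot pile up, and the number of pending attempts is capped.
        def __init__(self, new_schedule: Callable[[], Iterator[float]],
                     max_concurrent: int, timer) -> None:
            self.new_schedule = new_schedule      # endless delay generator
            self.max_concurrent = max_concurrent  # cap on concurrent workers
            self.timer = timer                    # assumes timer.schedule(delay, fn, *args)
            self.lock = threading.Lock()
            self.pending = 0
            self.queue: List[Callable[[], None]] = []

        def submit(self, create_connection: Callable[[], None]) -> None:
            with self.lock:
                self.queue.append(create_connection)
                if self.pending < self.max_concurrent:
                    self.pending += 1
                    self.timer.schedule(0, self._worker, self.new_schedule())

        def _worker(self, schedule: Iterator[float]) -> None:
            with self.lock:
                if not self.queue:
                    self.pending -= 1  # no work left: retire this worker
                    return
                job = self.queue.pop(0)
            try:
                job()  # open the connection (may be slow)
            finally:
                # Back off *after* the attempt, then look for more work.
                self.timer.schedule(next(schedule), self._worker, schedule)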

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 4 times, most recently from 0b80886 to f62dfa3 on June 3, 2025 03:42
@dkropachev dkropachev changed the title to Add shard-aware reconnection policies with support for scheduling constraints Jun 3, 2025
@dkropachev dkropachev requested a review from Lorak-mmk June 3, 2025 03:45
@dkropachev dkropachev marked this pull request as ready for review June 3, 2025 03:45
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 2 times, most recently from dbb3ad1 to cbb4719 on June 4, 2025 17:53
@mykaul

mykaul commented Jun 5, 2025

Shouldn't we have some warning / info level log when backoff is taking place?

@dkropachev
Collaborator Author

dkropachev commented Jun 5, 2025

Shouldn't we have some warning / info level log when backoff is taking place?

I would rather not do it; it is not useful and can potentially pollute the log.

@Lorak-mmk

Do you know what caused the test failure?

  =================================== FAILURES ===================================
  ___________________________ TypeTests.test_datetype ____________________________
  
  self = <tests.unit.test_types.TypeTests testMethod=test_datetype>
  
      def test_datetype(self):
          now_time_seconds = time.time()
          now_datetime = datetime.datetime.fromtimestamp(now_time_seconds, tz=datetime.timezone.utc)
      
          # Cassandra timestamps in millis
          now_timestamp = now_time_seconds * 1e3
      
          # same results serialized
  >       self.assertEqual(DateType.serialize(now_datetime, 0), DateType.serialize(now_timestamp, 0))
  E       AssertionError: b'\x00\x00\x01\x97<\x17\xda\xf9' != b'\x00\x00\x01\x97<\x17\xda\xf8'

it is a unit test that at the first glance should be fully deterministic. Failure is unexpected.
From the assertion it looks like some off-by-one error.

@dkropachev
Collaborator Author

Do you know what caused the test failure?

  =================================== FAILURES ===================================
  ___________________________ TypeTests.test_datetype ____________________________
  
  self = <tests.unit.test_types.TypeTests testMethod=test_datetype>
  
      def test_datetype(self):
          now_time_seconds = time.time()
          now_datetime = datetime.datetime.fromtimestamp(now_time_seconds, tz=datetime.timezone.utc)
      
          # Cassandra timestamps in millis
          now_timestamp = now_time_seconds * 1e3
      
          # same results serialized
  >       self.assertEqual(DateType.serialize(now_datetime, 0), DateType.serialize(now_timestamp, 0))
  E       AssertionError: b'\x00\x00\x01\x97<\x17\xda\xf9' != b'\x00\x00\x01\x97<\x17\xda\xf8'

it is a unit test that at the first glance should be fully deterministic. Failure is unexpected. From the assertion it looks like some off-by-one error.

It is a known issue; the conversion goes wrong somewhere.

@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 4 times, most recently from a43ccd1 to b0fd069 on June 7, 2025 04:47
@dkropachev dkropachev requested a review from Lorak-mmk June 7, 2025 04:48
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 2 times, most recently from f47313f to 9dfd9ec on June 13, 2025 06:20

@Lorak-mmk Lorak-mmk left a comment


General comment: integration tests for new policies are definitely needed here.

@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 2 times, most recently from aebc540 to 61668de on June 13, 2025 17:58
@dkropachev dkropachev requested a review from Lorak-mmk June 13, 2025 18:02
@dkropachev dkropachev self-assigned this Jun 13, 2025
@mykaul

mykaul commented Jun 15, 2025

The patchset lacks documentation, which would have helped to understand the feature and when/how to use it. Is documentation a separate repo / commit?

@dkropachev dkropachev requested a review from Lorak-mmk July 3, 2025 06:05
@dkropachev
Collaborator Author

@Lorak-mmk, done. All comments addressed, please take a look.

@dkropachev dkropachev requested a review from mykaul July 3, 2025 06:06
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch from 40dc7b6 to 3d97ecd on July 3, 2025 06:38

@Lorak-mmk Lorak-mmk left a comment


It looks much better now, especially documentation-wise!
It would be good to describe this new policy in docs/ if we want people to use it.
Before merging it would be great to run some real-world scenario and see if the new policy can help with cluster overload. Is that something that could be done with SCT?

Note: I did not yet read "LimitedConcurrencyShardConnectionBackoffPolicy". I'll have a few more comments there.

Comment on lines +878 to +884
@abstractmethod
def schedule(
    self,
    host_id: str,
    shard_id: int,
    method: Callable[[], None],
) -> bool:


What will shard_id be for C* clusters? Will it be set to 0, or will it be (contrary to the type hint) None?
Could you point me to the place in the code responsible for this?

Collaborator Author

@dkropachev dkropachev Jul 3, 2025


This API works only for Scylla, when sharding information is present; in the rest of the cases it is not used.

Collaborator Author


List of places where it is called:

self._session.shard_connection_backoff_scheduler.schedule(
    self.host.host_id, shard_id, partial(self._open_connection_to_missing_shard, shard_id))

self._session.shard_connection_backoff_scheduler.schedule(
    self.host.host_id, shard_id, partial(self._open_connection_to_missing_shard, shard_id))

self._session.shard_connection_backoff_scheduler.schedule(
    self.host.host_id, connection.features.shard_id, partial(self._open_connection_to_missing_shard, connection.features.shard_id))

self._session.shard_connection_backoff_scheduler.schedule(
    self.host.host_id, shard_id, partial(self._open_connection_to_missing_shard, shard_id))


Huh ok, I was not aware that opening shard-aware vs non-shard-aware connections is so different in the driver.

@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch from 3d97ecd to 06f19e3 on July 3, 2025 23:06
Commit introduces two abstract classes:
1. `ShardConnectionBackoffPolicy` - a base class for policies that control
   the pace of shard connection creation
2. Auxiliary `ShardConnectionScheduler` - a scheduler that is instantiated
   by `ShardConnectionBackoffPolicy` at session initialization
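
A hedged sketch of the shape these two classes might take; only `schedule`'s signature appears verbatim in this PR, and the factory-method name is an assumption:

    from abc import ABC, abstractmethod
    from typing import Callable


    class ShardConnectionScheduler(ABC):
        @abstractmethod
        def schedule(
            self,
            host_id: str,
            shard_id: int,
            method: Callable[[], None],
        ) -> bool:
            # Schedule `method`, which opens a connection to (host_id, shard_id).
            # Returning False would signal a dropped request, e.g. a duplicate
            # attempt for a (host_id, shard_id) pair that is already pending.
            raise NotImplementedError


    class ShardConnectionBackoffPolicy(ABC):
        @abstractmethod
        def new_connection_scheduler(self, scheduler) -> 'ShardConnectionScheduler':
            # Instantiate a scheduler at session initialization
            # (method name assumed for this sketch).
            raise NotImplementedError
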
This policy is an implementation of ShardConnectionBackoffPolicy.
It implements the same behavior that the driver currently has:
1. No delay between creating shard connections
2. It avoids creating multiple connections to the same (host_id, shard_id)

This is required by the upcoming LimitedConcurrencyShardConnectionBackoffPolicy.
There is no reason to accept schedule requests when the cluster is shutting
down.

Add code that integrates ShardConnectionBackoffPolicy into:
1. Cluster
2. Session
3. HostConnection

Main idea is to put ShardConnectionBackoffPolicy in control of the
shard connection creation process.
Removing duplicate logic from HostConnection that tracks pending
connection creation requests.
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch from 06f19e3 to f71e7c9 on July 3, 2025 23:53
@dkropachev dkropachev requested a review from Lorak-mmk July 3, 2025 23:53
@dkropachev
Collaborator Author

It looks much better now, especially documentation-wise! It would be good to describe this new policy in docs/ if we want people to use it.

Done, added a section to docs/faq.rst.

Before merging it would be great to run some real-world scenario and see if new policy can help with cluster overload. Is that something that could be done with SCT?

There is no Python loader there, but we can emulate this issue locally; no need to run it in the cloud. The only difference is that to overload a real cluster you need far more connections.

@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 4 times, most recently from 41b5ea8 to 088053b on July 4, 2025 14:08
Comment on lines 1119 to 1125
try:
    return schedule, next(schedule)
except StopIteration:
    # A bit of trickery to avoid having lock around self.schedule
    schedule = self.backoff_policy.new_schedule()
    delay = next(schedule)
    self.schedule = schedule
    return schedule, delay


What is self.schedule? I see no field like this declared in the class, and it doesn't make conceptual sense (the function takes schedule as an argument, but in case of error sets it on a field).

Collaborator Author


forgot to clean up, thanks.

Comment on lines 1043 to 1056
class LimitedConcurrencyShardConnectionBackoffPolicy(ShardConnectionBackoffPolicy):
    """
    A shard connection backoff policy that allows only `max_concurrent` concurrent connections per `host_id`.

    For backoff calculation, it requires either a `cassandra.policies.ShardConnectionBackoffSchedule` or
    a `cassandra.policies.ReconnectionPolicy`, as both expose the same API.

    It spawns threads when there are pending requests, maximum number of threads is `max_concurrent` multiplied by nodes in the cluster.
    When thread is spawn it initiates backoff schedule, which is local for this thread.
    If there are no remaining requests for that `host_id`, thread is killed.

    This policy also prevents multiple pending or scheduled connections for the same (host, shard) pair;
    any duplicate attempts to schedule a connection are silently ignored.
    """


So this comment talks about concurrent connections and spawning threads. As far as I can tell, neither of those things is happening here.
The scheduler we are using here for opening connections has 1 thread, so there is no concurrency happening.
The class does not spawn threads anywhere, so I don't know where this comment comes from.

Collaborator Author


Got it, it is a bit confusing; I changed thread to worker and added more information to _ScopeBucket.
_Scheduler has one thread, but it does not run the scheduled code itself; it uses cluster.executor for that, which has 2 threads.

Comment on lines 1043 to 1056
class LimitedConcurrencyShardConnectionBackoffPolicy(ShardConnectionBackoffPolicy):
    """
    A shard connection backoff policy that allows only `max_concurrent` concurrent connections per `host_id`.

    For backoff calculation, it requires either a `cassandra.policies.ShardConnectionBackoffSchedule` or
    a `cassandra.policies.ReconnectionPolicy`, as both expose the same API.

    It spawns threads when there are pending requests, maximum number of threads is `max_concurrent` multiplied by nodes in the cluster.
    When thread is spawn it initiates backoff schedule, which is local for this thread.
    If there are no remaining requests for that `host_id`, thread is killed.

    This policy also prevents multiple pending or scheduled connections for the same (host, shard) pair;
    any duplicate attempts to schedule a connection are silently ignored.
    """


Actually it is a bit worrying to me that we are now using an executor thread for opening new connections.
It already has non-negligible work: handling events, control connections, schema fetches. This also causes all connection opening to be done serially.

How was it done before this PR? Was there a thread per connection? A thread per host? Just a single thread for everything?

Collaborator Author

@dkropachev dkropachev Jul 6, 2025


In this regard it used to be done in exactly the same way: all connection creation requests were handled by cluster.executor.
The only difference is that before, items were submitted to the executor right away; now they wait in the Scheduler queue according to the schedule.

Comment on lines 1127 to 1166
def _run(self, schedule: Iterator[float]):
    if self.is_shutdown:
        return

    with self.lock:
        try:
            request = self.items.pop(0)
        except IndexError:
            # Just in case
            if self.currently_pending > 0:
                self.currently_pending -= 1
            # When items are exhausted reset schedule to ensure that new items going to get another schedule
            # It is important for exponential policy
            return

    try:
        request()
    finally:
        schedule, delay = self._get_delay(schedule)
        self.scheduler.schedule(delay, self._run, schedule)

def schedule_new_connection(self, cb: Callable[[], None]):
    with self.lock:
        if self.is_shutdown:
            return
        self.items.append(cb)
        if self.currently_pending < self.max_concurrent:
            self.currently_pending += 1
            schedule = self.backoff_policy.new_schedule()
            delay = next(schedule)
            self.scheduler.schedule(delay, self._run, schedule)


Ok so if I understand correctly, the "concurrency" here is how many pending scheduler.schedule calls there can be. As far as I can tell, it doesn't do anything, since the executor is single-threaded.

Collaborator Author


Not exactly; it is how many instances of _run are running or scheduled.
I have just renamed it to _worker_body.

The executor is 2-threaded by default.
But even with a 1-threaded executor: while a connection is being created, yes, it could not run anything else; but once the connection is created and the worker waits, another instance of _worker_body can be handled, since it does not block the executor.


Ok, I understand those semantics, but I don't really understand how they are useful; what is the intended use case for this? This concurrency mostly means that the sleep times will be different (because there are many "workers"), which is more difficult to reason about than a different backoff_policy.

@roydahan
Collaborator

roydahan commented Jul 6, 2025

@dkropachev please share test results with and without this feature?

Sidenote: let's make sure we're focusing on the important things.

This policy is an implementation of `ShardConnectionBackoffPolicy`.
Its primary purpose is to prevent connection storms by imposing restrictions
on the number of concurrent pending connections per host and the backoff
time between connection attempts.
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch from 088053b to 9482d67 on July 6, 2025 17:14
@dkropachev dkropachev requested a review from Lorak-mmk July 6, 2025 17:14
Tests cover:
1. LimitedConcurrencyShardConnectionBackoffPolicy
2. NoDelayShardConnectionBackoffPolicy

For both the Scylla and Cassandra backends.
The sole goal of `ShardConnectionBackoffPolicy`'s existence is to fight
connection storms.
So, this commit adds a connection storms section to `docs/faq.rst`.
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch from 9482d67 to 0db57a7 on July 6, 2025 18:38

import os

from tests.integration import use_cluster, get_cluster, get_node, TestCluster


def setup_module():
    os.environ['SCYLLA_EXT_OPTS'] = "--smp 8"


This is gonna be problematic on GitHub Actions, I don't know if you'll have enough resources...

Even if it doesn't flat-out fail, it might make the tests in this module unstable.


Additionally, the test does not revert those changes after finishing, so it will affect other tests that run after it. It should save the previous value of this env var and restore it later.

        return _LimitedConcurrencyShardConnectionScheduler(scheduler, self.backoff_policy, self.max_concurrent)


class _ScopeBucket:


nitpick: the underscore kind of suggests we might use __all__ to show the classes that are public from this module,
as https://peps.python.org/pep-0008/#public-and-internal-interfaces indicates


@fruch fruch left a comment


LGTM

The only concern is the smp=8 on integration tests, which might introduce test instability.

Successfully merging this pull request may close these issues.

Delay for per-shard reconnection