Faster shard scaling #5679

Merged
merged 5 commits into from
Mar 20, 2025
rdettai
Collaborator

@rdettai rdettai commented Feb 13, 2025

Description

Indexers return many 429 responses when ingest rates fluctuate, even when the overall indexing capacity (~5MB/core) is not reached. This PR makes some small parameter adjustments to decrease the overall error rate on typical workloads. The structural changes needed to improve the routing strategy will be the topic of a separate issue/PR.

  • Customizable shard burst limit, with the default increased to 50MB: helps absorb the load until the next scale-up
  • Shard scaling factor: helps keep up with load fluctuations on workloads with a high number of shards (e.g. a +20MB/s swing is more likely on an index where the average ingest rate is 100MB/s than on an index where the average rate is 1MB/s)
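
As a rough illustration of the two parameters above, here is a minimal sketch. It assumes a token-bucket-like model in which a shard accepts a sustained rate plus a burst budget, and it assumes the scaling factor multiplies the current open shard count; neither is confirmed by this PR description. The function names, the 5MB/s sustained rate and the 1.5 factor are hypothetical.

```rust
/// Seconds a shard can absorb an ingest rate above its sustained rate
/// before its burst budget runs out (assumed token-bucket-like model).
/// Returns `None` when the ingest rate never exceeds the sustained rate.
fn seconds_until_throttled(
    burst_limit_mb: f64,
    ingest_rate_mb_per_sec: f64,
    sustained_rate_mb_per_sec: f64, // hypothetical per-shard limit
) -> Option<f64> {
    let excess = ingest_rate_mb_per_sec - sustained_rate_mb_per_sec;
    if excess <= 0.0 {
        None
    } else {
        Some(burst_limit_mb / excess)
    }
}

/// Target shard count after one scale-up, assuming (hypothetically) that
/// the scaling factor multiplies the current number of open shards.
fn scaled_shard_count(num_open_shards: usize, scale_up_factor: f64) -> usize {
    (num_open_shards as f64 * scale_up_factor).ceil() as usize
}
```

Under these assumptions, a 50MB burst budget absorbs a 15MB/s ingest rate on a 5MB/s shard for 5 seconds, and a factor of 1.5 takes an index from 20 to 30 shards in one step but from 1 to only 2, so the step size grows with the size of the workload.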

Related to #5270

How was this PR tested?

Adhoc python load testing framework: https://github.com/rdettai/qw-ingest-tests

@rdettai rdettai self-assigned this Feb 13, 2025
@rdettai
Collaborator Author

rdettai commented Feb 13, 2025

Test setup:

  • single node cluster
  • apply a continuous load on a new index
  • correlate shard counts with 429 throttling

Observations:

  • there is a high latency (5+s) between a scaling action and the new entry being added to the router
    • shard_burst_limit helps absorb part of this delay, but even with a high burst limit (100MB), a fairly low ingest rate (20MB/s) runs into throttling before the router learns about the new shards
    • when scaling, during the interval in which the new shards are created but not yet added to the router, the calculated load factor is artificially low: the new shards are already counted in the capacity calculation but not in the load calculation (the requests that could have gone to these shards are still rejected with 429). This sometimes delays the next scale-up action.
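
The load factor effect described in the last observation can be sketched as follows. This assumes the load factor is the load actually served divided by the total capacity of the shards counted by the controller; the exact formula is not given in this thread, and the function name and the 5MB/s per-shard capacity are hypothetical.

```rust
/// Load factor as served load over counted capacity (assumed definition).
/// `counted_shards` includes shards that exist but are not yet routable.
fn load_factor(
    served_load_mb_per_sec: f64,
    counted_shards: usize,
    per_shard_capacity_mb_per_sec: f64, // hypothetical value
) -> f64 {
    served_load_mb_per_sec / (counted_shards as f64 * per_shard_capacity_mb_per_sec)
}
```

With 2 saturated shards serving 10MB/s, `load_factor(10.0, 2, 5.0)` is 1.0. Once 2 more shards are created but not yet in the router, the served load is still 10MB/s (the excess is rejected with 429) while 4 shards are counted, so `load_factor(10.0, 4, 5.0)` drops to 0.5 even though clients are still being throttled.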

@rdettai rdettai force-pushed the faster-shard-scaling branch from af3a911 to 3bbfa9e Compare February 13, 2025 19:47
@rdettai rdettai marked this pull request as ready for review February 17, 2025 11:01
@rdettai rdettai requested a review from guilload February 17, 2025 11:01
@rdettai rdettai force-pushed the faster-shard-scaling branch 2 times, most recently from 33e28fe to 4d93d07 Compare February 24, 2025 14:27
@rdettai rdettai requested a review from guilload February 25, 2025 08:36
fn long_term_scale_up_threshold_max_shards(&self, shard_stats: ShardStats) -> usize {
    (shard_stats.avg_long_term_ingestion_rate * shard_stats.num_open_shards as f32
        / self.scale_up_shards_long_term_threshold_mib_per_sec)
        .floor() as usize
Collaborator

@fulmicoton-dd fulmicoton-dd Mar 18, 2025

I'm not sure about the floor? I would have expected ceil here.

Collaborator Author

With a ceil, scaling up by this number of shards would bring the per-shard rate slightly below scale_up_shards_long_term_threshold_mib_per_sec. For small shard counts, that might bring it a bit too close to scale_down_shards_threshold_mib_per_sec.
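
To make the floor/ceil trade-off concrete, here is a small standalone sketch of the same arithmetic as the reviewed function, written as free functions. The helper names and the sample values (10 shards at 4.5MiB/s each, a 4MiB/s threshold) are hypothetical and are not the project's defaults.

```rust
/// Same expression as in the diff: the largest shard count that keeps the
/// per-shard long-term rate at or above the threshold.
fn max_shards_floor(avg_rate_mib_per_sec: f32, num_open_shards: usize, threshold_mib_per_sec: f32) -> usize {
    (avg_rate_mib_per_sec * num_open_shards as f32 / threshold_mib_per_sec).floor() as usize
}

/// The `ceil` alternative raised in the review, for comparison only.
fn max_shards_ceil(avg_rate_mib_per_sec: f32, num_open_shards: usize, threshold_mib_per_sec: f32) -> usize {
    (avg_rate_mib_per_sec * num_open_shards as f32 / threshold_mib_per_sec).ceil() as usize
}

/// Per-shard rate once the total load is spread over `num_shards`.
fn per_shard_rate(total_rate_mib_per_sec: f32, num_shards: usize) -> f32 {
    total_rate_mib_per_sec / num_shards as f32
}
```

With a total of 45MiB/s and a 4MiB/s threshold, the ratio is 11.25: floor gives 11 shards (about 4.09MiB/s per shard, still above the threshold), while ceil gives 12 shards (3.75MiB/s per shard, below it).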

@rdettai rdettai force-pushed the faster-shard-scaling branch from f185d6a to d74e517 Compare March 20, 2025 09:03
@rdettai rdettai enabled auto-merge (squash) March 20, 2025 09:03
@rdettai rdettai merged commit f10654a into main Mar 20, 2025
8 checks passed
@rdettai rdettai deleted the faster-shard-scaling branch March 20, 2025 09:18