
use current node load estimation when placing search jobs#6390

Open
trinity-1686a wants to merge 8 commits into main from trinity.pointard/placer-consider-load

@trinity-1686a
Contributor

@trinity-1686a trinity-1686a commented May 6, 2026

When placing jobs, first query all nodes for their current load, then bias placement toward less loaded nodes to even out load across the cluster.
This addresses a problem where some nodes are overloaded while others are underloaded, causing queueing even though not all nodes are at max capacity.
In testing, this slightly improved p50 and higher latency percentiles under constant light load, and increased the max QPS reached before latency explodes. Metrics also showed all searchers busy, where previously some were only working part of the time.
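
To make the idea concrete, here is a minimal sketch of load-biased placement. It is not the code in this PR: `NodeLoad`, `place_jobs`, and the greedy "lowest projected load wins" rule are hypothetical, and the sketch ignores any split-to-node affinity the real placer may apply.

```rust
/// Hypothetical view of a searcher node; not a type from this PR.
#[derive(Debug, Clone)]
pub struct NodeLoad {
    pub node_id: String,
    /// Load reported by the node, in the same unit as job costs.
    pub reported_load: usize,
}

/// Assigns each job to the node with the lowest projected load, where the
/// projected load is the reported load plus the cost of jobs already
/// assigned in this round. Returns, for each job, the index of its node.
/// Ties go to the node that appears first in `nodes`.
pub fn place_jobs(nodes: &[NodeLoad], job_costs: &[usize]) -> Vec<usize> {
    let mut projected: Vec<usize> = nodes.iter().map(|n| n.reported_load).collect();
    let mut assignment = Vec::with_capacity(job_costs.len());
    if projected.is_empty() {
        // No candidate node: nothing can be placed.
        return assignment;
    }
    for &cost in job_costs {
        let (best, _) = projected
            .iter()
            .enumerate()
            .min_by_key(|&(_, load)| *load)
            .expect("projected is non-empty");
        projected[best] += cost;
        assignment.push(best);
    }
    assignment
}
```

With reported loads of 10, 0, and 5 and three jobs of cost 4, the first two jobs go to the idle node and the third goes to the node that started at 5, so the busiest node receives nothing.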

future improvements:

  • we could debounce calls to GetLoad
  • fetch_docs could reuse the same searcher that was used in the leaf_search phase; this would guarantee a footer-cache hit and respect pre-existing load without needing more GetLoad calls on the critical path
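
For the first point, one possible reading of "debounce" is a short-lived cache in front of GetLoad, sketched below. This is an assumption about how it could be done, not something implemented here: `CachedLoad`, its `ttl`, and the `fetch` closure standing in for the GetLoad call are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-node cache of the last reported load.
pub struct CachedLoad {
    ttl: Duration,
    last: Option<(Instant, usize)>,
}

impl CachedLoad {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, last: None }
    }

    /// Returns the cached load if it is younger than `ttl`; otherwise calls
    /// `fetch` (a stand-in for the GetLoad request) and stores the result.
    /// `now` is passed in so the behavior is easy to test.
    pub fn get_or_fetch(&mut self, now: Instant, fetch: impl FnOnce() -> usize) -> usize {
        if let Some((fetched_at, load)) = self.last {
            if now.duration_since(fetched_at) < self.ttl {
                return load;
            }
        }
        let load = fetch();
        self.last = Some((now, load));
        load
    }
}
```

The trade-off is staleness: a longer `ttl` means fewer calls on the critical path but a less accurate picture of the current load.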

fn compute_split_cost(split_metadata: &SplitMetadata) -> usize {
pub(crate) fn compute_split_cost(num_docs: u64) -> usize {
// TODO this formula could be tuned a lot more. The general idea is that there is a fixed
// cost to searching a split, plus a somewhat-linear cost depending on the size of the split
Collaborator


This should take into account whether the query has aggregations, or ideally use a cost derived from the query itself.
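
As a rough illustration of that suggestion (not a proposal for the exact formula), the cost could take a query summary alongside the document count. `QueryShape`, `compute_split_cost_with_query`, and every constant below are hypothetical placeholders that would need tuning against real workloads.

```rust
/// Hypothetical summary of what a query asks for; not a type from this PR.
pub struct QueryShape {
    pub has_aggregations: bool,
}

// Placeholder constants, chosen only to make the example readable.
const FIXED_COST: usize = 5;
const DOCS_PER_COST_UNIT: u64 = 1_000_000;
const AGGREGATION_MULTIPLIER: usize = 2;

/// Fixed cost per split, plus a linear term in the number of documents,
/// scaled up when the query carries aggregations.
pub fn compute_split_cost_with_query(num_docs: u64, query: &QueryShape) -> usize {
    let size_cost = (num_docs / DOCS_PER_COST_UNIT) as usize;
    let base = FIXED_COST + size_cost;
    if query.has_aggregations {
        base * AGGREGATION_MULTIPLIER
    } else {
        base
    }
}
```

A multiplier is the simplest choice here; a cost computed from the query AST would be more precise but needs more plumbing.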

// Seed each candidate node with its current load so the placer avoids
// routing work to already-loaded nodes. If a node fails to report its
// load (error or timeout), `load` stays `None`: we still route work
// there if all other nodes are overloaded, but we prefer reachable
Collaborator

@PSeitz PSeitz May 11, 2026


Maybe we should return an error instead when every node is either unreachable or overloaded.
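
A minimal sketch of what that alternative could look like, assuming a per-node `Option<usize>` load and some overload threshold. `PlacementError`, `pick_node`, and `overload_threshold` are hypothetical names, not part of this PR.

```rust
/// Hypothetical error type for the "nothing usable" case.
#[derive(Debug, PartialEq, Eq)]
pub enum PlacementError {
    NoNodeAvailable,
}

/// `loads[i]` is `None` when node `i` failed to report its load.
/// Picks the least loaded node that is reachable and below the threshold,
/// and returns an error instead of falling back to an unhealthy node.
pub fn pick_node(
    loads: &[Option<usize>],
    overload_threshold: usize,
) -> Result<usize, PlacementError> {
    loads
        .iter()
        .enumerate()
        // Drop nodes that did not report a load.
        .filter_map(|(idx, load)| load.map(|l| (idx, l)))
        // Drop nodes at or above the overload threshold.
        .filter(|&(_, l)| l < overload_threshold)
        .min_by_key(|&(_, l)| l)
        .map(|(idx, _)| idx)
        .ok_or(PlacementError::NoNodeAvailable)
}
```

Failing fast like this gives the caller a clear signal, at the cost of rejecting queries that the current fallback would still attempt.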


2 participants