Skip to content
This repository was archived by the owner on May 15, 2025. It is now read-only.

Prefix Aware Scorer #48

Merged
merged 8 commits into from
May 5, 2025
Merged

Prefix Aware Scorer #48

merged 8 commits into from
May 5, 2025

Conversation

oglok
Copy link

@oglok oglok commented Apr 23, 2025

Co-authored with @vMaroon

This PR implements:

  • PrefixStore with an LRU cache-map based data-structure
  • Prefix aware scorer
  • Unit tests

@oglok oglok changed the title WIP: Prefix Aware Scorer Prefix Aware Scorer Apr 23, 2025
@rootfs rootfs requested a review from Copilot April 23, 2025 12:55
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a new prefix store based on a radix tree and a prefix-aware scorer to route requests more efficiently. Key changes include:

  • Adding a PrefixStore and its configuration to scheduler configuration with environmental overrides.
  • Implementing a new PrefixAwareScorer and integrating it into the scheduler.
  • Providing unit tests for the new store and scorer components.

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/epp/scheduling/types/types.go Added a Prompt field to the LLMRequest structure.
pkg/epp/scheduling/scheduler.go Introduced prefix store configuration and integrated the new scorer.
pkg/epp/scheduling/prefix_store.go Implements the prefix store with entry addition, lookup, eviction, and maintenance.
pkg/epp/scheduling/prefix_store_test.go Unit tests for basic prefix operations, constraints, TTL expiration, max entries, and maintenance routine.
pkg/epp/scheduling/prefix_aware_scorer.go Defines the scorer which uses the prefix store to score pods.
pkg/epp/scheduling/prefix_aware_scorer_test.go Tests behavior of the new scorer with various input scenarios.
Files not reviewed (1)
  • go.mod: Language not supported

Comment on lines 118 to 120
// Update LastUsed time for the matched entry
entry.LastUsed = time.Now()
ps.tree.Insert(matchedPrefix, entry)
Copy link
Preview

Copilot AI Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying 'entry.LastUsed' under a read lock (RLock) in FindPodForPrefix may lead to a race condition. Consider acquiring the write lock (Lock) when updating mutable state or restructuring the update to ensure thread safety.

Suggested change
// Update LastUsed time for the matched entry
entry.LastUsed = time.Now()
ps.tree.Insert(matchedPrefix, entry)
// Update LastUsed time for the matched entry
ps.mu.RUnlock() // Release the read lock before acquiring the write lock
ps.mu.Lock()
entry.LastUsed = time.Now()
ps.tree.Insert(matchedPrefix, entry)
ps.mu.Unlock()

Copilot uses AI. Check for mistakes.

@@ -43,6 +48,10 @@ const (
defaultQueueThresholdCritical = 5
defaultQueueingThresholdLoRA = 128
defaultLoraAffinityThreshold = 0.999
defaultPrefixStoreMaxEntries = 1000
defaultPrefixStoreMinLen = 3
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably too small :D

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops! testing defaults xD

defaultPrefixStoreMaxEntries = 1000
defaultPrefixStoreMinLen = 3
defaultPrefixStoreMaxLen = 100
defaultPrefixStoreTTLHours = 24
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minute instead or hour?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure what should be a reasonable TTL for LLM inferencing.

@vMaroon
Copy link
Member

vMaroon commented Apr 23, 2025

The dev branch was rebased on upstream main, which includes the new and different Scorer interface. The Scorer pulled from there defines a Score function that works per-pod and not group of pods. We need to discuss whether to propose changing that but it makes sense to first adapt to it.

I'll open a KVCacheAwareScorer PR soon and we can sync.

@oglok oglok force-pushed the prefix_scorer branch 2 times, most recently from 7383242 to 037fadd Compare April 26, 2025 09:30
// between the request's prompt and stored prefixes. The score is normalized between 0 and 1,
// where 1 represents the longest matching prefix.
type PrefixAwareScorer struct {
weight float64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, weights are extracted to the layer above.

}

// NewPrefixAwareScorer creates a new PrefixAwareScorer with the given weight and prefix store
func NewPrefixAwareScorer(weight float64, prefixStore *PrefixStore) Scorer {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for receiving (instead of creating internally) the prefixStore?
Seems like this would be an internal implementation decision?
If so, PrefixStore is likely not an exported type.

if !found {
logger.V(logging.DEBUG).Info("No matching prefix found, returning zero scores for all pods")
// If no matching prefix found, return zero scores for all pods
for i, pod := range pods {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider moving this up to the function initialization?
That way, you'd be assured that you start with 0s for all Pods and would not need to repeat it multiple times.


// PrefixEntry represents a single entry in the prefix store
type PrefixEntry struct {
PodRef types.NamespacedName
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: is this how defined in PodMetrics.Pod or would a simple string suffice?

)

// PrefixEntry represents a single entry in the prefix store
type PrefixEntry struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: should the struct and its fields be exported?
On first glance seems that it should be internal use.

errutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/error"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to have a high level description of data structure and algorithms, and tie those to the processing steps of the LLM request (e.g., consulted in read only or also updated on prompt request, are responses handed, ...)

if entry.PodRef == pod && entry.ModelName == modelName {
logger.V(logging.DEBUG).Info("Updating existing entry", "prefix", prefix, "pod", pod.String())
entry.LastUsed = time.Now()
ps.tree.Insert(prefix, entry)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this allow multiple hits on a prefix (e.g., both pods A and B hold the prefix) or just one (last writer wins)?
I think we would want to have multiple (e.g., in case other scorers score these Pods differently).


// Check total entries limit
if ps.tree.Len() >= ps.config.MaxEntries {
logger.V(logging.DEBUG).Info("Store at capacity, evicting oldest entry", "currentSize", ps.tree.Len(), "maxSize", ps.config.MaxEntries)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: would any entries "below" the evicted node become dangling following the eviction?

if len(prefix) < ps.config.MinPrefixLen {
logger.V(logging.DEBUG).Info("Prefix too short", "prefix", prefix, "minLength", ps.config.MinPrefixLen)
return types.NamespacedName{}, false
}

if len(prefix) > ps.config.MaxPrefixLen {
logger.V(logging.DEBUG).Info("Truncating prefix", "originalLength", len(prefix), "maxLength", ps.config.MaxPrefixLen)
prefix = prefix[:ps.config.MaxPrefixLen]
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider extracting to an isValidPrompt function that can be called from multiple places

}

// Use LongestPrefix to find the best match
matchedPrefix, val, found := ps.tree.LongestPrefix(prefix)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation seems to score only the longest prefix and not any of the other pods?

@elevran
Copy link
Collaborator

elevran commented Apr 27, 2025

@oglok would be great if you can rebase off latest dev for easier review. Most of the 78 modified files are coming in from different PRs and "polluting" your PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there downsides to having this feature (including scorer, store and tests) in its own directory?
That would minimize the changes in the existing code base

}
}

// ScoreTargets does the actual scoring of the target pods by the session affinity.
Copy link
Collaborator

@elevran elevran Apr 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear where/when the store is updated with new prompt requests and replies.
Is this currently handled?

@oglok
Copy link
Author

oglok commented Apr 28, 2025

Hey @elevran ! thanks for reviewing the PR. I'm working on the rebase, and rework my code because apparently the scorers interface merged upstream has changed. I'll get your comments in while doing that!

@oglok oglok force-pushed the prefix_scorer branch 2 times, most recently from b3a2149 to 815a432 Compare April 30, 2025 15:41
@vMaroon
Copy link
Member

vMaroon commented May 4, 2025

This PR #118 is based on this work. Feel free to rebase on it so that we can merge both or move it here and merge here. It preserves your contributions.

@oglok oglok force-pushed the prefix_scorer branch from 815a432 to c5fd826 Compare May 5, 2025 08:13
@oglok
Copy link
Author

oglok commented May 5, 2025

I've added @vMaroon 's work here too and changed last commit message.

@oglok oglok force-pushed the prefix_scorer branch 2 times, most recently from 0edd1eb to f3c00aa Compare May 5, 2025 08:16
@oglok oglok force-pushed the prefix_scorer branch from 57c119d to a0e02c0 Compare May 5, 2025 16:39
@vMaroon vMaroon merged commit 09f7448 into neuralmagic:dev May 5, 2025
1 check passed
@oglok oglok mentioned this pull request May 7, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants