Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adaptive lookback for monovertex #2373

Merged
merged 11 commits into from
Feb 1, 2025
Merged

Conversation

kohlisid
Copy link
Contributor

@kohlisid kohlisid commented Jan 28, 2025

This build on top of our current rater to derive relevant information and calculate the required lookback window

This would encapsulate the lookback for two scenarios

  1. Slow processing vertex
  2. Slow data source - data arrives after long intervals

lookbackSeconds - How many seconds to lookback for vertex average processing rate (tps) and pending messages calculation, defaults to 120. Rate and pending messages metrics are critical for autoscaling, you might need to tune this parameter a bit to see better results. For example, your data source only have 1 minute data input in every 5 minutes, and you don't want the vertices to be scaled down to 0. In this case, you need to increase lookbackSeconds to overlap 5 minutes, so that the calculated average rate and pending messages won't be 0 during the silent period, in order to prevent from scaling down to 0.

https://numaflow.numaproj.io/user-guide/reference/autoscaling/#numaflow-autoscaling

Follow up work

Operational Flow:
Data Entry: Pods report their processed message counts periodically, which are saved into a TimestampedCounts structure and pushed onto a queue.

Lookback Adjustment Process:

  1. The system periodically triggers a routine to review recent data entries.
  2. The routine calculates the maximum duration for which any pod's message count remains unchanged, using a function called CalculateMaxLookback.
  3. Based on this calculation, another function, updateDynamicLookbackSecs, decides if the current lookback period needs adjustment to better align with observed pod activity.

When the value for a pod metric changes, new data is read

  1. This occurs when the source has new data
  2. The processing has completed, allowing more data to be read.

Signed-off-by: Sidhant Kohli <[email protected]>
@kohlisid kohlisid requested a review from KeranYang January 29, 2025 17:47
@kohlisid kohlisid marked this pull request as ready for review January 29, 2025 18:42
Copy link

codecov bot commented Jan 29, 2025

Codecov Report

Attention: Patch coverage is 62.96296% with 40 lines in your changes missing coverage. Please review.

Project coverage is 69.68%. Comparing base (8e9bafb) to head (058417e).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/mvtxdaemon/server/service/rater/rater.go 31.03% 40 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2373      +/-   ##
==========================================
- Coverage   69.84%   69.68%   -0.16%     
==========================================
  Files         361      361              
  Lines       49935    50040     +105     
==========================================
- Hits        34878    34872       -6     
- Misses      13979    14095     +116     
+ Partials     1078     1073       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Sidhant Kohli <[email protected]>
Signed-off-by: Sidhant Kohli <[email protected]>
@kohlisid kohlisid requested a review from whynowy January 30, 2025 01:53
Signed-off-by: Sidhant Kohli <[email protected]>
Signed-off-by: Sidhant Kohli <[email protected]>
@kohlisid kohlisid requested a review from KeranYang January 31, 2025 01:16
@yhl25 yhl25 added this to the 1.5 milestone Jan 31, 2025
Copy link
Member

@whynowy whynowy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KeranYang - please review.

Comment on lines 138 to 144
lastSeen := make(map[string]struct {
count float64
seenTime int64
})

// Map to store the maximum duration for which the value of any pod was unchanged.
maxDuration := make(map[string]int64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was to make the code easier to understand by changing the method to something like below

maxUnchangedDuration = make(map[string]int64)

for pod in pods:
   maxUnchangedDuration[podName] = calculateMaxUnchangedDurationForPod(pod)
  
globalMaxSecs =    maxUnchangedDuration.theMaxDuration().

return globalMaxSecs

The calculateMaxUnchangedDurationForPod method doesn't need to maintain pod name to lastSeen/maxDuration mapping as we do right now.

@kohlisid kohlisid requested a review from KeranYang February 1, 2025 08:58
Signed-off-by: Sidhant Kohli <[email protected]>
@vigith vigith merged commit afc16ac into numaproj:main Feb 1, 2025
25 checks passed
@kohlisid kohlisid deleted the new-adapt branch February 3, 2025 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants