feat: adaptive lookback for monovertex #2373

kohlisid · 2025-01-28T21:51:09Z

This builds on top of our current rater to derive the relevant information and calculate the required lookback window.

The adaptive lookback covers two scenarios:

- Slow processing vertex
- Slow data source, where data arrives after long intervals

lookbackSeconds - How many seconds to look back for the vertex average processing rate (tps) and pending messages calculation; defaults to 120. The rate and pending messages metrics are critical for autoscaling, so you might need to tune this parameter to get better results. For example, if your data source only has 1 minute of data input in every 5 minutes and you don't want the vertices to be scaled down to 0, you need to increase lookbackSeconds to cover the full 5 minutes, so that the calculated average rate and pending messages won't be 0 during the silent period, which prevents scaling down to 0.

https://numaflow.numaproj.io/user-guide/reference/autoscaling/#numaflow-autoscaling

Follow-up work:

- Move the pending calculation and the lag reader to the daemon server
- Numaflow should call Source::Pending only for replica-id 0 #2274
- Test and confirm functionality with async data movement

Operational Flow:
Data Entry: Pods report their processed message counts periodically, which are saved into a TimestampedCounts structure and pushed onto a queue.

Lookback Adjustment Process:

The system periodically triggers a routine to review recent data entries.
The routine calculates the maximum duration for which any pod's message count remains unchanged, using a function called CalculateMaxLookback.
Based on this calculation, another function, updateDynamicLookbackSecs, decides if the current lookback period needs adjustment to better align with observed pod activity.

A change in a pod's metric value means that new data has been read. This happens in one of two cases:

- The source has new data.
- Processing has completed, allowing more data to be read.
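As a rough illustration of the adjustment step, here is a minimal Go sketch of how a lookback window could be derived from the maximum unchanged duration. The rounding to whole minutes, the 600-second cap, and the names `nextLookback`, `stepSecs`, and `maxLookbackSecs` are assumptions made for this example, not taken from the PR; the actual `updateDynamicLookbackSecs` may apply a different policy.

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative bounds; both values are assumptions for this sketch.
const (
	stepSecs        = 60.0  // round the window up to whole minutes
	maxLookbackSecs = 600.0 // hard upper bound on the window
)

// nextLookback returns the lookback window (in seconds) to use next.
// userSpecified is the configured lookbackSeconds, and maxUnchangedSecs is
// the longest observed span during which a pod's count did not change.
func nextLookback(userSpecified, maxUnchangedSecs float64) float64 {
	// Round up so that small fluctuations map to the same window.
	rounded := stepSecs * math.Ceil(maxUnchangedSecs/stepSecs)
	// Never go below the user's setting and never above the cap.
	return math.Min(math.Max(rounded, userSpecified), maxLookbackSecs)
}

func main() {
	current := 120.0
	for _, unchanged := range []float64{30, 130, 290, 5000} {
		next := nextLookback(120, unchanged)
		if next != current {
			fmt.Printf("unchanged=%.0fs: lookback %.0fs -> %.0fs\n", unchanged, current, next)
			current = next
		}
	}
}
```

Clamping to the user-specified value keeps the window from shrinking below what was explicitly configured, while the cap bounds how much history the rater has to scan.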

Signed-off-by: Sidhant Kohli <[email protected]>

codecov · 2025-01-29T18:49:31Z

Codecov Report

Attention: Patch coverage is 62.96296% with 40 lines in your changes missing coverage. Please review.

Project coverage is 69.68%. Comparing base (8e9bafb) to head (058417e).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/mvtxdaemon/server/service/rater/rater.go | 31.03% | 40 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2373      +/-   ##
==========================================
- Coverage   69.84%   69.68%   -0.16%     
==========================================
  Files         361      361              
  Lines       49935    50040     +105     
==========================================
- Hits        34878    34872       -6     
- Misses      13979    14095     +116     
+ Partials     1078     1073       -5


whynowy

@KeranYang - please review.


KeranYang · 2025-01-31T16:06:04Z

pkg/mvtxdaemon/server/service/rater/helper.go

+	lastSeen := make(map[string]struct {
+		count    float64
+		seenTime int64
+	})
+
+	// Map to store the maximum duration for which the value of any pod was unchanged.
+	maxDuration := make(map[string]int64)


My point was to make the code easier to understand by changing the method to something like below

    maxUnchangedDuration = make(map[string]int64)
    for pod in pods:
        maxUnchangedDuration[podName] = calculateMaxUnchangedDurationForPod(pod)
    globalMaxSecs = maxUnchangedDuration.theMaxDuration()
    return globalMaxSecs

The calculateMaxUnchangedDurationForPod method doesn't need to maintain pod name to lastSeen/maxDuration mapping as we do right now.
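For illustration, the suggested per-pod structure could look roughly like the Go sketch below. The `TimestampedCounts` type here is a simplified stand-in with assumed field names, and `calculateMaxLookback` and the exact way an unchanged run is measured (from the first snapshot with a value up to the snapshot where a different value is observed) are assumptions of this example, not the code merged in the PR.

```go
package main

import "fmt"

// TimestampedCounts is a simplified stand-in for the rater's structure of the
// same name: one snapshot of per-pod processed counts (fields are assumed).
type TimestampedCounts struct {
	Timestamp int64              // Unix seconds when the snapshot was taken
	PodCounts map[string]float64 // pod name -> processed message count
}

// calculateMaxUnchangedDurationForPod returns the longest span, in seconds,
// over which the count for one pod stayed the same. Snapshots must be sorted
// by ascending timestamp. No pod-name-keyed maps are needed here.
func calculateMaxUnchangedDurationForPod(snapshots []TimestampedCounts, pod string) int64 {
	var (
		seen      bool
		lastCount float64
		runStart  int64 // time at which lastCount was first observed
		maxSecs   int64
	)
	for _, s := range snapshots {
		count, ok := s.PodCounts[pod]
		if !ok {
			continue // pod did not report in this snapshot
		}
		if !seen {
			seen, lastCount, runStart = true, count, s.Timestamp
			continue
		}
		// The previous value was still in effect until this snapshot.
		if d := s.Timestamp - runStart; d > maxSecs {
			maxSecs = d
		}
		if count != lastCount {
			lastCount, runStart = count, s.Timestamp // start a new run
		}
	}
	return maxSecs
}

// calculateMaxLookback returns the largest per-pod unchanged duration.
func calculateMaxLookback(snapshots []TimestampedCounts) int64 {
	pods := make(map[string]struct{})
	for _, s := range snapshots {
		for pod := range s.PodCounts {
			pods[pod] = struct{}{}
		}
	}
	var globalMaxSecs int64
	for pod := range pods {
		if d := calculateMaxUnchangedDurationForPod(snapshots, pod); d > globalMaxSecs {
			globalMaxSecs = d
		}
	}
	return globalMaxSecs
}

func main() {
	snapshots := []TimestampedCounts{
		{Timestamp: 0, PodCounts: map[string]float64{"a": 10, "b": 5}},
		{Timestamp: 10, PodCounts: map[string]float64{"a": 10, "b": 8}},
		{Timestamp: 20, PodCounts: map[string]float64{"a": 10, "b": 8}},
		{Timestamp: 30, PodCounts: map[string]float64{"a": 25, "b": 9}},
	}
	fmt.Println(calculateMaxLookback(snapshots)) // prints 30 (pod "a" was unchanged from t=0 to t=30)
}
```

The trade-off is one pass over the snapshots per pod instead of a single pass with shared maps, in exchange for a helper that only tracks the state of one pod.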


feat: adaptive lookback

8bf5821

Signed-off-by: Sidhant Kohli <[email protected]>

kohlisid requested a review from KeranYang January 29, 2025 17:47

kohlisid marked this pull request as ready for review January 29, 2025 18:42

kohlisid requested review from whynowy and vigith as code owners January 29, 2025 18:42

Merge branch 'main' into new-adapt

b17a35b

whynowy reviewed Jan 29, 2025

pkg/mvtxdaemon/server/service/rater/rater.go (outdated, resolved)

comments

ef6e87b

Signed-off-by: Sidhant Kohli <[email protected]>

whynowy reviewed Jan 29, 2025

pkg/mvtxdaemon/server/service/rater/rater.go (resolved)

pkg/mvtxdaemon/server/service/rater/helper.go (outdated, resolved)

pkg/mvtxdaemon/server/service/rater/helper.go (outdated, resolved)

chore: clean

928672c

Signed-off-by: Sidhant Kohli <[email protected]>

kohlisid requested a review from whynowy January 30, 2025 01:53

KeranYang reviewed Jan 30, 2025


kohlisid added 2 commits January 30, 2025 15:51

chore: comments

ee555ec

Signed-off-by: Sidhant Kohli <[email protected]>

comments

ea8d5c0

Signed-off-by: Sidhant Kohli <[email protected]>

kohlisid requested a review from KeranYang January 31, 2025 01:16

yhl25 assigned kohlisid Jan 31, 2025

yhl25 added this to the 1.5 milestone Jan 31, 2025

Merge branch 'main' into new-adapt

ac5117e

whynowy approved these changes Jan 31, 2025

docs/user-guide/reference/autoscaling.md (outdated, resolved)

KeranYang reviewed Jan 31, 2025


kohlisid added 3 commits February 1, 2025 00:57

comments

cb049e2

Signed-off-by: Sidhant Kohli <[email protected]>

fix comment

8685002

Signed-off-by: Sidhant Kohli <[email protected]>

Merge branch 'main' into new-adapt

1ee9a93

kohlisid requested a review from KeranYang February 1, 2025 08:58

fix comment

058417e

Signed-off-by: Sidhant Kohli <[email protected]>

KeranYang approved these changes Feb 1, 2025


vigith merged commit afc16ac into numaproj:main Feb 1, 2025
25 checks passed

kohlisid deleted the new-adapt branch February 3, 2025 06:31
