Fix ML calendar event update scalability issues #136886

Conversation
- Refactor JobManager.submitJobEventUpdate() to bypass the UpdateJobProcessNotifier queue
- Use RefCountingListener for parallel calendar/filter updates
- Add comprehensive logging throughout the system
- Create CalendarScalabilityIT integration tests
- Add helper methods to the base test class

Fixes an issue where calendar events failed to update some jobs when associated with large numbers of jobs (>1000), due to queue capacity limits and sequential processing.
Hi @valeriy42, I've created a changelog YAML for you.
Commits:
- …g to API calls and processing job updates asynchronously in the background.
- …thub.com/valeriy42/elasticsearch into bugfix/limited-update-notification-queue
- …e handling in JobManager to include skipped updates. Update logging to reflect skipped updates during background calendar processing.
- …hods and updating job creation visibility. Enhance ScheduledEventsIT to verify asynchronous calendar updates and add a plugin for tracking UpdateProcessAction calls.
- …the updated logging package. This change improves consistency and aligns with recent codebase updates.
Pinging @elastic/ml-core (Team:ML)
…thub.com/valeriy42/elasticsearch into bugfix/limited-update-notification-queue
```java
updateListener.onResponse(Boolean.TRUE);

private boolean isExpectedFailure(Exception e) {
    // Job deleted, closed, etc. - not real errors
    return ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException || e.getMessage().contains("is not open");
}
```
Would it be safer to be more explicit with this contains() check, to prevent an error with a similar message getting ignored when it shouldn't be? I think the full error message should be "Cannot perform requested action because job [" + jobId + "] is not open", so maybe that's what we should check? You could even extract the code in TransportJobTaskAction.doExecute() that creates the error message into a static method and call that here, so that this check is guaranteed to always have the correct string.
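The extraction suggested here could look roughly like this. It is a sketch only: the class and method names are hypothetical, chosen to illustrate the idea of a single source of truth for the message shared by the code that throws and the code that checks.

```java
// Hypothetical sketch (names are illustrative, not the actual Elasticsearch code):
// centralize the "not open" message so the producer and the checker can never drift apart.
public final class JobNotOpenMessage {
    private JobNotOpenMessage() {}

    // Single source of truth for the error text.
    public static String notOpenMessage(String jobId) {
        return "Cannot perform requested action because job [" + jobId + "] is not open";
    }

    // The failure check can now match the exact message instead of a loose substring.
    public static boolean isJobNotOpen(Exception e, String jobId) {
        return notOpenMessage(jobId).equals(e.getMessage());
    }
}
```

With this, a message like "job [x] is not openable" would no longer be silently swallowed by a substring match.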
I extended the check to be more explicit.
```java
// Post events and verify API completes quickly (async behavior)
long startTime = System.currentTimeMillis();
postScheduledEvents(calendarId, events);
long duration = System.currentTimeMillis() - startTime;

assertThat("API should complete quickly with async implementation", duration, lessThan(5000L));
```
Is there a chance that this might end up being flaky if the machine running the test is overloaded and/or a GC happens while postScheduledEvents() is being called? Can we guarantee that it will never take longer than 5 seconds for postScheduledEvents() to return? Also, if the implementation wasn't async, could we guarantee that it would always take longer than 5 seconds? If not, then maybe we don't need this check since it's not providing much value, or perhaps some other way to differentiate between async and sync implementations could be used?
The assertion will eventually fail in CI and the fix will be to bump the timeout at which point the assertion starts to become meaningless.
I removed the assertion. The real verification is that the ActionFilter captured the call (lines 525-536), and the async behavior is verified by the immediate response rather than waiting for completion.
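One illustrative way to express that verification in plain Java (a self-contained mimic, not the actual UpdateProcessActionTrackerPlugin code): a thread-safe set records which jobs received an update as the action filter observes them, and the test polls the set instead of asserting on wall-clock timing.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch (names are not from the PR): a shared, thread-safe record
// of which jobs received a process update, plus a simple poll standing in for assertBusy().
public final class UpdateTracker {
    // Populated from the action filter as updates are observed.
    public static final Set<String> updatedJobIds = ConcurrentHashMap.newKeySet();

    public static void recordUpdate(String jobId) {
        updatedJobIds.add(jobId);
    }

    // Poll until the job shows up or the deadline passes.
    public static boolean waitForUpdate(String jobId, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (updatedJobIds.contains(jobId)) {
                return true;
            }
            Thread.sleep(10);
        }
        return updatedJobIds.contains(jobId);
    }
}
```

Polling for an observed effect is robust to slow machines and GC pauses in a way that an upper bound on elapsed time is not.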
```java
    ScheduledEventsIT.UpdateProcessActionTrackerPlugin.updatedJobIds,
    contains(jobId)
    );
}, 5, TimeUnit.SECONDS);
```
Does this timeout need to be so small? It seems like something that could easily become flaky if the hardware running the test was overloaded. Unless the code is making a guarantee somewhere that it will always take less than 5 seconds for the action filter to be applied, this should probably be using a longer, default timeout.
Changed to use the default assertBusy timeout.
```java
/**
 * Test calendar updates with closed jobs (should not fail)
 */
public void testCalendarUpdateWithClosedJobs() throws IOException {
```
Would it be worthwhile having a test where there are both closed and non-closed jobs, and confirming that the closed ones are skipped but the non-closed ones are updated?
I added testCalendarUpdateWithMixedOpenAndClosedJobs() to cover this use case.
The overall idea makes sense to me. My concern is that most of the new logging is unnecessary and should be at debug level (especially the log that tracks how long the change took).
UpdateJobProcessNotifier empties the queue every 1 second. To overflow that queue, the cluster must have over 1,000 open jobs, or there must be multiple updates per job. The latter could happen if multiple calendars are updated in less than a second.

Wouldn't it be simpler to make the queue in UpdateJobProcessNotifier unbounded or larger? Then the behaviour of serialising the updates is preserved, rather than firing off all updates at once. The max number of open jobs on a node is limited by xpack.ml.max_open_jobs, so the queue size could be a function of that setting and the number of ml nodes in the cluster, plus some overhead for multiple updates.

I also see optimisations that could be considered for a follow-up PR. UpdateJobProcessNotifier should collapse all the calendar updates for a job, since only the latest calendar events are used anyway. In the case where 1,000 jobs are updated because a single calendar has changed, the search to get the calendar events is executed 1,000 times, once for each job. It would be better for each ml node to search the latest events once and then update all the jobs on that node.
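A hedged sketch of the sizing rule suggested above. All names and constants here are illustrative, not values from the Elasticsearch codebase: the capacity is derived from the per-node job limit, the ml node count, and a headroom factor for repeated updates, floored at the current default of 1000.

```java
// Hypothetical sizing rule for the UpdateJobProcessNotifier queue, following the
// suggestion above. The headroom factor and the 1000 floor are illustrative.
public final class QueueSizing {
    private QueueSizing() {}

    public static int queueCapacity(int maxOpenJobsPerNode, int mlNodeCount, int updatesPerJobHeadroom) {
        // Scale with xpack.ml.max_open_jobs and the number of ml nodes.
        long capacity = (long) maxOpenJobsPerNode * mlNodeCount * updatesPerJobHeadroom;
        // Never shrink below the current default, and guard against int overflow.
        return (int) Math.min(Integer.MAX_VALUE, Math.max(1000L, capacity));
    }
}
```

For example, 512 open jobs per node, 4 ml nodes, and headroom for 2 updates per job would give a 4096-slot queue.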
```diff
     }
 } else {
-    logger.debug("[{}] No process update required for job update: {}", jobUpdate::getJobId, jobUpdate::toString);
+    logger.debug("[{}] No process update required for job update: {}", jobUpdate.getJobId(), jobUpdate.toString());
```
Using a supplier means that jobUpdate.toString() won't be evaluated unless debug level logging is enabled. jobUpdate.getJobId() is trivial but jobUpdate.toString() is not. What's the reasoning behind this change?
Hm. I think I changed the logger to use Elastic's one, and now I get the following error:

```
error: method debug in interface Logger cannot be applied to given types;
logger.debug("[{}] No process update required for job update: {}", jobUpdate::getJobId, jobUpdate::toString);
```

But I'll rewrite it to prevent premature evaluation.
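The supplier-vs-eager distinction discussed in this thread can be mimicked in plain Java. This self-contained sketch (the LazyLog and expensiveToString names are illustrative, not from the PR) shows the point the reviewer is making: with a Supplier argument, the costly rendering only runs when the level is enabled, which is what passing jobUpdate::toString buys in the real code.

```java
import java.util.function.Supplier;

// Minimal sketch of lazy log-argument evaluation using plain java.util.function.Supplier.
// Log4j's Logger.debug(String, Supplier<?>...) overload follows the same principle.
public final class LazyLog {
    public static boolean debugEnabled = false;
    public static int toStringCalls = 0;

    static String expensiveToString() {
        toStringCalls++; // count how often the costly rendering actually runs
        return "big-job-update";
    }

    public static void debug(String template, Supplier<String> arg) {
        if (debugEnabled) {
            // arg.get() is only invoked here, when debug logging is on
            System.out.println(template.replace("{}", arg.get()));
        }
    }
}
```

With eager arguments (`jobUpdate.toString()`), the rendering cost is paid on every call regardless of the log level.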
…bugfix/limited-update-notification-queue
Thank you for the reviews. I updated code and answered questions.
…roved readability
I considered this approach, but bypassing the queue is the better solution for several reasons:

The optimization you mentioned (collapsing calendar updates per job) is a great follow-up idea and would work well with this parallel approach. I'll create an issue to capture it and work on it later.
@DonalEvans, @benwtrent, @davidkyle thank you for your comments. I introduced the suggested changes. Looking forward to your new feedback.
LGTM
Fixes an issue where calendar events failed to update some jobs when associated with large numbers of jobs (>1000), due to queue capacity limits and sequential processing.

Problem:

UpdateJobProcessNotifier has a 1000-item queue and processes updates sequentially. It uses offer() on the queue, which silently drops updates when the queue is full. However, calendar/filter updates don't need ordering guarantees, so JobManager.submitJobEventUpdate() can bypass the queue and avoid the bottleneck of the queue size.

Another problem is the "fire-and-forget" pattern: submitJobEventUpdate() returns immediately without waiting for the update to complete. I introduce RefCountingListener to track the calendar updates. We start a background thread that updates the jobs and tracks succeeded, failed, and skipped jobs, while the request returns immediately to prevent a timeout.

Finally, in case the problem with failed job updates persists, I enhanced the logging throughout the system to create a trace for future diagnostics.

Fixes #129777
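The fan-out-and-count pattern the description outlines can be sketched in plain Java. The real change uses Elasticsearch's RefCountingListener; this self-contained mimic (all names illustrative) uses counters and a latch to show the succeeded/failed/skipped accounting, with closed jobs skipped rather than treated as failures.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Plain-Java mimic of the fan-out pattern described above: attempt one update
// per job, count succeeded/failed/skipped, and complete once all are accounted for.
public final class FanOutUpdates {
    public final AtomicInteger succeeded = new AtomicInteger();
    public final AtomicInteger skipped = new AtomicInteger();
    public final AtomicInteger failed = new AtomicInteger();

    public void updateAll(List<String> jobIds, Predicate<String> isOpen) {
        CountDownLatch done = new CountDownLatch(jobIds.size());
        for (String jobId : jobIds) {
            // In the real code each update is an async action; here it runs inline.
            try {
                if (isOpen.test(jobId)) {
                    succeeded.incrementAndGet();
                } else {
                    skipped.incrementAndGet(); // closed jobs are skipped, not failed
                }
            } catch (RuntimeException e) {
                failed.incrementAndGet();
            } finally {
                done.countDown();
            }
        }
        try {
            done.await(); // all updates accounted for before reporting
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The key property mirrored here is that the caller learns an aggregate outcome only after every per-job update has resolved, instead of dropping updates silently from a bounded queue.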