enhancement(5423): added logic to replace scheduler with long-wait scheduler in case of exceeded unauth response limit #6619

Open
wants to merge 6 commits into main from enhancement/5423_remove_automatic_unenrollment_after_auth_failure

Conversation

@kaanyalti (Contributor) commented Jan 28, 2025

  • Enhancement

What does this PR do?

Removes the forced unenroll from the fleet gateway. Adds logic in the fleet gateway to switch out the scheduler used for checkins: if the unauthorized response limit is exceeded, the scheduler is replaced with one that has a long wait duration. When the gateway receives a successful response, it switches back to the regular scheduler with the shorter wait duration.
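
The scheduler swap can be illustrated with a minimal, self-contained Go sketch. This is illustrative only: the checkinScheduler type, its field names, and the 7-failure / 1-hour values are assumptions taken from the review discussion, not the identifiers used in this PR.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkinScheduler decides how long to wait before the next Fleet checkin.
// Hypothetical sketch; the real PR swaps scheduler instances inside the fleet gateway.
type checkinScheduler struct {
	regular     time.Duration // normal interval between checkins
	longWait    time.Duration // interval used once the unauthorized limit is exceeded
	unauthLimit int           // consecutive unauthorized responses before switching
	unauthCount int
}

// observe records the result of the latest checkin.
func (s *checkinScheduler) observe(statusCode int) {
	if statusCode == http.StatusUnauthorized {
		s.unauthCount++
		return
	}
	s.unauthCount = 0 // any successful response switches back to the regular schedule
}

// next returns the wait before the next checkin attempt.
func (s *checkinScheduler) next() time.Duration {
	if s.unauthCount >= s.unauthLimit {
		return s.longWait
	}
	return s.regular
}

func main() {
	s := &checkinScheduler{regular: time.Second, longWait: time.Hour, unauthLimit: 7}
	for i := 0; i < 8; i++ {
		s.observe(http.StatusUnauthorized)
	}
	fmt.Println(s.next()) // 1h0m0s: limit exceeded, long-wait scheduler in use
	s.observe(http.StatusOK)
	fmt.Println(s.next()) // 1s: back to the regular scheduler
}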

Why is it important?

Currently the agent unenrolls after 7 unauthorized error responses. This can cause problems in disaster recovery scenarios, where users may have to intervene manually.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

None

How to test this PR locally

  • Create ESS deployment
  • Build the agent locally
  • Enroll the agent
  • In Dev Tools, find the agent's API key and delete it
GET /.security-7/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "name": "AGENT ID"
          }
        }
      ]
    }
  }
}
DELETE /_security/api_key
{
  "ids": ["KEY ID"]
}
  • Follow the agent logs: sudo elastic-agent logs -f
  • After a while you will see the retrieved an invalid api key error '10' times. will use long scheduler error message in the logs

Due to the backoff algorithm used, this test can take a long time. To see immediate results, comment out the following code block:

			if !bo.Wait() {
				if ctx.Err() != nil {
					// if the context is cancelled, break out of the loop
					break
				}

				// This should not really happen, but just in-case this error is used to show that
				// something strange occurred and we want to log it and report it.
				err := errors.New(
					"checkin retry loop was stopped",
					errors.TypeNetwork,
					errors.M(errors.MetaKeyURI, f.client.URI()),
				)

				f.log.Error(err)
				f.errCh <- err
				return nil, err
			}

Related issues

mergify bot (Contributor) commented Jan 28, 2025

This pull request does not have a backport label. Could you fix it @kaanyalti? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch; /d is the digit

mergify bot (Contributor) commented Jan 28, 2025

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please use the backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 28, 2025
@kaanyalti kaanyalti marked this pull request as ready for review January 30, 2025 01:22
@kaanyalti kaanyalti requested a review from a team as a code owner January 30, 2025 01:22
@kaanyalti kaanyalti requested review from swiatekm and pchila January 30, 2025 01:22
@swiatekm swiatekm added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jan 30, 2025
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

The review threads below are on the new fleet gateway scheduler defaults:

Duration:    1 * time.Second,        // time between successful calls
Jitter:      500 * time.Millisecond, // used as a jitter for duration
ErrDuration: 1 * time.Hour,          // time between calls when the agent exceeds unauthorized response limit

A contributor commented:
This seems a bit high for a default. I wouldn't want to go higher than 5 minutes.

@kaanyalti (Contributor Author) commented Jan 30, 2025

The duration for the error case was mentioned in the issue by @cmacknz, so that's what I went with, but I can use something shorter.

The initial proposal is that instead of unenrolling, we should switch to checking in once per hour. A successful checkin must return the agent to its original checkin interval.

A member commented:

Some context: This is explicitly handling a force unenroll from the Fleet UI, which just revokes API keys but leaves the agent running.

When this happens, the agent keeps checking in indefinitely until the service stops, which might never happen. This pollutes our telemetry with very rapid retries of requests that will never succeed and places unnecessary load on Fleet Server. So we tried to detect API key revocation and unenroll when that happens.

This didn't consider the case where some other bug or disaster unintentionally caused mass API key revocation or unavailability (e.g. Fleet Server can't reach Elasticsearch for 1+ hour). We recently had a support case where exactly this happened and once an agent is unenrolled there is no way to recover it.

Since unenroll is destructive and unrecoverable, the next best thing is reducing the request frequency. That is where my arbitrary choice of 1 hour after 7 unauthorized requests came from.

We should still try to protect ourselves from accidentally checking in at 1 hour intervals. I think this logic now will reset every time the agent restarts, so just rebooting the machine or agent service should put us back on the fast path.

It would probably be better to gradually ramp the duration up to 1 hour instead of jumping to it immediately, but considering we completely unenrolled before, this is still a net improvement. Hitting the threshold (seven consecutive unauthorized requests) that kicks us into this state is also very uncommon.

A member commented:

I would suggest renaming ErrDuration to ErrConsecutiveUnauthDuration to be as specific as possible. We do not want all errors to have a 1 hour retry, only errors after 7 consecutive unauthorized errors.
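
For illustration only, the renamed field could sit next to the existing defaults roughly like this. The fleetGatewaySettings struct name, package name, and layout are assumptions; only the Duration and Jitter settings appear in the reviewed snippet above.

package fleet // hypothetical package name, for illustration only

import "time"

// Sketch of the suggested rename; not the actual struct in the PR.
type fleetGatewaySettings struct {
	Duration                     time.Duration // time between successful calls
	Jitter                       time.Duration // used as a jitter for duration
	ErrConsecutiveUnauthDuration time.Duration // retry interval only after consecutive unauthorized responses
}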

@kaanyalti kaanyalti requested a review from swiatekm January 31, 2025 03:30
@swiatekm (Contributor) left a comment:

LGTM, although I still think the default should be lower than 1 hour.

@kaanyalti kaanyalti force-pushed the enhancement/5423_remove_automatic_unenrollment_after_auth_failure branch from 0ae252d to c884611 Compare February 3, 2025 17:27
@kaanyalti kaanyalti requested a review from cmacknz February 4, 2025 21:40
@kaanyalti kaanyalti force-pushed the enhancement/5423_remove_automatic_unenrollment_after_auth_failure branch 2 times, most recently from 29c402d to 2ba752a Compare February 6, 2025 21:05
@pierrehilbert (Contributor) commented:

@cmacknz I think we should also backport this to 9.0, but I would like to get your opinion here.

@pchila (Member) left a comment:

LGTM
I agree with @cmacknz that we should progressively ramp up the polling interval to the maximum duration of 1h, but that can be done in a follow-up PR.
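
One possible shape for that follow-up is sketched below. It is purely illustrative: rampUpInterval and its parameters are hypothetical and are not part of this PR or the follow-up issue; it simply doubles the interval on each further failure until it reaches the 1h cap.

package main

import (
	"fmt"
	"time"
)

// rampUpInterval doubles the checkin interval on every further failure after
// the unauthorized limit is hit, instead of jumping straight to the maximum.
// Hypothetical sketch only.
func rampUpInterval(current, base, max time.Duration) time.Duration {
	if current < base {
		return base
	}
	next := current * 2
	if next > max {
		return max
	}
	return next
}

func main() {
	interval := time.Duration(0)
	for i := 0; i < 8; i++ {
		interval = rampUpInterval(interval, time.Minute, time.Hour)
		fmt.Println(interval) // 1m, 2m, 4m, ..., capped at 1h
	}
}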

@kaanyalti (Contributor Author) commented Feb 7, 2025

Created this issue as a follow-up to add a gradual ramp-up to 1 hour: #6760

@kaanyalti kaanyalti force-pushed the enhancement/5423_remove_automatic_unenrollment_after_auth_failure branch from 130246e to f606ca8 Compare February 7, 2025 16:26
@kalramani commented:

Tagging the corresponding Elastic case #01815174.

Labels
backport-8.x (Automated backport to the 8.x branch with mergify), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)
Development

Successfully merging this pull request may close these issues.

Remove automatic unenrollment after 7 Fleet authentication failures
7 participants