enhancement(5423): added logic to replace scheduler with long-wait scheduler in case of exceeded unauth response limit #6619

Open
wants to merge 6 commits into main from enhancement/5423_remove_automatic_unenrollment_after_auth_failure

Conversation

@kaanyalti (Contributor) commented Jan 28, 2025

  • Enhancement

What does this PR do?

Removes the forced unenroll from the fleet gateway. Adds logic in the fleet gateway to switch out the scheduler used for checkins: if the unauthorized response limit is exceeded, the scheduler is replaced with one that has a long wait duration. When the gateway receives a successful response, it switches back to the regular scheduler with the shorter wait duration.
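
The scheduler swap can be illustrated with a minimal, self-contained Go sketch. This is illustrative only: the checkinScheduler type, its field names, and the 7-failure / 1-hour values are assumptions taken from the review discussion, not the identifiers used in this PR.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkinScheduler decides how long to wait before the next Fleet checkin.
// Hypothetical sketch; the real PR swaps scheduler instances inside the fleet gateway.
type checkinScheduler struct {
	regular     time.Duration // normal interval between checkins
	longWait    time.Duration // interval used once the unauthorized limit is exceeded
	unauthLimit int           // consecutive unauthorized responses before switching
	unauthCount int
}

// observe records the result of the latest checkin.
func (s *checkinScheduler) observe(statusCode int) {
	if statusCode == http.StatusUnauthorized {
		s.unauthCount++
		return
	}
	s.unauthCount = 0 // any successful response switches back to the regular schedule
}

// next returns the wait before the next checkin attempt.
func (s *checkinScheduler) next() time.Duration {
	if s.unauthCount >= s.unauthLimit {
		return s.longWait
	}
	return s.regular
}

func main() {
	s := &checkinScheduler{regular: time.Second, longWait: time.Hour, unauthLimit: 7}
	for i := 0; i < 8; i++ {
		s.observe(http.StatusUnauthorized)
	}
	fmt.Println(s.next()) // 1h0m0s: limit exceeded, long-wait scheduler in use
	s.observe(http.StatusOK)
	fmt.Println(s.next()) // 1s: back to the regular scheduler
}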

Why is it important?

Currently the agent unenrolls after 7 unauthorized error responses. This can cause problems in disaster recovery scenarios, where users may have to intervene manually.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

None

How to test this PR locally

  • Create ESS deployment
  • Build the agent locally
  • Enroll the agent
  • In Dev Tools, find the agent's API key and delete it
GET /.security-7/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "name": "AGENT ID"
          }
        }
      ]
    }
  }
}
DELETE /_security/api_key
{
  "ids": ["KEY ID"]
}
  • Follow the agent logs: sudo elastic-agent logs -f
  • After a while you will see the retrieved an invalid api key error '10' times. will use long scheduler error message in the logs

Due to the backoff algorithm used, this test can take a long time. To see immediate results, comment out the following code block:

			if !bo.Wait() {
				if ctx.Err() != nil {
					// if the context is cancelled, break out of the loop
					break
				}

				// This should not really happen, but just in-case this error is used to show that
				// something strange occurred and we want to log it and report it.
				err := errors.New(
					"checkin retry loop was stopped",
					errors.TypeNetwork,
					errors.M(errors.MetaKeyURI, f.client.URI()),
				)

				f.log.Error(err)
				f.errCh <- err
				return nil, err
			}

Related issues

mergify bot (Contributor) commented Jan 28, 2025

This pull request does not have a backport label. Could you fix it @kaanyalti? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch; /d is the digit

mergify bot (Contributor) commented Jan 28, 2025

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please use the backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 28, 2025
@kaanyalti kaanyalti marked this pull request as ready for review January 30, 2025 01:22
@kaanyalti kaanyalti requested a review from a team as a code owner January 30, 2025 01:22
@kaanyalti kaanyalti requested review from swiatekm and pchila January 30, 2025 01:22
@swiatekm swiatekm added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jan 30, 2025
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

The review threads below are on the new fleet gateway scheduler defaults:

Duration:    1 * time.Second,        // time between successful calls
Jitter:      500 * time.Millisecond, // used as a jitter for duration
ErrDuration: 1 * time.Hour,          // time between calls when the agent exceeds unauthorized response limit

A contributor commented:
This seems a bit high for a default. I wouldn't want to go higher than 5 minutes.

@kaanyalti (Contributor Author) commented Jan 30, 2025

The duration for the error case was mentioned in the issue by @cmacknz, so that's what I went with, but I can use something shorter.

The initial proposal is that instead of unenrolling, we should switch to checking in once per hour. A successful checkin must return the agent to its original checkin interval.

A member commented:

Some context: This is explicitly handling a force unenroll from the Fleet UI, which just revokes API keys but leaves the agent running.

When this happens, the agent keeps checking in indefinitely until the service stops, which might never happen. This pollutes our telemetry with very rapid retries of requests that will never succeed and places unnecessary load on Fleet Server. So we tried to detect API key revocation and unenroll when that happens.

This didn't consider the case where some other bug or disaster unintentionally caused mass API key revocation or unavailability (e.g. Fleet Server can't reach Elasticsearch for 1+ hour). We recently had a support case where exactly this happened and once an agent is unenrolled there is no way to recover it.

Since unenroll is destructive and unrecoverable, the next best thing is reducing the request frequency. That is where my arbitrary choice of 1 hour after 7 unauthorized requests came from.

We should still try to protect ourselves from accidentally checking in at 1 hour intervals. I think this logic now will reset every time the agent restarts, so just rebooting the machine or agent service should put us back on the fast path.

It would probably be better to gradually ramp the duration up to 1 hour instead of jumping to it immediately, but considering we completely unenrolled before, this is still a net improvement. Hitting the threshold (seven consecutive unauthorized requests) that kicks us into this state is also very uncommon.

A member commented:

I would suggest renaming ErrDuration to ErrConsecutiveUnauthDuration to be as specific as possible. We do not want all errors to have a 1 hour retry, only errors after 7 consecutive unauthorized errors.
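
For illustration only, the renamed field could sit next to the existing defaults roughly like this. The fleetGatewaySettings struct name, package name, and layout are assumptions; only the Duration and Jitter settings appear in the reviewed snippet above.

package fleet // hypothetical package name, for illustration only

import "time"

// Sketch of the suggested rename; not the actual struct in the PR.
type fleetGatewaySettings struct {
	Duration                     time.Duration // time between successful calls
	Jitter                       time.Duration // used as a jitter for duration
	ErrConsecutiveUnauthDuration time.Duration // retry interval only after consecutive unauthorized responses
}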

@kaanyalti kaanyalti requested a review from swiatekm January 31, 2025 03:30
@swiatekm (Contributor) left a comment:

LGTM, although I still think the default should be lower than 1 hour.

@kaanyalti kaanyalti force-pushed the enhancement/5423_remove_automatic_unenrollment_after_auth_failure branch from 0ae252d to c884611 Compare February 3, 2025 17:27
@kaanyalti kaanyalti requested a review from cmacknz February 4, 2025 21:40
@kaanyalti kaanyalti force-pushed the enhancement/5423_remove_automatic_unenrollment_after_auth_failure branch 2 times, most recently from 29c402d to 2ba752a Compare February 6, 2025 21:05
@pierrehilbert (Contributor) commented:

@cmacknz I think we should also backport this to 9.0, but I would like to get your opinion here.

@pchila (Member) left a comment:

LGTM
I agree with @cmacknz that we should progressively ramp up the polling interval to the maximum duration of 1h, but that can be done in a follow-up PR.
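
One possible shape for that follow-up is sketched below. It is purely illustrative: rampUpInterval and its parameters are hypothetical and are not part of this PR or the follow-up issue; it simply doubles the interval on each further failure until it reaches the 1h cap.

package main

import (
	"fmt"
	"time"
)

// rampUpInterval doubles the checkin interval on every further failure after
// the unauthorized limit is hit, instead of jumping straight to the maximum.
// Hypothetical sketch only.
func rampUpInterval(current, base, max time.Duration) time.Duration {
	if current < base {
		return base
	}
	next := current * 2
	if next > max {
		return max
	}
	return next
}

func main() {
	interval := time.Duration(0)
	for i := 0; i < 8; i++ {
		interval = rampUpInterval(interval, time.Minute, time.Hour)
		fmt.Println(interval) // 1m, 2m, 4m, ..., capped at 1h
	}
}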

@kaanyalti (Contributor Author) commented Feb 7, 2025

Created this issue as a follow-up to add a gradual ramp-up to 1 hour: #6760

@kaanyalti kaanyalti force-pushed the enhancement/5423_remove_automatic_unenrollment_after_auth_failure branch from 130246e to f606ca8 Compare February 7, 2025 16:26
@kalramani commented:

Tagging the corresponding Elastic case #01815174.

Labels
backport-8.x (Automated backport to the 8.x branch with mergify), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)
Development

Successfully merging this pull request may close these issues.

Remove automatic unenrollment after 7 Fleet authentication failures
7 participants