enhancement(5423): added logic to replace scheduler with long-wait scheduler in case of exceeded unauth response limit #6619
base: main
Conversation
This pull request does not have a backport label. Could you fix it @kaanyalti? 🙏
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
```go
Duration:    1 * time.Second,        // time between successful calls
Jitter:      500 * time.Millisecond, // used as a jitter for duration
ErrDuration: 1 * time.Hour,          // time between calls when the agent exceeds unauthorized response limit
```
This seems a bit high for a default. I wouldn't want to go higher than 5 minutes.
The duration for the error case was mentioned in the issue by @cmacknz, so that's what I went with, but I can use something shorter.
The initial proposal is that instead of unenrolling, we should switch to checking in once per hour. A successful checkin must return the agent to its original checkin interval.
Some context: This is explicitly handling a force unenroll from the Fleet UI, which just revokes API keys but leaves the agent running.
When this happens, the agent keeps checking in indefinitely until the service stops, which might not ever happen. This pollutes our telemetry with very rapid retries of requests that will never succeed and places unnecessary load on Fleet Server. So we tried to detect API key revocation and unenroll when that happens.
This didn't consider the case where some other bug or disaster unintentionally caused mass API key revocation or unavailability (e.g. Fleet Server can't reach Elasticsearch for 1+ hour). We recently had a support case where exactly this happened and once an agent is unenrolled there is no way to recover it.
Since unenroll is destructive and unrecoverable, the next best thing is reducing the request frequency. This is where my arbitrary choice of 1 hour after 7 unauthorized requests came from.
We should still try to protect ourselves from accidentally checking in at 1 hour intervals. I think this logic now will reset every time the agent restarts, so just rebooting the machine or agent service should put us back on the fast path.
It would probably be better to gradually ramp up the duration to 1 hour instead of just jumping to it immediately, but considering we completely unenrolled before, this is still a net improvement. The threshold (seven consecutive unauthorized requests) that kicks us into this state is also very uncommon.
I would suggest renaming ErrDuration to ErrConsecutiveUnauthDuration to be as specific as possible. We do not want all errors to have a 1 hour retry, only errors after 7 consecutive unauthorized errors.
LGTM, although I still think the default should be lower than 1 hour.
internal/pkg/agent/application/gateway/fleet/fleet_gateway_test.go (outdated; resolved)
@cmacknz I think we should also backport this to 9.0 but would like to get your opinion here.
LGTM
I agree with @cmacknz that we should progressively ramp up the polling interval to the maximum duration of 1h, but that can be done in a follow-up PR.
Created this issue as a follow-up to add gradual ramp-up to 1 hour.
Tagging the corresponding Elastic case #01815174.
What does this PR do?
Removes the forced unenroll from the fleet gateway. Adds logic in the fleet gateway to switch out the scheduler used for checkins. If the unauthorized response limit is exceeded, the scheduler is replaced with one that has a long wait duration. When the gateway receives a successful response, it switches back to using the regular scheduler with the shorter wait duration.
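The switching behavior described above can be sketched as follows; a minimal illustration assuming a 7-response threshold, with all type and field names invented for the example rather than taken from the actual elastic-agent code:

```go
package main

import (
	"fmt"
	"time"
)

// Scheduler is a minimal stand-in for the agent's checkin scheduler.
type Scheduler struct {
	Duration time.Duration
}

// gateway tracks consecutive unauthorized responses and swaps schedulers.
type gateway struct {
	regular   *Scheduler // normal checkin interval
	longWait  *Scheduler // interval used after the unauth limit is exceeded
	current   *Scheduler
	unauthed  int
	unauthCap int
}

func newGateway() *gateway {
	g := &gateway{
		regular:   &Scheduler{Duration: 1 * time.Second},
		longWait:  &Scheduler{Duration: 1 * time.Hour},
		unauthCap: 7,
	}
	g.current = g.regular
	return g
}

// onCheckinResult mimics the PR's behavior: exceeding the unauthorized
// response limit switches to the long-wait scheduler; any successful
// checkin resets the counter and restores the regular scheduler.
func (g *gateway) onCheckinResult(unauthorized bool) {
	if unauthorized {
		g.unauthed++
		if g.unauthed >= g.unauthCap {
			g.current = g.longWait
		}
		return
	}
	g.unauthed = 0
	g.current = g.regular
}

func main() {
	g := newGateway()
	for i := 0; i < 7; i++ {
		g.onCheckinResult(true)
	}
	fmt.Println(g.current.Duration) // long-wait after 7 unauthorized responses
	g.onCheckinResult(false)
	fmt.Println(g.current.Duration) // back to the regular interval
}
```

Because the counter lives only in memory, restarting the agent process returns it to the fast path, matching the reset-on-restart behavior discussed in the review.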
Why is it important?
Currently the agent unenrolls after 7 unauthorized error responses. This can cause problems in disaster recovery scenarios where users may have to manually intervene.
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added an entry in ./changelog/fragments using the changelog tool
- [ ] I have added an integration test or an E2E test

Disruptive User Impact
None
How to test this PR locally
Run `sudo elastic-agent logs -f` and wait for the `retrieved an invalid api key error '10' times. will use long scheduler` error message in the logs. Due to the backoff algorithm used, this test can take a long time. In order to see immediate results, comment out the following code block.
Related issues