Better LLM retry behavior #6557
base: main
Conversation
RateLimitError,
ServiceUnavailableError,
503 is a transitory error, we could probably keep it?
Hmm. It's transitory but also unexpected...
I'm open to it, but I lean towards telling the user their LLM is flaking out rather than making OpenHands look slow.
I kinda agree with you, actually. We've always had a problem understanding our retry settings, because it's hard to figure out a sensible default for "unexpected stuff happened".
And now we do allow the user to continue normally after the error is reported.
Eval is the exception; I'd love to hear from Xingyao on that.
There are some open issues on litellm about this: the exceptions as defined mix permanent and transitory errors from the provider, and we have some weird code because of that. I agree that cleaning them up and starting again is reasonable. 😅
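For context, here is a minimal sketch of the kind of split being discussed, using exception classes from `litellm.exceptions`. The grouping is illustrative only, not the exact set this PR settles on, and the available class names can vary by litellm version:

```python
# Illustrative grouping only -- not the exact set chosen in this PR.
from litellm.exceptions import (
    APIConnectionError,
    AuthenticationError,
    BadRequestError,
    InternalServerError,
    RateLimitError,
    ServiceUnavailableError,
    Timeout,
)

# Transient: plausibly worth retrying with backoff.
TRANSIENT_EXCEPTIONS = (
    APIConnectionError,
    RateLimitError,
    ServiceUnavailableError,  # 503 -- the case debated above
    InternalServerError,
    Timeout,
)

# Permanent: retrying will not help; surface the error to the user instead.
PERMANENT_EXCEPTIONS = (
    AuthenticationError,
    BadRequestError,
)
```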
Small related detail, there's a try/except due to retries in
Please see also a small follow-up here:
Thanks @enyst! Any lingering issues here?
I think it would be great if @xingyaoww could take a look, because it's possible that the removed exceptions do happen in practice.
Up to you.
End-user friendly description of the problem this fixes or functionality that this introduces
no changelog
Give a summary of what the PR does, explaining any non-trivial design decisions
The LLM is retrying a lot of unrecoverable exceptions, which makes it look like the app is just stuck.
The current configuration also waits a total of 11 minutes (!) for a good response, not including the request time, which can add ~5-8 minutes to that total. So the app looks VERY stuck.
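To make the timing concrete, here is a rough back-of-the-envelope sketch with made-up numbers (the actual defaults live in the retry config and may differ), showing how exponential backoff waits between attempts add up to several minutes before counting any request time:

```python
# Hypothetical retry settings for illustration; not the project's actual defaults.
num_retries = 8   # total attempts
min_wait = 15     # seconds, first backoff
max_wait = 120    # seconds, backoff cap

# Exponential backoff: double the wait after each failure, capped at max_wait.
waits = []
wait = min_wait
for _ in range(num_retries - 1):  # one wait between each pair of attempts
    waits.append(min(wait, max_wait))
    wait *= 2

print(waits)            # [15, 30, 60, 120, 120, 120, 120]
print(sum(waits) / 60)  # ~9.75 minutes of pure waiting, before any request time
```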
We could potentially make the removed exceptions configurable, if they are common enough that eval needs them. CC @xingyaoww
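If eval does turn out to need the broader set, one way to do that (names here are hypothetical, sketching the idea rather than the actual OpenHands config) would be a config field listing which litellm exception names get retried:

```python
from dataclasses import dataclass, field

import litellm.exceptions


# Hypothetical config; the real LLMConfig fields in OpenHands may differ.
@dataclass
class LLMRetryConfig:
    # Names of litellm exception classes that should be retried with backoff.
    retryable_exceptions: list[str] = field(
        default_factory=lambda: ["APIConnectionError", "RateLimitError"]
    )


def resolve_retryable(cfg: LLMRetryConfig) -> tuple[type[Exception], ...]:
    """Map configured names to litellm exception classes, ignoring unknown names."""
    return tuple(
        getattr(litellm.exceptions, name)
        for name in cfg.retryable_exceptions
        if hasattr(litellm.exceptions, name)
    )


# Eval could then opt back into the broader set without changing the default:
eval_cfg = LLMRetryConfig(
    retryable_exceptions=["APIConnectionError", "RateLimitError", "ServiceUnavailableError"]
)
```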
Link of any specific issues this addresses
To run this PR locally, use the following command: