
Better LLM retry behavior #6557

Status: Open. Wants to merge 9 commits into main.
Conversation

@rbren (Collaborator) commented Jan 30, 2025

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
    no changelog

Give a summary of what the PR does, explaining any non-trivial design decisions

The LLM is retrying a lot of unrecoverable exceptions, which makes it look like the app is just stuck.

The current retry configuration also waits a total of 11 minutes (!) for a good response, not counting the request time itself, which can add another ~5-8 minutes. So the app looks VERY stuck.

We could potentially make this configurable if these errors are common enough that eval needs to retry them. CC @xingyaoww
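For illustration, a minimal sketch of the direction described above, using tenacity-style retry settings; the exception set, attempt count, and wait bounds here are assumptions for the example, not the PR's actual values:

```python
from litellm import completion
from litellm.exceptions import RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Only clearly transient provider errors are retried; everything else
# surfaces to the user immediately instead of silently stalling the app.
RETRYABLE = (RateLimitError,)

@retry(
    retry=retry_if_exception_type(RETRYABLE),
    stop=stop_after_attempt(4),                          # give up quickly
    wait=wait_exponential(multiplier=1, min=2, max=30),  # cap each backoff wait
    reraise=True,  # surface the real error, not a tenacity RetryError
)
def completion_with_retry(**kwargs):
    return completion(**kwargs)
```

With four attempts and each wait capped at 30 seconds, the worst case is at most three waits of 30 seconds, well under the 11 minutes described above.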


Link of any specific issues this addresses


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:185288b-nikolaik \
  --name openhands-app-185288b \
  docker.all-hands.dev/all-hands-ai/openhands:185288b

@rbren changed the title from "stop retrying on all exceptions" to "Better LLM retry behavior" on Jan 30, 2025
Review thread on these lines from the diff:

    RateLimitError,
    ServiceUnavailableError,
Collaborator:

503 is a transitory error; we could probably keep it?

Collaborator (author):

Hmm. It's transitory but also unexpected...

I'm open to it, but I lean toward telling the user their LLM is flaking out rather than having OpenHands look slow.
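Concretely, the question above is whether ServiceUnavailableError stays in the tuple of retryable exceptions. A hedged sketch of the two options (the constant name is illustrative, not the PR's actual identifier):

```python
from litellm.exceptions import RateLimitError, ServiceUnavailableError

# Leaning of this PR: only retry rate limits; surface a 503 to the user
# right away rather than silently waiting out provider flakiness.
RETRYABLE = (RateLimitError,)

# Alternative raised in review: also retry 503s, since the provider may
# recover on its own after a short wait.
# RETRYABLE = (RateLimitError, ServiceUnavailableError)
```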

Collaborator:

I kind of agree with you, actually. We've always had trouble understanding our retry settings, because it's hard to pick a sensible default for "unexpected stuff happened".

And now we do allow the user to continue normally after reporting the error.

Eval is the exception; I'd love to hear from Xingyao on that.

@enyst (Collaborator) commented Jan 30, 2025

There are some issues on litellm about this: the exceptions as defined mix permanent and transitory errors from the provider, and we have some weird code because of that. I agree that cleaning them up and starting again is reasonable. 😅
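To illustrate the mix described above: litellm raises permanent and transient provider failures as sibling exception types, so any retry policy has to partition them explicitly. A sketch under the assumption that these litellm exception classes are available; the grouping is illustrative, not OpenHands' actual lists:

```python
from litellm.exceptions import (
    APIConnectionError,
    AuthenticationError,
    BadRequestError,
    ContextWindowExceededError,
    RateLimitError,
)

# Transient: the identical request may succeed after a short wait.
TRANSIENT = (RateLimitError, APIConnectionError)

# Permanent: retrying can never help (bad key, malformed request,
# prompt too long); fail fast and report to the user instead.
PERMANENT = (AuthenticationError, BadRequestError, ContextWindowExceededError)

def is_retryable(exc: Exception) -> bool:
    """Illustrative helper for deciding whether to retry a provider error."""
    return isinstance(exc, TRANSIENT)
```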

@enyst (Collaborator) commented Jan 30, 2025

A small related detail: there's a try/except due to retries in llm.py that is unnecessary even on main, and more so now. We might as well clean that out:

(Review suggestion on openhands/llm/llm.py; the thread is marked outdated and resolved.)
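For illustration, the kind of redundancy being pointed at: when a tenacity-style retry decorator is configured to re-raise the last exception, an extra try/except around the decorated call adds nothing. A minimal sketch, not the actual llm.py code:

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt

class TransientError(Exception):
    """Stand-in for a retryable provider error."""

@retry(
    retry=retry_if_exception_type(TransientError),
    stop=stop_after_attempt(3),
    reraise=True,  # after the final attempt, the original exception propagates
)
def call_llm() -> str:
    return "response"

# Redundant wrapper: tenacity already re-raises the final failure,
# so catching it here only to re-raise adds noise.
#
#   try:
#       result = call_llm()
#   except TransientError:
#       raise
#
# Equivalent and simpler:
result = call_llm()
```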
@enyst (Collaborator) commented Feb 13, 2025

Please also see a small follow-up here:

@rbren (Collaborator, author) commented Feb 14, 2025

Thanks @enyst! Any lingering issues here?

@enyst (Collaborator) left a comment

I think it would be great if @xingyaoww could take a look, because it's possible that the removed exceptions do occur in practice.

Up to you.
