Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save complete trajectory in presence of history truncation #6751

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

li-boxuan
Copy link
Collaborator

@li-boxuan li-boxuan commented Feb 17, 2025

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

#4977 introduced truncation feature to handle long context errors by cutting the history. Since trajectory saving feature depends on "controller.state.history", this causes saved trajectories to only contain partial history.

Direct cause is that we are not closing the controller properly in headless mode. If we had done so, the history would be recovered properly and thus trajectory would have been complete. This PR fixes this issue.

Note that state.history is being used in many places (including but not limited to a few benchmark evaluation harness), which we might also want to evaluate if they actually need truncated history or full history. Maybe we need better names for truncated/full history.

Also, note that stuck detector also depends on state.history, which could potentially lead to malfunction of stuck detector if the loop appears around the place where we do the truncate.

Anyways, this PR fixes the partial trajectory issue and includes a unit test, which could be used as a testbed for any future renaming/refactoring.~~

Kudos to @adityasoni9998 for finding this issue after analyzing a few evaluation trajectories.


Link of any specific issues this addresses

This also includes a small fix for #6749 in openhands/server/routes/trajectory.py


To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:9962580-nikolaik   --name openhands-app-9962580   docker.all-hands.dev/all-hands-ai/openhands:9962580

@enyst
Copy link
Collaborator

enyst commented Feb 17, 2025

#4977 introduced truncation feature to handle long context errors by cutting the history. Since trajectory saving feature depends on "controller.state.history", this causes saved trajectories to only contain partial history.

There is something I don't understand about this:

# the final state.history will be used by external scripts like evals, tests, etc.

close makes sure that history has all events, so that it's as evals expect it to be: full, as far as I can see, with agent-specific events. It's supposed to happen before the controller state is read in evals and other external scripts. Isn't that happening anymore?

@enyst
Copy link
Collaborator

enyst commented Feb 17, 2025

Cc: @csmith49

To elaborate a bit:

  • when delegates are used, state.history contains only the history of the active agent
  • similarly, in the truncation case, state.history contains only events starting from an id
  • but evals or other external callers of run_controller need full history
  • afaict, controller.close was supposed to work for both, putting Humpty back together
  • when run_controller returns the final state, close must have happened
  • including for exceptions.

TBH this is fragile code, it doesn't feel great to rely on close being called. Still, it's close, it's bit surprising that the controller wouldn't be closed suddenly? I didn't repro the issue. 🤔

The main reason for state.history being the same variable has been that it's so common in agents frameworks/projects/experiments. Everyone seems to expect it, so it didn't seem obvious how else to call things. (When history was a class, there was a history.get_events with flags for all events or current agent.)

Re: stuck
I think it's fine: stuck works with the latest 6 events in the halved history = the latest 6 events in the full history, unless this half is less than 6 events. If it is less, which should be rare if ever, it's not really stuck... 🤔 (if the LLM gets less than 6, plus the task, seems like a very edge case, but more importantly, I guess it should be able to get out by itself...)

@csmith49
Copy link
Collaborator

when delegates are used, state.history contains only the history of the active agent

I'm still working through the control-flow of controller.close. Does this mean that if delegate A truncates, then we switch to delegate B for a bit, when we switch back to delegate A we end up with the non-truncated history?

@enyst
Copy link
Collaborator

enyst commented Feb 17, 2025

when delegates are used, state.history contains only the history of the active agent

I'm still working through the control-flow of controller.close. Does this mean that if delegate A truncates, then we switch to delegate B for a bit, when we switch back to delegate A we end up with the non-truncated history?

🤔 That's a good question! 😅

...Shouldn't change anything, because the B controller closes, and the B state.history may be there, but A controller doesn't read it. It doesn't seem to care about B's state.history:

obs = AgentDelegateObservation(outputs=delegate_outputs, content=content)

A's controller has A's agent state instance, which is different, and A controller.close didn't run yet... if that made sense

@li-boxuan
Copy link
Collaborator Author

li-boxuan commented Feb 17, 2025

#4977 introduced truncation feature to handle long context errors by cutting the history. Since trajectory saving feature depends on "controller.state.history", this causes saved trajectories to only contain partial history.

There is something I don't understand about this:

# the final state.history will be used by external scripts like evals, tests, etc.

close makes sure that history has all events, so that it's as evals expect it to be: full, as far as I can see, with agent-specific events. It's supposed to happen before the controller state is read in evals and other external scripts. Isn't that happening anymore?

Ah interesting... so the get_history() call is supposed to work? But it isn't... (there's no delegation involved in my experiment)

FYI my commit was aae611a, which is built on top of e487008

image

Everything between id 0 and 77 are truncated/lost in the final history.

@li-boxuan
Copy link
Collaborator Author

Oh wait I see the problem! controller.close() is not invoked in main.py, so the headless mode is not closing the controller properly

@li-boxuan li-boxuan force-pushed the boxuanli/traj-save-after-truncation branch from 25739af to e07c89b Compare February 17, 2025 18:36
@li-boxuan
Copy link
Collaborator Author

TBH this is fragile code, it doesn't feel great to rely on close being called.

Yeah I agree, especially fragile if we want to export/save trajectory before closing the controller, or, say, in the middle of a conversation.

The main reason for state.history being the same variable has been that it's so common in agents frameworks/projects/experiments. Everyone seems to expect it, so it didn't seem obvious how else to call things. (When history was a class, there was a history.get_events with flags for all events or current agent.)

Yeah understood, it's hard to deal with legacy esp without enough test coverage

@@ -193,6 +193,8 @@ def on_event(event: Event):
# NOTE: the saved state does not include delegates events
end_state.save_to_session(event_stream.sid, event_stream.file_store)

await controller.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are entirely correct

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be messing up with the event loop... causing tests failure. Will investigate later.

@enyst
Copy link
Collaborator

enyst commented Feb 17, 2025

I was just coming back to this PR after I saw nothing calls it - the agent_session does (but then that's UI only). Some recent-ish refactoring must have broken this. Weird though.

Thank you for the investigation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants