Description
Problem Statement
The OpenAIResponsesModel currently resends the full conversation history on every turn, even though the Responses API natively
supports server-side conversation state via previous_response_id. The docstring at the top of openai_responses.py acknowledges
this:
"The Responses API can maintain conversation state server-side through 'previous_response_id'... Note: This implementation currently only implements the stateless approach."
For agentic applications with multi-turn conversations (10+ turns with tool calls), this means:
- Token costs scale linearly with conversation length — turn 10 resends all previous turns
- Latency increases as the input payload grows
- Context window pressure — long conversations hit limits faster, not because of new content, but because of repeated history
This affects both OpenAI and xAI Responses API users, since both support previous_response_id.
Proposed Solution
- After each successful response in stream(), capture and store response.id from the completed response event.
- On subsequent calls, if a previous_response_id is available, pass it in the request instead of the full message history. Only send the new user message(s) and tool results in input.
- Fall back to the current stateless approach (full history) if:
- No previous response ID exists (first turn)
- The stored response has expired (30-day server retention)
- The API returns an error indicating the previous response is invalid
- Expose a configuration option to enable/disable this behavior (e.g., stateful=True in config), defaulting to disabled for backward compatibility.
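A minimal sketch of the conditional request construction described above. The helper name `format_request`, the `stateful` flag, and the `new` marker on messages are illustrative assumptions, not the actual strands-agents implementation (the real method is `_format_request` on the model class):

```python
# Hedged sketch: conditionally pass previous_response_id instead of
# the full message history. Field names here are assumptions.

def format_request(messages, previous_response_id=None, stateful=False):
    """Build a Responses API request body.

    When `stateful` is enabled and a previous response ID is known,
    send only the messages produced since the last turn and let the
    server reconstruct the rest of the conversation.
    """
    if stateful and previous_response_id:
        # Server already holds earlier turns: send only the new input.
        new_messages = [m for m in messages if m.get("new", False)]
        return {
            "previous_response_id": previous_response_id,
            "input": new_messages,
        }
    # Stateless fallback: resend the full history (current behavior).
    return {"input": messages}
```

The fallback cases listed above (no stored ID, expired ID, API error) all reduce to calling this with `previous_response_id=None`, so the stateless path stays the single source of truth.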
The key changes would be in:
- stream() — capture response.id from response.completed event
- _format_request() — conditionally pass previous_response_id instead of full history
- State management — store the last response ID (could be returned as metadata alongside usage stats)
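The state-capture step can be sketched as a small tracker fed from the streaming event loop. The `StateTracker` class is hypothetical; the event shape mirrors the Responses API's `response.completed` streaming event, which carries the final response object including its ID:

```python
# Hedged sketch: capture response.id from the streaming event that
# signals completion. StateTracker is an illustrative name, not part
# of the strands-agents codebase.

class StateTracker:
    def __init__(self):
        self.last_response_id = None

    def handle_event(self, event: dict):
        # Only the completed event carries the finalized response,
        # so intermediate events leave the stored ID untouched.
        if event.get("type") == "response.completed":
            self.last_response_id = event["response"]["id"]
```

Returning `last_response_id` as metadata alongside usage stats would let callers persist it across turns without the model object holding session state itself.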
Use Case
We run a property management AI agent on Bedrock AgentCore using Strands. Each session is a multi-turn conversation where the PM
asks for analysis, creates action plans, drafts emails, and executes tasks. A typical session is 10-20 turns with heavy tool
use (each turn may involve 2-5 tool calls).
Today, turn 15 of a conversation resends all 14 previous turns plus their tool call/result pairs. With previous_response_id,
turn 15 would only send the new user message — the server already has the rest.
This would help with:
- Cost reduction — estimated 40-60% input token savings for typical multi-turn sessions
- Faster responses — less data to transmit and process per turn
- Longer conversations — more room in the context window for actual content instead of repeated history
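The savings estimate can be sanity-checked with back-of-envelope arithmetic (illustrative numbers, not measurements; actual savings depend on how much of each turn is tool-call payload):

```python
# Hedged sketch: cumulative input tokens over a session, assuming each
# turn contributes roughly the same amount of new input content.

def stateless_input_tokens(turns, tokens_per_turn):
    # Turn k resends all k turns so far: 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2

def stateful_input_tokens(turns, tokens_per_turn):
    # Each turn sends only its own new content.
    return tokens_per_turn * turns

stateless = stateless_input_tokens(15, 500)  # 60,000 tokens
stateful = stateful_input_tokens(15, 500)    # 7,500 tokens
```

With these toy numbers the stateful path sends roughly an eighth of the input tokens over a 15-turn session; real sessions with uneven turn sizes will land somewhere lower, consistent with the 40-60% estimate above.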
Alternative Solutions
Application-level caching or summarization of the history: this can reduce tokens but adds complexity, and each turn still resends a (compressed) payload.
Additional Context
No response