Description
Describe the bug
The application enters a persistent crash loop when resuming a session where the previous execution was interrupted immediately after a tool_use event but before the corresponding tool_result could be saved.
This interruption typically occurs in two scenarios:
- Server-side interruption: Server restarts, OOM crashes, or deployment cycles occurring exactly during tool execution.
- Client-side interruption: A user refreshes the browser or closes the connection while the agent is processing a tool call. The server receives a cancellation signal and aborts the task, likely without persisting the failure/cancellation state to the database, leaving the history incomplete.
When the session resumes, ADK (via LiteLLM) attempts to send this "corrupted" history (a tool_use in the last assistant message without a following tool_result) to strict LLM providers like Anthropic or OpenAI. These providers reject the request with a BadRequestError (400), rendering the session permanently unrecoverable ("bricked") without manual database intervention.
To Reproduce
I have created a standalone reproduction script using `litellm` to simulate the API rejection that occurs inside ADK.
Prerequisites:
- Set `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`.
- Install `litellm`.
Reproduction Script:
```python
import asyncio
import os

from litellm import acompletion, BadRequestError


async def test_model(model_name: str, api_key_env: str):
    print(f"\n📡 Testing Model: {model_name} ...")
    if not os.getenv(api_key_env):
        print(f"⚠️ SKIPPING: {api_key_env} not found.")
        return

    # Broken history: the assistant called a tool, but no tool_result follows.
    broken_messages = [
        {"role": "user", "content": "What is the weather in Seoul?"},
        {
            "role": "assistant",
            "content": "I will check the weather.",
            "tool_calls": [
                {
                    "id": "tool_u_reproduce_123",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Seoul\"}"
                    }
                }
            ]
        },
        # --- MISSING TOOL RESULT DUE TO REFRESH/RESTART ---
        # The next message is a new user turn, violating the strict
        # tool_use -> tool_result sequence.
        {"role": "user", "content": "Wait, tell me about Tokyo instead."}
    ]

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }]

    try:
        await acompletion(model=model_name, messages=broken_messages, tools=tools)
        print(f"❌ {model_name}: PASSED (Resilient).")
    except BadRequestError as e:
        print(f"✅ {model_name}: FAILED (Vulnerable). Crash reproduced!\n   Error: {e}")


async def run_all_tests():
    for model, key_env in [
        ("anthropic/claude-3-5-sonnet-20240620", "ANTHROPIC_API_KEY"),
        ("openai/gpt-4o", "OPENAI_API_KEY"),
    ]:
        await test_model(model, key_env)


if __name__ == "__main__":
    asyncio.run(run_all_tests())
```

Expected behavior
The ADK framework should be resilient to history corruption caused by interruptions. When loading or preparing conversation history for LLM execution:
- Validation: Detect any assistant message containing `tool_calls` that is NOT immediately followed by the required tool result messages.
- Auto-Healing: Automatically inject a placeholder `tool_result` (e.g., `{"role": "tool", "content": "Error: Execution interrupted (server restart or page refresh).", "tool_call_id": "..."}`) into the message list to satisfy API constraints.
- Recovery: This allows the LLM to understand that the previous action failed or was interrupted, and enables the session to continue seamlessly instead of crashing.
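To make the proposal concrete, here is a minimal sketch of what such a healing pass could look like. This is an illustration, not ADK's actual API: `heal_history` is a hypothetical helper that operates on OpenAI-style message dicts and inserts a placeholder `tool` message for every `tool_calls` entry that is not answered by the immediately following tool messages:

```python
def heal_history(messages: list[dict]) -> list[dict]:
    """Return a copy of the history where every assistant tool_call
    has a matching tool result, injecting placeholders if needed."""
    healed = []
    for i, msg in enumerate(messages):
        healed.append(msg)
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            # Collect tool_call_ids answered by the tool messages that
            # immediately follow this assistant turn.
            answered = set()
            for nxt in messages[i + 1:]:
                if nxt.get("role") != "tool":
                    break
                answered.add(nxt.get("tool_call_id"))
            # Inject a placeholder result for each unanswered call.
            for call in msg["tool_calls"]:
                if call["id"] not in answered:
                    healed.append({
                        "role": "tool",
                        "tool_call_id": call["id"],
                        "content": "Error: Execution interrupted "
                                   "(server restart or page refresh).",
                    })
    return healed
```

Running the corrupted history from the reproduction script through this function yields a sequence that strict providers accept, and the injected error text lets the model explain to the user that the previous action did not complete.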
Desktop:
- OS: macOS / Linux
- Python version: 3.12
- ADK version: 1.21.0
Model Information:
- LLM Provider: Anthropic (Claude 3.5 Sonnet), OpenAI (GPT-4o)
- Interface: LiteLLM