Persistent crash loop caused by missing tool_result in conversation history after interrupted execution #3971

@donggyun112

Description

Describe the bug
The application enters a persistent crash loop when resuming a session where the previous execution was interrupted immediately after a tool_use event but before the corresponding tool_result could be saved.

This interruption typically occurs in two scenarios:

  1. Server-side interruption: Server restarts, OOM crashes, or deployment cycles occurring exactly during tool execution.
  2. Client-side interruption: A user refreshes the browser or closes the connection while the agent is processing a tool call. The server receives a cancellation signal and aborts the task, likely without persisting the failure/cancellation state to the database, leaving the history incomplete.

When the session resumes, ADK (via LiteLLM) sends this corrupted history (a tool_use in the last assistant message without a following tool_result) to strict LLM providers such as Anthropic or OpenAI. These providers reject the request with a BadRequestError (400), leaving the session permanently unrecoverable ("bricked") without manual database intervention.

To Reproduce
I have created a standalone reproduction script using litellm to simulate the API rejection that occurs inside ADK.

Prerequisites:

  • Set ANTHROPIC_API_KEY or OPENAI_API_KEY.
  • Install litellm.

Reproduction Script:

import asyncio
import os
from litellm import acompletion, BadRequestError

async def test_model(model_name: str, api_key_env: str):
    print(f"\n📡 Testing Model: {model_name} ...")
    if not os.getenv(api_key_env):
        print(f"⚠️  SKIPPING: {api_key_env} not found.")
        return

    # Broken history: assistant called a tool, but no tool_result follows.
    broken_messages = [
        {"role": "user", "content": "What is the weather in Seoul?"},
        {
            "role": "assistant",
            "content": "I will check the weather.",
            "tool_calls": [
                {
                    "id": "tool_u_reproduce_123", 
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Seoul\"}"
                    }
                }
            ]
        },
        # --- MISSING TOOL RESULT DUE TO REFRESH/RESTART ---
        # The next message is a new user turn, violating the strict tool_use -> tool_result sequence.
        {"role": "user", "content": "Wait, tell me about Tokyo instead."}
    ]

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }]

    try:
        await acompletion(model=model_name, messages=broken_messages, tools=tools)
        print(f"❌ {model_name}: PASSED (Resilient).")
    except BadRequestError as e:
        print(f"✅ {model_name}: FAILED (Vulnerable). Crash Reproduced!\n   Error: {e}")

async def run_all_tests():
    for model, key_env in [("anthropic/claude-3-5-sonnet-20240620", "ANTHROPIC_API_KEY"), ("openai/gpt-4o", "OPENAI_API_KEY")]:
        await test_model(model, key_env)

if __name__ == "__main__":
    asyncio.run(run_all_tests())

Expected behavior
The ADK framework should be resilient to history corruption caused by interruptions. When loading or preparing conversation history for LLM execution:

  1. Validation: Detect any assistant message containing tool_calls that is NOT immediately followed by the required tool result messages.
  2. Auto-Healing: Automatically inject a placeholder tool_result (e.g., {"role": "tool", "content": "Error: Execution interrupted (server restart or page refresh).", "tool_call_id": "..."}) into the message list to satisfy API constraints.
  3. Recovery: This allows the LLM to understand that the previous action failed/interrupted and enables the session to continue seamlessly instead of crashing.
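
The validation and auto-healing steps above could be sketched as follows. Note that heal_history is a hypothetical helper, not an existing ADK API; it operates on the same LiteLLM/OpenAI message format used in the reproduction script and injects a placeholder tool result wherever an assistant's tool_calls are left unanswered:

```python
from typing import Any

PLACEHOLDER = "Error: Execution interrupted (server restart or page refresh)."

def heal_history(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of the history where every assistant tool_call
    is followed by a tool result, injecting placeholders if needed."""
    healed: list[dict[str, Any]] = []
    pending: list[str] = []  # tool_call_ids still awaiting a result

    for msg in messages:
        if msg.get("role") == "tool":
            # A real result arrived for one of the pending calls.
            if msg.get("tool_call_id") in pending:
                pending.remove(msg["tool_call_id"])
        elif pending:
            # A non-tool message follows unanswered tool_calls:
            # inject placeholder results first to satisfy the API.
            for call_id in pending:
                healed.append({
                    "role": "tool",
                    "tool_call_id": call_id,
                    "content": PLACEHOLDER,
                })
            pending = []
        healed.append(msg)
        if msg.get("role") == "assistant":
            pending = [c["id"] for c in msg.get("tool_calls") or []]

    # The history may also end immediately after a tool_use.
    for call_id in pending:
        healed.append({
            "role": "tool",
            "tool_call_id": call_id,
            "content": PLACEHOLDER,
        })
    return healed
```

Running this over the broken_messages list from the reproduction script yields a four-message history with a placeholder tool message inserted between the assistant turn and the second user turn, which both Anthropic and OpenAI accept.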

Desktop:

  • OS: macOS / Linux
  • Python version: 3.12
  • ADK version: 1.21.0

Model Information:

  • LLM Provider: Anthropic (Claude 3.5 Sonnet), OpenAI (GPT-4o)
  • Interface: LiteLLM

Metadata

Labels

  • answered — [Status] This issue has been answered by the maintainer
  • core — [Component] This issue is related to the core interface and implementation
