Description
Describe the bug
The application enters a persistent crash loop when resuming a session where the previous execution was interrupted immediately after a tool_use event but before the corresponding tool_result could be saved.
This interruption typically occurs in two scenarios:
- Server-side interruption: Server restarts, OOM crashes, or deployment cycles occurring exactly during tool execution.
- Client-side interruption: A user refreshes the browser or closes the connection while the agent is processing a tool call. The server receives a cancellation signal and aborts the task, likely without persisting the failure/cancellation state to the database, leaving the history incomplete.
When the session resumes, ADK (via LiteLLM) attempts to send this "corrupted" history (a tool_use in the last assistant message without a following tool_result) to strict LLM providers like Anthropic or OpenAI. These providers reject the request with a BadRequestError (400), rendering the session permanently unrecoverable ("bricked") without manual database intervention.
To Reproduce
I have created a standalone reproduction script using `litellm` to simulate the API rejection that occurs inside ADK.
Prerequisites:
- Set `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`.
- Install `litellm`.
Reproduction Script:
```python
import asyncio
import os

from litellm import acompletion, BadRequestError


async def test_model(model_name: str, api_key_env: str):
    print(f"\n📡 Testing Model: {model_name} ...")
    if not os.getenv(api_key_env):
        print(f"⚠️ SKIPPING: {api_key_env} not found.")
        return

    # Broken history: the assistant called a tool, but no tool_result follows.
    broken_messages = [
        {"role": "user", "content": "What is the weather in Seoul?"},
        {
            "role": "assistant",
            "content": "I will check the weather.",
            "tool_calls": [
                {
                    "id": "tool_u_reproduce_123",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Seoul\"}"
                    }
                }
            ]
        },
        # --- MISSING TOOL RESULT DUE TO REFRESH/RESTART ---
        # The next message is a new user turn, violating the strict
        # tool_use -> tool_result sequence.
        {"role": "user", "content": "Wait, tell me about Tokyo instead."}
    ]

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }]

    try:
        await acompletion(model=model_name, messages=broken_messages, tools=tools)
        print(f"❌ {model_name}: PASSED (Resilient).")
    except BadRequestError as e:
        print(f"✅ {model_name}: FAILED (Vulnerable). Crash reproduced!\n   Error: {e}")


async def run_all_tests():
    for model, key_env in [
        ("anthropic/claude-3-5-sonnet-20240620", "ANTHROPIC_API_KEY"),
        ("openai/gpt-4o", "OPENAI_API_KEY"),
    ]:
        await test_model(model, key_env)


if __name__ == "__main__":
    asyncio.run(run_all_tests())
```

Expected behavior
The ADK framework should be resilient to history corruption caused by interruptions. When loading or preparing conversation history for LLM execution:
- Validation: Detect any assistant message containing `tool_calls` that is NOT immediately followed by the required tool result messages.
- Auto-Healing: Automatically inject a placeholder `tool_result` (e.g., `{"role": "tool", "content": "Error: Execution interrupted (server restart or page refresh).", "tool_call_id": "..."}`) into the message list to satisfy API constraints.
- Recovery: This allows the LLM to understand that the previous action failed or was interrupted, and enables the session to continue seamlessly instead of crashing.
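To make the proposal concrete, here is a minimal sketch of what such a healing pass could look like. This is an illustration, not ADK's actual API: `heal_history` is a hypothetical helper that operates on OpenAI-style message dicts and inserts a placeholder `tool` message for every `tool_calls` entry that is not answered by the immediately following tool messages:

```python
def heal_history(messages: list[dict]) -> list[dict]:
    """Return a copy of the history where every assistant tool_call
    has a matching tool result, injecting placeholders if needed."""
    healed = []
    for i, msg in enumerate(messages):
        healed.append(msg)
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            # Collect tool_call_ids answered by the tool messages that
            # immediately follow this assistant turn.
            answered = set()
            for nxt in messages[i + 1:]:
                if nxt.get("role") != "tool":
                    break
                answered.add(nxt.get("tool_call_id"))
            # Inject a placeholder result for each unanswered call.
            for call in msg["tool_calls"]:
                if call["id"] not in answered:
                    healed.append({
                        "role": "tool",
                        "tool_call_id": call["id"],
                        "content": "Error: Execution interrupted "
                                   "(server restart or page refresh).",
                    })
    return healed
```

Running the corrupted history from the reproduction script through this function yields a sequence that strict providers accept, and the injected error text lets the model explain to the user that the previous action did not complete.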
Desktop:
- OS: macOS / Linux
- Python version: 3.12
- ADK version: 1.21.0
Model Information:
- LLM Provider: Anthropic (Claude 3.5 Sonnet), OpenAI (GPT-4o)
- Interface: LiteLLM