## Problem

Models from different LLM providers behaved inconsistently during agent execution:
- Some showed plans, others didn't
- Approval flow varied by model
- Output formatting differed
## Observed Inconsistencies

| Model | Plan Shown | Waited for Approval | Output Format |
|---|---|---|---|
| GPT-5-mini | Yes (multiple) | No | Plain text |
| Claude Sonnet 4.5 | Yes (structured) | No | Markdown |
| Gemini | Yes | No | Plain text |
## Expected Behavior
Regardless of which LLM provider is used:
- Same plan format displayed
- Same approval flow enforced
- Same tool calling interface
- Consistent output formatting
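One way to make these guarantees concrete is a shared response shape that the agent layer emits for every provider. This is a minimal sketch; the `NormalizedPlan` and `NormalizedResponse` names and fields are hypothetical, not part of the current codebase:

```typescript
// Hypothetical provider-agnostic shapes the agent layer could emit.
interface NormalizedPlan {
  steps: string[];           // ordered plan steps, same structure for every model
  requiresApproval: boolean; // always true until the user approves
}

interface NormalizedResponse {
  plan: NormalizedPlan;
  output: string; // always rendered the same way downstream
}

// Coerce a raw provider reply into the shared shape.
const normalize = (rawSteps: string[], rawOutput: string): NormalizedResponse => ({
  plan: { steps: rawSteps, requiresApproval: true },
  output: rawOutput.trim(),
});
```

With a shape like this, the table above collapses to a single row: every provider's reply passes through `normalize` before the UI sees it.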
## Implementation Suggestions
The agent layer should normalize behavior:
- Wrap model responses in consistent format
- Enforce approval gates at application level (not model level)
- Standardize output through formatters
```typescript
// Application-level enforcement, not model-dependent
const executeWithApproval = async (plan: Plan): Promise<void> => {
  const approved = await showPlanAndWaitForApproval(plan);
  if (!approved) return;
  // Execute...
};
```
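The "standardize output through formatters" suggestion can be sketched the same way. The `formatPlan` helper below is hypothetical, assuming plan steps arrive as plain strings and the UI renders Markdown:

```typescript
// Hypothetical formatter: every provider's plan is rendered identically,
// regardless of whether the model emitted plain text or Markdown.
const formatPlan = (steps: string[]): string =>
  ["## Plan", ...steps.map((s, i) => `${i + 1}. ${s}`)].join("\n");
```

Because the formatter owns the layout, a model switching from plain text to Markdown (or back) no longer changes what the user sees.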
## Priority
🟡 Medium - Affects user experience and predictability
_Generated from model evaluation test_