Replies: 4 comments 1 reply
Microsoft Autogen: they have made major changes to the codebase, so we must validate that it still works as expected and add tests to the latest testing suite.
The crewAI codebase contains the agentops integration, which should be tested completely, either in their codebase or in ours, by installing the latest version and testing against it.
Integration tests currently test the LLM provider clients. They should instead test the patched providers in
Test coverage for agent systems is genuinely hard because the "correct" behavior is often probabilistic, context-dependent, and hard to specify in a traditional unit test. A few testing strategies that work at different layers:

Deterministic input/output tests: for agent behaviors that should be deterministic (format compliance, schema validation, tool call structure), traditional assertion-based tests work fine. Use temperature=0 or seed the sampling for reproducibility.

Distribution tests: instead of "does this exact output match", test "does the output distribution satisfy this property?" Over N runs: what fraction produce valid JSON? What's the 90th percentile token count? What's the success rate on a defined task set? These properties can be asserted with tolerances.

State machine coverage: enumerate all possible agent state transitions (pending → running → delegated → budget-paused → completed/failed) and ensure you have test cases that exercise each edge. Budget exhaustion and delegation failures are common coverage gaps.

Contract tests for tool calls: for each tool the agent can call, verify that (a) the agent correctly formats the call, (b) the agent handles valid responses correctly, and (c) the agent handles error responses gracefully without looping.

Economic invariant tests: assert that total child agent budgets ≤ parent budget; assert that cost attribution sums correctly through the delegation chain; assert that the budget-exhausted state is reachable and handled correctly.

Replay tests: record a successful agent execution trace, then replay it and assert that the cost, steps taken, and output match the baseline within tolerance. Good for regression detection.

We test KinthAI's agent protocol layer with a mix of these approaches. https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons covers the protocol design, which informs what invariants are worth testing.
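To make the distribution-test idea concrete, here is a minimal sketch rather than anything taken from an existing suite. `fake_agent` is a hypothetical stand-in for a real agent call (it emits malformed JSON about 5% of the time), the whitespace split is only a crude proxy for a token count, and the thresholds are illustrative assumptions.

```python
import json
import random
import statistics


def fake_agent(rng: random.Random) -> str:
    """Hypothetical stand-in for an agent call; roughly 5% of outputs are malformed."""
    if rng.random() < 0.05:
        return '{"answer": '  # truncated, invalid JSON
    n_words = rng.randint(5, 40)
    return json.dumps({"answer": " ".join(["tok"] * n_words)})


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def run_distribution_check(n_runs: int = 200, seed: int = 0) -> dict:
    """Sample the agent n_runs times and summarise properties of the outputs."""
    rng = random.Random(seed)  # seeded so the test itself is reproducible
    outputs = [fake_agent(rng) for _ in range(n_runs)]
    valid_fraction = sum(is_valid_json(o) for o in outputs) / n_runs
    token_counts = [len(o.split()) for o in outputs]  # crude whitespace token proxy
    # quantiles(n=10) returns nine cut points; the last one is the 90th percentile.
    p90_tokens = statistics.quantiles(token_counts, n=10)[-1]
    return {"valid_fraction": valid_fraction, "p90_tokens": p90_tokens}


def test_output_distribution():
    stats = run_distribution_check()
    # Assert on properties with tolerances, not on exact outputs.
    assert stats["valid_fraction"] >= 0.85
    assert stats["p90_tokens"] <= 45
```

With a real model behind `fake_agent`, the run count and tolerances would need tuning so that ordinary sampling noise does not make the test flaky.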
What's causing the most testing pain: non-determinism, the cost of running tests, or specifying the expected behavior?
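Going back to the state machine coverage and economic invariant strategies, one way they might be sketched is shown here. The transition graph is an assumption (the comment names the states but not the edges), and `AgentNode`, `uncovered_edges`, and `check_budget_invariants` are hypothetical names, not part of any real library. Budgets are integers (for example cents) to avoid floating-point comparison issues.

```python
from dataclasses import dataclass, field

# Assumed transition graph: only the state names come from the discussion.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"delegated", "budget-paused", "completed", "failed"},
    "delegated": {"running", "failed"},
    "budget-paused": {"running", "failed"},
    "completed": set(),
    "failed": set(),
}


def all_edges(transitions):
    return {(src, dst) for src, dsts in transitions.items() for dst in dsts}


def uncovered_edges(traces, transitions=TRANSITIONS):
    """Return the edges of the graph that no recorded trace exercises."""
    seen = set()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            if dst not in transitions.get(src, set()):
                raise ValueError(f"illegal transition {src} -> {dst}")
            seen.add((src, dst))
    return all_edges(transitions) - seen


@dataclass
class AgentNode:
    """Hypothetical node in a delegation tree."""
    name: str
    budget: int
    spent: int = 0  # cost incurred by this agent itself, excluding children
    children: list = field(default_factory=list)


def total_cost(node: AgentNode) -> int:
    """Cost attributed to a node: its own spend plus everything below it."""
    return node.spent + sum(total_cost(child) for child in node.children)


def check_budget_invariants(node: AgentNode) -> None:
    """Assert the budget invariants at every level of the delegation tree."""
    child_budget = sum(child.budget for child in node.children)
    assert child_budget <= node.budget, f"{node.name}: children over-allocated"
    assert total_cost(node) <= node.budget, f"{node.name}: overspent"
    for child in node.children:
        check_budget_invariants(child)
```

Feeding `uncovered_edges` the state traces recorded by the existing test cases gives a list of transitions that still need a test; under the assumed graph, edges such as budget-paused → running are the kind that tend to be missing.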
We will use this discussion to report edge cases that are not yet covered in the testing suite but should be implemented.