rezabekf (Contributor) commented Dec 10, 2025

Issue number: n/a

Summary

Changes

Add DeepEval-based agent evaluation tests for the Analysis and Report agents.

  • evals/test_analysis.py - Full workflow test with mocked AWS tools
  • evals/test_report.py - Report generation test with mocked storage
  • evals/conftest.py - Shared fixtures and tool factories (sketched below)
  • evals/mock_data.py - Mock AWS API responses and analysis results
  • evals/helpers.py - Standardized result output
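
A minimal sketch of what the conftest tool factory could look like (all fixture names, tool names, and payloads below are illustrative assumptions, not the PR's actual code):

```python
# evals/conftest.py-style sketch -- names and payloads are illustrative.
import pytest

# Stand-in for what evals/mock_data.py might hold: canned, deterministic
# responses keyed by tool name, so eval runs are reproducible.
MOCK_RESPONSES = {
    "use_aws": {"Functions": [{"FunctionName": "img-resize", "MemorySize": 1024}]},
    "storage": {"status": "ok"},
    "journal": {"status": "recorded"},
}


def make_mock_tool(name: str):
    """Tool factory: returns a callable serving a canned response instead of calling AWS."""
    def tool(**kwargs):
        return MOCK_RESPONSES[name]
    tool.__name__ = name
    return tool


@pytest.fixture
def mocked_tools():
    """Deterministic tool set handed to the agent under test."""
    return [make_mock_tool(n) for n in ("use_aws", "storage", "journal")]
```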

Uses ToolCorrectnessMetric and TaskCompletionMetric, with Llama 3.3 70B as the judge model.
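
For context, a minimal sketch of that metric wiring (the placeholder values stand in for what the test harness captures at run time, and the Llama 3.3 70B judge wrapper is omitted; this is not the PR's actual test code):

```python
from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Placeholders for values the harness would capture during an agent run.
agent_response = "Identified 3 oversized Lambda functions; est. savings $120/mo."
observed_tools = ["journal", "use_aws", "journal", "storage"]

test_case = LLMTestCase(
    input="Analyze Lambda functions and identify cost optimization opportunities.",
    actual_output=agent_response,
    tools_called=[ToolCall(name=n) for n in observed_tools],
    expected_tools=[ToolCall(name=n) for n in ("use_aws", "journal", "storage")],
)

evaluate(
    test_cases=[test_case],
    metrics=[
        # Deterministic check: were the expected tools called?
        ToolCorrectnessMetric(),
        # LLM-as-judge check; the tests pass a Llama 3.3 70B judge via the
        # `model` argument (a DeepEvalBaseLLM wrapper, omitted here).
        TaskCompletionMetric(),
    ],
)
```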

Run: make eval AGENT=analysis or make eval AGENT=report

User experience

Before: No evals.

After: Run make eval AGENT=<name> to validate agent workflows with deterministic mock data and LLM-judged metrics.

Test outputs:

test_analysis.py

----------------------------------------
TOOL CORRECTNESS
----------------------------------------
Score: 0.75
Reason: [
         Tool Calling Reason: All expected tools ['journal', 'use_aws', 'journal', 'journal', 'current_time_unix_utc', 'journal', 'journal', 'journal', 'journal', 'journal', 'journal', 'journal', 'storage'] were called (order not considered).
         Tool Selection Reason: The agent selected tools that were mostly appropriate for the task, such as 'use_aws' for calling AWS APIs and 'calculator' for cost calculations. However, the frequent use of 'journal' for tracking workflow phases could be considered unnecessary or redundant in some cases, and 'storage' was only used once at the end for writing analysis results, which might indicate a minor omission in utilizing it for storing intermediate results.
]


----------------------------------------
TASK COMPLETION
----------------------------------------
Score: 0.9
Reason: The system analyzed Lambda functions, identified cost optimization opportunities, and provided specific recommendations with savings estimates, but it is unclear if all CloudWatch metrics were collected and if the analysis results were saved as expected.

----------------------------------------
TOKEN USAGE
----------------------------------------
  Input Tokens:  527139
  Output Tokens: 7737
  Total Tokens:  534876

============================================================
PASSED
Running teardown with pytest sessionfinish...

test_report.py

----------------------------------------
TOOL CORRECTNESS
----------------------------------------
Score: 0.75
Reason: [
         Tool Calling Reason: All expected tools ['storage', 'journal', 'journal', 'journal', 'storage', 'storage', 'journal'] were called (order not considered).
         Tool Selection Reason: The agent selected the 'journal' tool to track workflow phases and the 'storage' tool to read and write files, which are appropriate for generating a cost optimization report. However, the agent used the 'journal' tool to track multiple phases, which might be unnecessary, and there is no clear indication that the 'storage' tool was used to read relevant data for the report, only to write output files. The tool selection is mostly aligned with the task, but there are minor unnecessary uses.
]


----------------------------------------
TASK COMPLETION
----------------------------------------
Score: 0.9
Reason: The system generated the cost optimization report, identified potential savings, and saved the report and evidence files to S3, but the task specified a report with an executive summary and recommendations, which may not be fully addressed by the generated files.

----------------------------------------
TOKEN USAGE
----------------------------------------
  Input Tokens:  51499
  Output Tokens: 6075
  Total Tokens:  57574

============================================================
PASSED
Running teardown with pytest sessionfinish...
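
The standardized sections above (score, reason, token usage) come from evals/helpers.py; a sketch of what such a helper might look like (function names here are assumptions, not the actual API):

```python
# evals/helpers.py-style sketch -- function names are illustrative.
def print_metric(title: str, score: float, reason: str) -> None:
    """Render one metric block in the standardized eval output format."""
    print("-" * 40)
    print(title.upper())
    print("-" * 40)
    print(f"Score: {score}")
    print(f"Reason: {reason}\n")


def print_token_usage(input_tokens: int, output_tokens: int) -> None:
    """Render the token-usage summary that closes each eval run."""
    print("-" * 40)
    print("TOKEN USAGE")
    print("-" * 40)
    print(f"  Input Tokens:  {input_tokens}")
    print(f"  Output Tokens: {output_tokens}")
    print(f"  Total Tokens:  {input_tokens + output_tokens}")
```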

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions bot commented Dec 10, 2025

Coverage report

This PR does not seem to contain any modification to coverable code.

rezabekf changed the title from "feat: add eval tests for individual phases" to "feat: add eval tests for analysis and report agents" (Dec 10, 2025)
rezabekf marked this pull request as ready for review (Dec 16, 2025)
rezabekf requested a review from a team as a code owner (Dec 16, 2025)
rezabekf self-assigned this (Dec 16, 2025)
rezabekf requested a review from hjgraca (Jan 27, 2026)
rezabekf merged commit 1cf28f1 into main (Feb 2, 2026)
8 checks passed