rezabekf (Contributor) commented Dec 10, 2025

Issue number: n/a

Summary

Changes

Add DeepEval-based agent evaluation tests for the Analysis and Report agents.

  • evals/test_analysis.py - Full workflow test with mocked AWS tools
  • evals/test_report.py - Report generation test with mocked storage
  • evals/conftest.py - Shared fixtures and tool factories (sketched below)
  • evals/mock_data.py - Mock AWS API responses and analysis results
  • evals/helpers.py - Standardized result output
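
A minimal sketch of what the conftest tool factory could look like (all fixture names, tool names, and payloads below are illustrative assumptions, not the PR's actual code):

```python
# evals/conftest.py-style sketch -- names and payloads are illustrative.
import pytest

# Stand-in for what evals/mock_data.py might hold: canned, deterministic
# responses keyed by tool name, so eval runs are reproducible.
MOCK_RESPONSES = {
    "use_aws": {"Functions": [{"FunctionName": "img-resize", "MemorySize": 1024}]},
    "storage": {"status": "ok"},
    "journal": {"status": "recorded"},
}


def make_mock_tool(name: str):
    """Tool factory: returns a callable serving a canned response instead of calling AWS."""
    def tool(**kwargs):
        return MOCK_RESPONSES[name]
    tool.__name__ = name
    return tool


@pytest.fixture
def mocked_tools():
    """Deterministic tool set handed to the agent under test."""
    return [make_mock_tool(n) for n in ("use_aws", "storage", "journal")]
```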

Uses ToolCorrectnessMetric and TaskCompletionMetric, with Llama 3.3 70B as the judge model.
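
For context, a minimal sketch of that metric wiring (the placeholder values stand in for what the test harness captures at run time, and the Llama 3.3 70B judge wrapper is omitted; this is not the PR's actual test code):

```python
from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Placeholders for values the harness would capture during an agent run.
agent_response = "Identified 3 oversized Lambda functions; est. savings $120/mo."
observed_tools = ["journal", "use_aws", "journal", "storage"]

test_case = LLMTestCase(
    input="Analyze Lambda functions and identify cost optimization opportunities.",
    actual_output=agent_response,
    tools_called=[ToolCall(name=n) for n in observed_tools],
    expected_tools=[ToolCall(name=n) for n in ("use_aws", "journal", "storage")],
)

evaluate(
    test_cases=[test_case],
    metrics=[
        # Deterministic check: were the expected tools called?
        ToolCorrectnessMetric(),
        # LLM-as-judge check; the tests pass a Llama 3.3 70B judge via the
        # `model` argument (a DeepEvalBaseLLM wrapper, omitted here).
        TaskCompletionMetric(),
    ],
)
```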

Run: make eval AGENT=analysis or make eval AGENT=report

User experience

Before: No evals.

After: Run make eval AGENT=<name> to validate agent workflows with deterministic mock data and LLM-judged metrics.

Test outputs:

test_analysis.py

----------------------------------------
TOOL CORRECTNESS
----------------------------------------
Score: 0.75
Reason: [
         Tool Calling Reason: All expected tools ['journal', 'use_aws', 'journal', 'journal', 'current_time_unix_utc', 'journal', 'journal', 'journal', 'journal', 'journal', 'journal', 'journal', 'storage'] were called (order not considered).
         Tool Selection Reason: The agent selected tools that were mostly appropriate for the task, such as 'use_aws' for calling AWS APIs and 'calculator' for cost calculations. However, the frequent use of 'journal' for tracking workflow phases could be considered unnecessary or redundant in some cases, and 'storage' was only used once at the end for writing analysis results, which might indicate a minor omission in utilizing it for storing intermediate results.
]


----------------------------------------
TASK COMPLETION
----------------------------------------
Score: 0.9
Reason: The system analyzed Lambda functions, identified cost optimization opportunities, and provided specific recommendations with savings estimates, but it is unclear if all CloudWatch metrics were collected and if the analysis results were saved as expected.

----------------------------------------
TOKEN USAGE
----------------------------------------
  Input Tokens:  527139
  Output Tokens: 7737
  Total Tokens:  534876

============================================================
PASSED
Running teardown with pytest sessionfinish...

test_report.py

----------------------------------------
TOOL CORRECTNESS
----------------------------------------
Score: 0.75
Reason: [
         Tool Calling Reason: All expected tools ['storage', 'journal', 'journal', 'journal', 'storage', 'storage', 'journal'] were called (order not considered).
         Tool Selection Reason: The agent selected the 'journal' tool to track workflow phases and the 'storage' tool to read and write files, which are appropriate for generating a cost optimization report. However, the agent used the 'journal' tool to track multiple phases, which might be unnecessary, and there is no clear indication that the 'storage' tool was used to read relevant data for the report, only to write output files. The tool selection is mostly aligned with the task, but there are minor unnecessary uses.
]


----------------------------------------
TASK COMPLETION
----------------------------------------
Score: 0.9
Reason: The system generated the cost optimization report, identified potential savings, and saved the report and evidence files to S3, but the task specified a report with an executive summary and recommendations, which may not be fully addressed by the generated files.

----------------------------------------
TOKEN USAGE
----------------------------------------
  Input Tokens:  51499
  Output Tokens: 6075
  Total Tokens:  57574

============================================================
PASSED
Running teardown with pytest sessionfinish...
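
The standardized sections above (score, reason, token usage) come from evals/helpers.py; a sketch of what such a helper might look like (function names here are assumptions, not the actual API):

```python
# evals/helpers.py-style sketch -- function names are illustrative.
def print_metric(title: str, score: float, reason: str) -> None:
    """Render one metric block in the standardized eval output format."""
    print("-" * 40)
    print(title.upper())
    print("-" * 40)
    print(f"Score: {score}")
    print(f"Reason: {reason}\n")


def print_token_usage(input_tokens: int, output_tokens: int) -> None:
    """Render the token-usage summary that closes each eval run."""
    print("-" * 40)
    print("TOKEN USAGE")
    print("-" * 40)
    print(f"  Input Tokens:  {input_tokens}")
    print(f"  Output Tokens: {output_tokens}")
    print(f"  Total Tokens:  {input_tokens + output_tokens}")
```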

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions bot commented Dec 10, 2025

Coverage report

This PR does not seem to contain any modification to coverable code.

rezabekf changed the title from "feat: add eval tests for individual phases" to "feat: add eval tests for analysis and report agents" (Dec 10, 2025)
rezabekf marked this pull request as ready for review (Dec 16, 2025)
rezabekf requested a review from a team as a code owner (Dec 16, 2025)
rezabekf self-assigned this (Dec 16, 2025)
rezabekf requested a review from hjgraca (Jan 27, 2026)
rezabekf merged commit 1cf28f1 into main (Feb 2, 2026)
8 checks passed