remove result.success from agent evals & add current date to evaluator #971

tkattkat · 2025-08-20T17:36:40Z

why

the success flag returned by the agent can be unreliable, and it is redundant because other evaluations are already in place

what changed

test plan

tested locally

greptile-apps

Greptile Summary

This PR removes the dependency on agentResult.success from agent evaluation logic across 10 agent evaluation files. Previously, these evaluations required both the agent's self-reported success status AND external validation mechanisms (like URL checks, data extraction, or visual evaluation) to determine overall task success. Now they rely solely on the external validation mechanisms.

The changes standardize the evaluation approach across agent tasks by removing what the PR author considers an unreliable signal. For example, in github.ts the success logic changes from agentResult.success && evaluation === "YES" to just evaluation === "YES". Similar patterns are applied across tasks like Google Shopping, Hugging Face, Apple TV, NBA trades, and others.
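The before/after success logic described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `AgentResult` and `Evaluation` shapes and both function names are assumptions introduced here, and only the two boolean expressions come from the summary.

```typescript
// Hypothetical shapes for illustration; the real eval types may differ.
interface AgentResult {
  success: boolean; // self-reported by the agent
  message?: string;
}

type Evaluation = "YES" | "NO";

// Before: a false self-report vetoes a passing external evaluation.
function successBefore(agentResult: AgentResult, evaluation: Evaluation): boolean {
  return agentResult.success && evaluation === "YES";
}

// After: only the external evaluation decides the outcome.
function successAfter(evaluation: Evaluation): boolean {
  return evaluation === "YES";
}
```

With these definitions, a run where the agent reports failure but the evaluator answers "YES" fails under the old rule and passes under the new one.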

The changes integrate well with the existing codebase architecture where agent tasks use various validation mechanisms:

Visual evaluation via the Evaluator class that takes screenshots and uses VLM assessment
URL validation to ensure correct navigation
Data extraction verification (checking extracted values match expected results)
Combination approaches using multiple validation layers

This creates a cleaner separation of concerns where agents execute tasks and independent evaluators assess outcomes, rather than having agent self-assessment influence final results. The approach aligns with the principle that objective, external validation is more reliable than programmatic success flags in web automation scenarios.
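As a rough sketch of how the validation layers listed above might be combined without consulting the agent's own flag, consider the helper below. Everything in it is hypothetical: `TaskObservations`, `TaskExpectations`, and `evaluateTask` are names made up for this example, and the visual verdict is assumed to have been produced elsewhere by an evaluator.

```typescript
type Verdict = "YES" | "NO";

// What was observed after the agent finished (all fields are assumptions).
interface TaskObservations {
  finalUrl: string;
  extractedValue: string | null;
  visualVerdict: Verdict;
}

// What the task expects (also assumptions for this sketch).
interface TaskExpectations {
  urlPrefix: string;
  expectedValue: string;
}

// Runs each external check and collects the names of the ones that failed.
function evaluateTask(
  observed: TaskObservations,
  expected: TaskExpectations,
): { success: boolean; failures: string[] } {
  const failures: string[] = [];
  if (!observed.finalUrl.startsWith(expected.urlPrefix)) {
    failures.push("url");
  }
  if (observed.extractedValue !== expected.expectedValue) {
    failures.push("extraction");
  }
  if (observed.visualVerdict !== "YES") {
    failures.push("visual");
  }
  // Success depends only on external checks, never on a self-reported flag.
  return { success: failures.length === 0, failures };
}
```

Returning the list of failed checks alongside the boolean makes it easier to see which layer rejected a run.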

PR Description Notes:

The "what changed" section is empty but should describe the specific modifications made
The test plan only mentions "tested locally" without details about what was tested

Confidence score: 2/5

This PR introduces critical bugs that will cause immediate failures in production
Score reflects serious implementation errors where agent execution is not properly awaited in multiple files
Pay close attention to steam_games.ts and google_maps_3.ts which have broken async/await patterns that will cause race conditions

11 files reviewed, 1 comment


greptile-apps · 2025-08-20T17:37:22Z

evals/tasks/agent/steam_games.ts

+    agent.execute({
      instruction:
        "Show most played games in Steam. And tell me the number of players in In game at this time",
      maxSteps: 30,
    });


logic: Missing await keyword - the agent execution is not being waited for, so evaluation will happen before the agent completes its task

Suggested change

-    agent.execute({
-      instruction:
-        "Show most played games in Steam. And tell me the number of players in In game at this time",
-      maxSteps: 30,
-    });
+    const agentResult = await agent.execute({
+      instruction:
+        "Show most played games in Steam. And tell me the number of players in In game at this time",
+      maxSteps: 30,
+    });
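To illustrate why the missing `await` matters, the sketch below uses a stand-in agent whose `execute` resolves after a short delay. The fake agent, the `log` array, and both runner functions are assumptions made for this example and are not the real agent API; the point is only the ordering of events.

```typescript
interface FakeAgentResult {
  message: string;
}

// Hypothetical stand-in for an agent: finishes its "task" after 10 ms.
function makeFakeAgent(log: string[]) {
  return {
    async execute(_opts: { instruction: string; maxSteps: number }): Promise<FakeAgentResult> {
      await new Promise<void>((resolve) => setTimeout(resolve, 10));
      log.push("agent finished");
      return { message: "done" };
    },
  };
}

// Bug pattern: the promise is dropped, so evaluation runs first.
async function runWithoutAwait(log: string[]): Promise<void> {
  const agent = makeFakeAgent(log);
  void agent.execute({ instruction: "demo", maxSteps: 30 });
  log.push("evaluation ran");
}

// Fixed pattern: evaluation runs only after the agent completes.
async function runWithAwait(log: string[]): Promise<void> {
  const agent = makeFakeAgent(log);
  await agent.execute({ instruction: "demo", maxSteps: 30 });
  log.push("evaluation ran");
}
```

In the first runner the evaluation step is logged before the agent has finished, which matches the race the review comment describes; in the second the order is reversed.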

remove result.success from agent evals & add current date to evaluator

afc6322

greptile-apps bot reviewed Aug 20, 2025
