
Conversation

alrocar (Member) commented Aug 5, 2025

Note

Replaces summary-based exactness with per-question averaging plus a capped exact-match bonus, adding validations and rounding.

  • Evaluation:
    • blendedExactnessScore:
      • Computes the average of per-question getExactnessScore results across all validation questions instead of using summary distances (see the sketch after this list).
      • Treats missing/failed queries as 0; applies a capped exact-match bonus without exceeding 100; rounds the final score.
      • Adds validation and guardrails for missing model stats, invalid fields, zero questions, and non-finite results.
    • Docs: Adds JSDoc comments for blendedExactnessScore and blendScore.
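
A minimal sketch of the flow described above, assuming argument shapes and a BONUS_WEIGHT cap that are not in the PR (getExactnessScore is real per the patch further down; everything else here is illustrative):

```ts
// Sketch only. getExactnessScore exists in eval.ts per the patch below;
// the argument shapes and BONUS_WEIGHT are assumptions for illustration.
declare function getExactnessScore(provider: string, model: string, question: string): number;

function blendedExactnessScoreSketch(
  provider: string,
  model: string,
  validationResults: Record<string, unknown>,
  exactMatches: number,
  totalQuestions: number
): number {
  // Guardrails: nothing to score means a score of 0.
  const questionKeys = Object.keys(validationResults).filter((k) => k !== '_summary');
  if (questionKeys.length === 0 || totalQuestions === 0) return 0;

  // Average per-question scores; missing/failed queries are assumed to yield 0.
  const scores = questionKeys.map((q) => getExactnessScore(provider, model, q));
  const avg = scores.reduce((sum, s) => sum + s, 0) / scores.length;

  // Capped exact-match bonus that never pushes the result above 100.
  const BONUS_WEIGHT = 10; // assumed cap, not taken from the PR
  const bonus = (exactMatches / totalQuestions) * BONUS_WEIGHT;
  const blended = Math.min(100, avg + bonus);

  // Round the final score; treat non-finite results as 0.
  return Number.isFinite(blended) ? Math.round(blended) : 0;
}
```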

Written by Cursor Bugbot for commit c9ab549. This will update automatically on new commits.

vercel bot commented Aug 5, 2025

The latest updates on your projects.

Project        Deployment   Preview   Comments   Updated (UTC)
llm-benchmark  Ready        Preview   Comment    Sep 29, 2025 5:22pm


```ts
    : 0;

// Apply exact match bonus (safe division)
const exactMatchRate = exactMatches / validationSummaries.totalQuestions;
```

Bug: Mismatched Counts Skew Exact Match Rate

The blendedExactnessScore function computes the exact match rate with validationSummaries.totalQuestions as the denominator, but it averages individual scores over questionKeys. If those two counts diverge, the exact match rate is inaccurate and skews the final blended score; one way to keep them aligned is sketched below.
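
A possible guard, sketched with a hypothetical helper (safeExactMatchRate is not in the PR), that derives the denominator from the keys actually scored and flags any disagreement with the summary:

```ts
// Hypothetical guard: divide by the count of questions actually iterated,
// not the summary total, and warn when the two disagree.
function safeExactMatchRate(
  exactMatches: number,
  validationResults: Record<string, unknown>,
  summaryTotal: number
): number {
  const scoredCount = Object.keys(validationResults).filter((k) => k !== '_summary').length;
  if (scoredCount !== summaryTotal) {
    console.warn(`Question count mismatch: ${scoredCount} scored vs ${summaryTotal} in summary`);
  }
  // Safe division: zero questions yields a rate of 0, not NaN.
  return scoredCount > 0 ? exactMatches / scoredCount : 0;
}
```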


Comment on lines +230 to +235

```ts
for (const question of questionKeys) {
  const individualScore = getExactnessScore(provider, model, question);
  individualScores.push(individualScore);
}
```

The new implementation includes 0 scores for questions that models never attempted, unfairly lowering their exactness scores compared to the original calculation method.
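
A sketch of the alternative this comment implies: exclude unattempted questions from the average instead of scoring them 0. wasAttempted is an invented predicate; the actual patch below instead reuses pre-calculated aggregates.

```ts
// Hypothetical: skip unattempted questions rather than averaging in zeros.
// `wasAttempted` does not exist in the PR; it stands in for whatever signals
// that a model produced a result for a question.
declare function wasAttempted(provider: string, model: string, question: string): boolean;
declare function getExactnessScore(provider: string, model: string, question: string): number;

function avgOverAttempted(provider: string, model: string, questionKeys: string[]): number {
  const attempted = questionKeys.filter((q) => wasAttempted(provider, model, q));
  const scores = attempted.map((q) => getExactnessScore(provider, model, q));
  // Safe division: no attempted questions yields 0 rather than NaN.
  return scores.length > 0 ? scores.reduce((sum, s) => sum + s, 0) / scores.length : 0;
}
```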

📝 Patch Details
```diff
diff --git a/src/src/lib/eval.ts b/src/src/lib/eval.ts
index ee19459..ceff5c1 100644
--- a/src/src/lib/eval.ts
+++ b/src/src/lib/eval.ts
@@ -214,29 +214,21 @@ function blendedExactnessScore(provider: string, model: string) {
     return 0;
   }
 
-  const { totalMatches, exactMatches } = modelStats;
+  const { exactMatches, avgExactDistance, avgNumericDistance, avgFScore } = modelStats;
 
-  // Calculate individual exactness scores for all questions
-  const individualScores: number[] = [];
-
-  // Get all question keys from validation results
-  const questionKeys = Object.keys(validationResults).filter(key => key !== '_summary');
-
-  // Validate we have questions to process
-  if (questionKeys.length === 0) {
-    console.log(`No questions found in validation results for ${modelKey}`);
+  // Validate required aggregate fields exist and are numbers
+  if (
+    typeof avgExactDistance !== 'number' ||
+    typeof avgNumericDistance !== 'number' ||
+    typeof avgFScore !== 'number'
+  ) {
+    console.log(`Invalid aggregate distance data for ${modelKey}`);
     return 0;
   }
-
-  for (const question of questionKeys) {
-    const individualScore = getExactnessScore(provider, model, question);
-    individualScores.push(individualScore);
-  }
-
-  // Calculate average of individual scores (safe division)
-  const avgIndividualScore = individualScores.length > 0
-    ? individualScores.reduce((sum, score) => sum + score, 0) / individualScores.length
-    : 0;
+
+  // Use pre-calculated aggregates that only include questions the model attempted
+  // This ensures models aren't penalized for unattempted questions
+  const avgIndividualScore = blendScore(avgExactDistance, avgNumericDistance, avgFScore);
 
   // Apply exact match bonus (safe division)
   const exactMatchRate = exactMatches / validationSummaries.totalQuestions;
```

Analysis

Unfair scoring penalty in blendedExactnessScore() for models with unattempted questions

What fails: blendedExactnessScore() in src/src/lib/eval.ts iterates through all 50 questions and calls getExactnessScore(), which returns 0 for unattempted questions, unfairly lowering model scores compared to the pre-calculated aggregates.

How to reproduce:

  1. Check models with unattempted questions (e.g., deepseek/deepseek-chat-v3-0324:free has 4 unattempted questions)
  2. Current implementation averages scores across all 50 questions (including 0s for unattempted)
  3. Pre-calculated aggregates (avgExactDistance, avgNumericDistance, avgFScore) only include attempted questions

Result: Models like deepseek/deepseek-chat-v3-0324:free get artificially low scores (48 vs 56 points) because unattempted questions count as 0 instead of being excluded from the calculation

Expected: Use pre-calculated aggregate statistics that only consider questions the model actually attempted, matching the original benchmark methodology that updates stats only when modelResult exists
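
The arithmetic behind the penalty, using assumed numbers rather than the repo's data (the 48-vs-56 figures above come from the bot's own run):

```ts
// Illustrative only: suppose a model attempts 46 of 50 questions and
// averages 60 on those it attempts.
const attemptedOnlyAvg = (60 * 46) / 46;        // 60   (aggregates over attempted)
const allQuestionsAvg = (60 * 46 + 0 * 4) / 50; // 55.2 (zeros for unattempted included)
```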
