refactor: improve exactness score calculation in evaluation #56
Conversation
```ts
    : 0;

  // Apply exact match bonus (safe division)
  const exactMatchRate = exactMatches / validationSummaries.totalQuestions;
```
Bug: Mismatched Counts Skew Exact Match Rate

The `blendedExactnessScore` function calculates the exact match rate using `validationSummaries.totalQuestions`, but processes individual scores based on `questionKeys`. If these counts don't align, the exact match rate will be inaccurate, affecting the final blended score.
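The denominator mismatch can be sketched in isolation. Note that the result shape and function names below are hypothetical illustrations, not the project's actual types:

```typescript
// Hypothetical result shape for illustration only.
type Results = Record<string, { exact: boolean }>;

// Bug pattern: the denominator comes from a separately maintained summary
// count, which can drift from the set of keys actually scored.
function rateFromSummary(results: Results, summaryTotal: number): number {
  const exactMatches = Object.values(results).filter((r) => r.exact).length;
  return summaryTotal > 0 ? exactMatches / summaryTotal : 0;
}

// Safer pattern: derive the denominator from the same keys being iterated.
function rateFromKeys(results: Results): number {
  const keys = Object.keys(results).filter((k) => k !== "_summary");
  const exactMatches = keys.filter((k) => results[k].exact).length;
  return keys.length > 0 ? exactMatches / keys.length : 0;
}
```

If the summary claims 4 questions but only 2 keys were scored (1 of them exact), `rateFromSummary` reports 0.25 while `rateFromKeys` reports 0.5.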
```ts
  for (const question of questionKeys) {
    const individualScore = getExactnessScore(provider, model, question);
    individualScores.push(individualScore);
  }
```
The new implementation includes 0 scores for questions that models never attempted, unfairly lowering their exactness scores compared to the original calculation method.
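The penalty can be illustrated with made-up numbers, where `null` stands in for an unattempted question (none of this is the project's real data):

```typescript
// Average over every question, treating unattempted (null) as 0 — the
// behavior this comment objects to.
function avgOverAll(scores: Array<number | null>): number {
  const total = scores.reduce<number>((sum, s) => sum + (s ?? 0), 0);
  return scores.length > 0 ? total / scores.length : 0;
}

// Average over attempted questions only, matching how pre-calculated
// aggregates that skip unattempted questions would behave.
function avgOverAttempted(scores: Array<number | null>): number {
  const attempted = scores.filter((s): s is number => s !== null);
  const total = attempted.reduce((sum, s) => sum + s, 0);
  return attempted.length > 0 ? total / attempted.length : 0;
}
```

With scores `[1, 1, 0.5, 0.5]` and two unattempted questions, `avgOverAll` yields 0.5 while `avgOverAttempted` yields 0.75 — the same direction of skew as the 48-vs-56 gap described in the Analysis.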
📝 Patch Details
```diff
diff --git a/src/src/lib/eval.ts b/src/src/lib/eval.ts
index ee19459..ceff5c1 100644
--- a/src/src/lib/eval.ts
+++ b/src/src/lib/eval.ts
@@ -214,29 +214,21 @@ function blendedExactnessScore(provider: string, model: string) {
     return 0;
   }
-  const { totalMatches, exactMatches } = modelStats;
+  const { exactMatches, avgExactDistance, avgNumericDistance, avgFScore } = modelStats;
-  // Calculate individual exactness scores for all questions
-  const individualScores: number[] = [];
-
-  // Get all question keys from validation results
-  const questionKeys = Object.keys(validationResults).filter(key => key !== '_summary');
-
-  // Validate we have questions to process
-  if (questionKeys.length === 0) {
-    console.log(`No questions found in validation results for ${modelKey}`);
+  // Validate required aggregate fields exist and are numbers
+  if (
+    typeof avgExactDistance !== 'number' ||
+    typeof avgNumericDistance !== 'number' ||
+    typeof avgFScore !== 'number'
+  ) {
+    console.log(`Invalid aggregate distance data for ${modelKey}`);
     return 0;
   }
-
-  for (const question of questionKeys) {
-    const individualScore = getExactnessScore(provider, model, question);
-    individualScores.push(individualScore);
-  }
-
-  // Calculate average of individual scores (safe division)
-  const avgIndividualScore = individualScores.length > 0
-    ? individualScores.reduce((sum, score) => sum + score, 0) / individualScores.length
-    : 0;
+
+  // Use pre-calculated aggregates that only include questions the model attempted
+  // This ensures models aren't penalized for unattempted questions
+  const avgIndividualScore = blendScore(avgExactDistance, avgNumericDistance, avgFScore);
   // Apply exact match bonus (safe division)
   const exactMatchRate = exactMatches / validationSummaries.totalQuestions;
```
Analysis
Unfair scoring penalty in `blendedExactnessScore()` for models with unattempted questions

What fails: `blendedExactnessScore()` in `src/src/lib/eval.ts` iterates through ALL 50 questions and calls `getExactnessScore()`, which returns 0 for unattempted questions, unfairly lowering model scores compared to using pre-calculated aggregates.

How to reproduce:
- Check models with unattempted questions (e.g., `deepseek/deepseek-chat-v3-0324:free` has 4 unattempted questions).
- The current implementation averages scores across all 50 questions, including 0s for unattempted ones.
- The pre-calculated aggregates (`avgExactDistance`, `avgNumericDistance`, `avgFScore`) only include attempted questions.

Result: Models like `deepseek/deepseek-chat-v3-0324:free` get artificially low scores (48 vs 56 points) because unattempted questions count as 0 instead of being excluded from the calculation.

Expected: Use pre-calculated aggregate statistics that only consider questions the model actually attempted, matching the original benchmark methodology that updates stats only when `modelResult` exists.
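A minimal sketch of that "update stats only when `modelResult` exists" pattern follows; the stats shape and field names are assumptions for illustration, not the repo's actual code:

```typescript
// Hypothetical per-model aggregate, illustrating the expected methodology.
interface ModelStats {
  attempted: number;
  exactMatches: number;
  scoreSum: number;
}

// Aggregates are only touched when the model actually produced a result,
// so unattempted questions never drag the average down.
function updateStats(
  stats: ModelStats,
  modelResult: { score: number; exact: boolean } | undefined,
): ModelStats {
  if (!modelResult) return stats; // unattempted: leave stats unchanged
  return {
    attempted: stats.attempted + 1,
    exactMatches: stats.exactMatches + (modelResult.exact ? 1 : 0),
    scoreSum: stats.scoreSum + modelResult.score,
  };
}
```

Averaging `scoreSum / attempted` then reflects only questions the model answered, which is the behavior the patch restores by reading the pre-calculated aggregates.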
Note

Replaces summary-based exactness with per-question averaging plus a capped exact-match bonus, adding validations and rounding.

- `blendedExactnessScore`: averages `getExactnessScore` across all validation questions instead of using summary distances.
- Adds validations and rounding in `blendedExactnessScore` and `blendScore`.

Written by Cursor Bugbot for commit c9ab549. This will update automatically on new commits.