refactor: improve exactness score calculation in evaluation #56
Open
alrocar wants to merge 3 commits into main from v2
Changes to `src/src/lib/eval.ts`:

```diff
@@ -175,9 +175,21 @@ export function calculateRanks(metrics: ModelMetrics[]): ModelMetrics[] {
   });
 }
 
+/**
+ * Calculates a comprehensive exactness score for a model based on validation results.
+ *
+ * This function calculates the average of individual exactness scores across all questions,
+ * which provides a more accurate representation than using average distance metrics.
+ *
+ * The score properly accounts for:
+ * 1. Individual question scores (including perfect 100 scores)
+ * 2. Failed queries (scored as 0)
+ * 3. Exact match bonus for perfect accuracy
+ */
 function blendedExactnessScore(provider: string, model: string) {
   const modelKey = `${provider}/${model}`;
 
+  // Validate that model stats exist
   if (
     !validationSummaries.modelStats[
       modelKey as keyof typeof validationSummaries.modelStats
@@ -187,15 +199,71 @@ function blendedExactnessScore(provider: string, model: string) {
     return 0;
   }
 
-  const { avgExactDistance, avgNumericDistance, avgFScore } =
-    validationSummaries.modelStats[
-      modelKey as keyof typeof validationSummaries.modelStats
-    ];
+  const modelStats = validationSummaries.modelStats[
+    modelKey as keyof typeof validationSummaries.modelStats
+  ];
 
+  // Validate required fields exist and are numbers
+  if (
+    typeof modelStats.totalMatches !== 'number' ||
+    typeof modelStats.exactMatches !== 'number' ||
+    typeof validationSummaries.totalQuestions !== 'number' ||
+    validationSummaries.totalQuestions === 0
+  ) {
+    console.log(`Invalid validation data for ${modelKey}`);
+    return 0;
+  }
 
-  // strong preference for exact, numeric as backup, fscore as minor fallback (it's correlated with jaccard)
-  return blendScore(avgExactDistance, avgNumericDistance, avgFScore);
+  const { totalMatches, exactMatches } = modelStats;
 
+  // Calculate individual exactness scores for all questions
+  const individualScores: number[] = [];
 
+  // Get all question keys from validation results
+  const questionKeys = Object.keys(validationResults).filter(key => key !== '_summary');
 
+  // Validate we have questions to process
+  if (questionKeys.length === 0) {
+    console.log(`No questions found in validation results for ${modelKey}`);
+    return 0;
+  }
 
+  for (const question of questionKeys) {
+    const individualScore = getExactnessScore(provider, model, question);
+    individualScores.push(individualScore);
+  }
 
+  // Calculate average of individual scores (safe division)
+  const avgIndividualScore = individualScores.length > 0
+    ? individualScores.reduce((sum, score) => sum + score, 0) / individualScores.length
+    : 0;
 
+  // Apply exact match bonus (safe division)
+  const exactMatchRate = exactMatches / validationSummaries.totalQuestions;
 
+  // Calculate bonus that ensures final score never exceeds 100
+  const maxPossibleBonus = Math.max(0, 100 - avgIndividualScore);
+  const exactMatchBonus = exactMatchRate * Math.min(5, maxPossibleBonus);
 
+  const finalScore = avgIndividualScore + exactMatchBonus;
 
+  // Validate final score is a valid number
+  if (!isFinite(finalScore)) {
+    console.log(`Invalid final score calculated for ${modelKey}: ${finalScore}`);
+    return 0;
+  }
 
+  return Math.round(finalScore);
 }
 
+/**
+ * Blends different distance metrics into a single quality score.
+ *
+ * @param exact - Exact distance metric (0 = perfect match, 1 = complete mismatch)
+ * @param numeric - Numeric distance metric (0 = perfect match, 1 = complete mismatch)
+ * @param fscore - F-score metric (0 = worst, 1 = best)
+ * @returns Quality score on 0-100 scale (100 = perfect)
+ */
 function blendScore(exact: number, numeric: number, fscore: number) {
   return 100 * (0.65 * (1 - exact) + 0.25 * (1 - numeric) + 0.1 * fscore);
 }
```

Review comment on the `exactMatchRate` line: "Bug: Mismatched Counts Skew Exact Match Rate"
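The scoring arithmetic introduced by the diff can be sketched in isolation. This is an illustrative reconstruction, not the PR's actual module: `blendScore` mirrors the function shown above, while `applyExactMatchBonus` is a hypothetical helper that isolates the capped exact-match bonus step.

```typescript
// Sketch of the scoring arithmetic from the diff above.
// applyExactMatchBonus is a hypothetical helper, not a function in the PR.

function blendScore(exact: number, numeric: number, fscore: number): number {
  // exact and numeric are distances (0 = perfect match, 1 = complete mismatch);
  // fscore is a similarity (1 = best). The weights strongly favor exactness.
  return 100 * (0.65 * (1 - exact) + 0.25 * (1 - numeric) + 0.1 * fscore);
}

function applyExactMatchBonus(avgIndividualScore: number, exactMatchRate: number): number {
  // The bonus is capped at 5 points and at the headroom below 100,
  // so the final score can never exceed 100.
  const maxPossibleBonus = Math.max(0, 100 - avgIndividualScore);
  const exactMatchBonus = exactMatchRate * Math.min(5, maxPossibleBonus);
  return Math.round(avgIndividualScore + exactMatchBonus);
}

// Half the questions matched exactly: bonus of 0.5 * 5 = 2.5 points.
console.log(applyExactMatchBonus(80, 0.5)); // 83
```

The cap matters near the top of the scale: at an average score of 98, the headroom is 2, so even a perfect exact-match rate adds only 2 points rather than 5.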
The new implementation includes 0 scores for questions that models never attempted, unfairly lowering their exactness scores compared to the original calculation method.

Analysis: unfair scoring penalty in blendedExactnessScore() for models with unattempted questions.

What fails: blendedExactnessScore() in src/src/lib/eval.ts iterates through all 50 questions and calls getExactnessScore(), which returns 0 for unattempted questions, unfairly lowering model scores compared to using the pre-calculated aggregates.

How to reproduce: deepseek/deepseek-chat-v3-0324:free has 4 unattempted questions, while the pre-calculated aggregates (avgExactDistance, avgNumericDistance, avgFScore) only include attempted questions.

Result: models like deepseek/deepseek-chat-v3-0324:free get artificially low scores (48 vs 56 points) because unattempted questions count as 0 instead of being excluded from the calculation.

Expected: use pre-calculated aggregate statistics that only consider questions the model actually attempted, matching the original benchmark methodology, which updates stats only when modelResult exists.
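The reviewer's expected behavior, averaging only over attempted questions, could be sketched as follows. The `null`-marks-unattempted data shape and the `avgOverAttempted` name are illustrative assumptions, not the repository's actual structures.

```typescript
// Illustrative sketch: average exactness only over attempted questions,
// rather than scoring unattempted ones as 0. Here `null` marks an
// unattempted question; the PR's real data lives in validationResults.

function avgOverAttempted(scores: Record<string, number | null>): number {
  // Type-guard filter drops unattempted (null) entries before averaging.
  const attempted = Object.values(scores).filter((s): s is number => s !== null);
  if (attempted.length === 0) return 0; // nothing attempted: nothing to average
  return attempted.reduce((sum, s) => sum + s, 0) / attempted.length;
}

// A model scoring 100 and 60 on two attempted questions, skipping a third,
// averages 80 instead of (100 + 60 + 0) / 3 ≈ 53.
console.log(avgOverAttempted({ q1: 100, q2: 60, q3: null })); // 80
```

This mirrors the review's point: the denominator should be the count of attempted questions, so skipped questions neither help nor hurt the average.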