fix: match OLMES for GenQA tasks by fsschneider · Pull Request #195 · Aleph-Alpha-Research/eval-framework

fsschneider · 2026-03-13T13:18:03Z

PR Checklist

Use descriptive commit messages.
Provide tests for your changes.
Update any related documentation and include any relevant screenshots.
Check if changes need to be made to docs (README or any guides in /docs/).

What type of PR is this? (check all applicable)

Description

PR to match GenQA tasks with OLMES:

New task variants: HELLASWAG_OLMES (train split), SQuAD_OLMES (OLMES-style prompt with SQuAD-normalized F1), WINOGRANDECloze (partial-evaluation cloze with custom metric)
Updated tasks: NaturalQsOpen (OLMES prompt format, DROP F1/EM metric, fixed fewshot target formatting bug), DropCompletion_OLMES (added reading comprehension initial prompt)
New metrics: F1SquadNormalized (SQuAD-style F1 with article/punctuation removal), PartialEvalAccuracy (stateful metric pairing two samples per Winogrande item to compute p(suffix | prefix + option))
Refactored F1 base class: Extracted normalize() and tokenize() hooks for subclass customization
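
To illustrate the hook-based refactor and the SQuAD-style normalization described above, here is a minimal sketch. It is not the code from this PR: the class names `TokenF1` and `SquadNormalizedF1` and the `score()` method are hypothetical stand-ins, and only the `normalize()` and `tokenize()` hook names are taken from the description. The normalization follows the usual SQuAD recipe (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace).

```python
import re
import string
from collections import Counter


class TokenF1:
    """Hypothetical base class: token-overlap F1 with two overridable hooks."""

    def normalize(self, text: str) -> str:
        # Default hook: leave the text untouched.
        return text

    def tokenize(self, text: str) -> list[str]:
        # Default hook: split on whitespace.
        return text.split()

    def score(self, prediction: str, ground_truths: list[str]) -> float:
        # With several acceptable answers, keep the best-matching one.
        return max(self._f1(prediction, gt) for gt in ground_truths)

    def _f1(self, prediction: str, ground_truth: str) -> float:
        pred = self.tokenize(self.normalize(prediction))
        gold = self.tokenize(self.normalize(ground_truth))
        if not pred or not gold:
            # Two empty answers match; one empty answer never does.
            return float(pred == gold)
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred)
        recall = overlap / len(gold)
        return 2 * precision * recall / (precision + recall)


class SquadNormalizedF1(TokenF1):
    """Hypothetical subclass that overrides only the normalize() hook."""

    def normalize(self, text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())
```

In this sketch the subclass changes the metric's behavior without touching the F1 arithmetic: `SquadNormalizedF1().score("The Eiffel Tower!", ["eiffel tower"])` treats the two strings as identical, while the unnormalized `TokenF1` finds no overlapping tokens.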

Added/updated tests?

Yes
No, and this is why: please replace this line with details on why tests
have not been included
I need help with writing tests

src/eval_framework/metrics/loglikelihood/accuracy_loglikelihood.py

prabhuteja12 · 2026-03-16T12:28:23Z

+            # Both samples exist, calculate the accuracy
+            other_logprob, other_is_correct = self._pending.pop(item_id)
+            # Verify that only one of the samples is correct
+            assert other_is_correct != is_correct, "Both samples cannot be correct or incorrect at the same time"


Is this a check of the dataset itself?

fsschneider · 2026-03-17

No, I think of it rather as a sanity check that the computation in the task itself works correctly.
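
For readers without the full diff, the pairing logic under discussion can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's implementation: the class name `PairedLoglikelihoodAccuracy` and the `update()` signature are hypothetical, and only `_pending`, `item_id`, `other_logprob`, `other_is_correct`, and `is_correct` appear in the reviewed snippet. The sketch assumes each item yields exactly two samples, one per option, and counts the item as correct when the correct option has the higher log-probability.

```python
from typing import Optional


class PairedLoglikelihoodAccuracy:
    """Hypothetical stand-in for a stateful metric that pairs two samples per item."""

    def __init__(self) -> None:
        # item_id -> (logprob, is_correct) of the first sample seen for that item
        self._pending: dict[str, tuple[float, bool]] = {}

    def update(self, item_id: str, logprob: float, is_correct: bool) -> Optional[float]:
        if item_id not in self._pending:
            # First sample of the pair: remember it and wait for its partner.
            self._pending[item_id] = (logprob, is_correct)
            return None

        # Both samples exist, so the item can be scored.
        other_logprob, other_is_correct = self._pending.pop(item_id)

        # Sanity check on the task's own bookkeeping: exactly one option is correct.
        if other_is_correct == is_correct:
            raise ValueError(f"Item {item_id!r}: exactly one sample must be correct")

        correct_logprob = logprob if is_correct else other_logprob
        wrong_logprob = other_logprob if is_correct else logprob
        return float(correct_logprob > wrong_logprob)
```

The sketch raises `ValueError` where the reviewed snippet uses `assert`. That is a deliberate difference for illustration only: Python drops `assert` statements when run with `-O`, so an explicit exception keeps the check active in every mode.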

fsschneider added 10 commits March 13, 2026 13:05

feat: add squad-style normalized f1 metric

fa84a1a

feat: add OLMES-matching SQuAD task

660a16a

feat: add OLMES-matching HELLASWAG task

9f0efb0

fix: handle multiple ground-truths

f3bff64

fix: add missing initial prompt

ffa56b4

feat: add Winogrande with custom partial-eval metric

f59b0b3

fix: register new tasks

58439b3

test: update hashes

4760701

docs: update docs with new tasks and modifications

001cbe9

style: fix mypy missing return type annotation

e0d47a5

fsschneider requested a review from prabhuteja12 March 13, 2026 15:28

prabhuteja12 reviewed Mar 16, 2026

src/eval_framework/metrics/loglikelihood/accuracy_loglikelihood.py (resolved)

prabhuteja12 reviewed Mar 16, 2026

src/eval_framework/metrics/loglikelihood/accuracy_loglikelihood.py (outdated, resolved)

refactor: tighter assignement

05b2e24

Co-authored-by: Prabhu Teja <prabhu.sivaprasad@aleph-alpha-research.com>

fsschneider wants to merge 11 commits into main from OLMES_matching_genqa
