fix: match OLMES for GenQA tasks #195

Open
fsschneider wants to merge 11 commits into main from OLMES_matching_genqa

Conversation

@fsschneider
Contributor

PR Checklist

  • Use descriptive commit messages.
  • Provide tests for your changes.
  • Update any related documentation and include any relevant screenshots.
  • Check if changes need to be made to docs (README or any guides in /docs/).

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Description

PR to match GenQA tasks with OLMES:

  • New task variants: HELLASWAG_OLMES (train split), SQuAD_OLMES (OLMES-style prompt with SQuAD-normalized F1), WINOGRANDECloze (partial-evaluation cloze with custom metric)
  • Updated tasks: NaturalQsOpen (OLMES prompt format, DROP F1/EM metric, fixed fewshot target formatting bug), DropCompletion_OLMES (added reading comprehension initial prompt)
  • New metrics: F1SquadNormalized (SQuAD-style F1 with article/punctuation removal), PartialEvalAccuracy (stateful metric pairing two samples per Winogrande item to compute p(suffix | prefix + option))
  • Refactored F1 base class: Extracted normalize() and tokenize() hooks for subclass customization
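
To illustrate the hook-based F1 refactor and the SQuAD-style normalization listed above, here is a minimal sketch. The class names `TokenF1` and `SquadNormalizedF1` and the `score()` method are hypothetical stand-ins chosen for this example, not the classes in this PR; only the `normalize()` and `tokenize()` hook names come from the description. The normalization order (lowercase, strip punctuation, drop articles, collapse whitespace) follows the common SQuAD evaluation convention and is assumed to be what the PR intends.

```python
import re
import string
from collections import Counter


class TokenF1:
    """Token-overlap F1 with overridable hooks (hypothetical, not the PR's class)."""

    def normalize(self, text: str) -> str:
        # Base behaviour: leave the text untouched.
        return text

    def tokenize(self, text: str) -> list[str]:
        # Base behaviour: split on whitespace.
        return text.split()

    def score(self, prediction: str, reference: str) -> float:
        pred = self.tokenize(self.normalize(prediction))
        ref = self.tokenize(self.normalize(reference))
        if not pred or not ref:
            # Two empty answers match; one empty answer is a miss.
            return float(pred == ref)
        # Multiset intersection counts each shared token at most min(count) times.
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        return 2 * precision * recall / (precision + recall)


class SquadNormalizedF1(TokenF1):
    """Overrides only normalize(); tokenization and scoring are inherited."""

    def normalize(self, text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())
```

With this layout a subclass changes the comparison rule without touching the F1 arithmetic: `SquadNormalizedF1().score("The Eiffel Tower!", "eiffel tower")` treats the two strings as identical, while the base class sees no shared tokens.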

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

# Both samples exist, calculate the accuracy
other_logprob, other_is_correct = self._pending.pop(item_id)
# Verify that only one of the samples is correct
assert other_is_correct != is_correct, "Both samples cannot be correct or incorrect at the same time"
Contributor

Is this a check of the dataset itself?

Contributor Author

No, I think of it more as a sanity check that the computation in the task itself works correctly.
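
For readers following the thread, the pairing logic under discussion can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: the class name `PairedAccuracy` and the `update()` signature are hypothetical, and only `_pending`, `item_id`, `other_logprob`, `other_is_correct`, and `is_correct` are taken from the reviewed snippet. The sketch assumes each item arrives as exactly two samples, one per option, each carrying the log-probability of the shared suffix.

```python
from typing import Optional


class PairedAccuracy:
    """Stateful two-sample metric (hypothetical sketch, not the PR's code)."""

    def __init__(self) -> None:
        # Maps item_id -> (logprob, is_correct) for the first sample seen.
        self._pending: dict[str, tuple[float, bool]] = {}

    def update(self, item_id: str, logprob: float, is_correct: bool) -> Optional[float]:
        if item_id not in self._pending:
            # First sample of the pair: store it and wait for its partner.
            self._pending[item_id] = (logprob, is_correct)
            return None
        # Both samples exist, so the item can be scored.
        other_logprob, other_is_correct = self._pending.pop(item_id)
        if other_is_correct == is_correct:
            # Sanity check on the pairing: exactly one option must be the gold one.
            raise ValueError(f"Item {item_id!r}: exactly one sample must be correct")
        correct_logprob = logprob if is_correct else other_logprob
        wrong_logprob = other_logprob if is_correct else logprob
        # The item counts as correct when the gold option makes the suffix more likely.
        return float(correct_logprob > wrong_logprob)
```

The sketch raises an explicit exception where the reviewed code uses `assert`; that is a design choice made here so the check still runs under `python -O`, and either form expresses the same invariant.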

Co-authored-by: Prabhu Teja <prabhu.sivaprasad@aleph-alpha-research.com>

2 participants