feat: add OLMES variant of BigCodeBench #184

Open
tfburns wants to merge 33 commits into main from big_code_bench

@tfburns tfburns commented Feb 26, 2026

PR Checklist

  • Use descriptive commit messages.
  • Provide tests for your changes.
  • Update any related documentation and include any relevant screenshots.
  • Check if changes need to be made to docs (README or any guides in /docs/).

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Description

Adds a variant of the BigCodeBench task which mimics the OLMES implementation.

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

@tfburns tfburns marked this pull request as ready for review February 26, 2026 14:41

@fsschneider fsschneider left a comment

LGTM. Just two small questions.

Comment on lines +131 to +133
    def __init__(self, num_fewshot: int = 5) -> None:
        # Default 3-shot; config can override. Enforce 3 for this variant.
        super().__init__(num_fewshot=3)

I think this is a bit misleading: the signature defaults to num_fewshot: int = 5, but that value is overwritten and never used. It is also silently changed to 3, which contradicts the comment saying the config can override it.

Fixed
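
The fix itself is not shown in this thread. As a minimal sketch of one way to address the comment, the constructor could default to the value it actually uses and reject anything else instead of silently replacing it. `BaseTask`, `OlmesVariantTask`, and `REQUIRED_FEWSHOT` are hypothetical names used only for illustration; they are not taken from the PR.

```python
class BaseTask:
    """Hypothetical stand-in for the real task base class."""

    def __init__(self, num_fewshot: int = 0) -> None:
        self.num_fewshot = num_fewshot


class OlmesVariantTask(BaseTask):
    """Hypothetical variant that only supports 3-shot prompting."""

    REQUIRED_FEWSHOT = 3

    def __init__(self, num_fewshot: int = REQUIRED_FEWSHOT) -> None:
        # The default now matches the value that is really used, and a
        # conflicting value fails loudly instead of being overwritten.
        if num_fewshot != self.REQUIRED_FEWSHOT:
            raise ValueError(
                f"This variant requires num_fewshot={self.REQUIRED_FEWSHOT}, "
                f"got {num_fewshot}."
            )
        super().__init__(num_fewshot=num_fewshot)
```

With this shape, `OlmesVariantTask()` yields a 3-shot task, while `OlmesVariantTask(num_fewshot=5)` raises a `ValueError`. Whether to raise or to log a warning is a design choice the thread does not settle.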

Comment on lines +143 to +146
        self.dockerfile = str(importlib.resources.files("eval_framework.tasks") / "Dockerfile_codebench")

    def _count_correct_samples(self, completion: str, context: RealtimeCodeExectionContext) -> tuple[int, str]:
        dockerfile = str(importlib.resources.files("eval_framework.tasks") / "Dockerfile_codebench")

Is there a reason to duplicate the importlib.resources.files(...) part? Why not use self.dockerfile?

Fixed
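
Again, the actual change is not shown here. A minimal sketch of the deduplication the reviewer suggests might look like the following: resolve the packaged file once in the constructor and read the attribute everywhere else. `CodeTask` and `build_settings` are hypothetical, and the standard-library package `json` with its `__init__.py` stands in for `eval_framework.tasks` and `Dockerfile_codebench` so that the example runs on its own.

```python
import importlib.resources


class CodeTask:
    """Hypothetical stand-in for the task class discussed in the review."""

    def __init__(self) -> None:
        # Resolve the packaged resource exactly once.
        # Placeholder package and file name, used so the sketch is runnable.
        self.dockerfile = str(importlib.resources.files("json") / "__init__.py")

    def build_settings(self) -> dict[str, str]:
        # Reuse the attribute rather than repeating the importlib lookup.
        return {"dockerfile": self.dockerfile}
```

Keeping the lookup in one place means a renamed or relocated Dockerfile only has to be updated in the constructor.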

3 participants