feat: add OLMES variant of BigCodeBench by tfburns · Pull Request #184 · Aleph-Alpha-Research/eval-framework

tfburns · 2026-02-26T09:16:16Z

PR Checklist

Use descriptive commit messages.
Provide tests for your changes.
Update any related documentation and include any relevant screenshots.
Check if changes need to be made to docs (README or any guides in /docs/).

What type of PR is this? (check all applicable)

Description

Adds a variant of the BigCodeBench task which mimics the OLMES implementation.

Added/updated tests?

Yes
No, and this is why: please replace this line with details on why tests
have not been included
I need help with writing tests

src/eval_framework/tasks/benchmarks/bigcodebench.py

tests/tests_eval_framework/tasks/task-prompts-hashes.json

tests/tests_eval_framework/tasks/test_utils.py


fsschneider

LGTM. Just two small questions.

fsschneider · 2026-03-18T08:17:27Z

src/eval_framework/tasks/benchmarks/bigcodebench.py

+    def __init__(self, num_fewshot: int = 5) -> None:
+        # Default 3-shot; config can override. Enforce 3 for this variant.
+        super().__init__(num_fewshot=3)


I think this is a bit misleading: the signature defaults to num_fewshot: int = 5, but that value is overwritten and never used. The comment also says the config can override it, yet the value is silently changed to 3.
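One way to address this comment is to make the fixed shot count explicit and reject any other value instead of silently replacing it. The sketch below is illustrative only: BaseTask, FixedShotTask, and REQUIRED_FEWSHOT are hypothetical names, not the classes in eval-framework, and the real base class likely takes more arguments.

```python
class BaseTask:
    """Hypothetical stand-in for the framework's task base class."""

    def __init__(self, num_fewshot: int = 0) -> None:
        self.num_fewshot = num_fewshot


class FixedShotTask(BaseTask):
    """Hypothetical variant that only supports one few-shot count."""

    REQUIRED_FEWSHOT = 3

    def __init__(self, num_fewshot: int = REQUIRED_FEWSHOT) -> None:
        # Fail loudly rather than silently overriding the caller's value.
        if num_fewshot != self.REQUIRED_FEWSHOT:
            raise ValueError(
                f"{type(self).__name__} requires num_fewshot="
                f"{self.REQUIRED_FEWSHOT}, got {num_fewshot}"
            )
        super().__init__(num_fewshot=num_fewshot)
```

With this shape the default in the signature matches the value that is actually used, and a config that asks for a different count gets an error instead of a surprise.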

fsschneider · 2026-03-18T09:11:29Z

src/eval_framework/metrics/completion/code_execution_pass_at_one.py

+        self.dockerfile = str(importlib.resources.files("eval_framework.tasks") / "Dockerfile_codebench")
+
+    def _count_correct_samples(self, completion: str, context: RealtimeCodeExectionContext) -> tuple[int, str]:
+        dockerfile = str(importlib.resources.files("eval_framework.tasks") / "Dockerfile_codebench")


Is there a reason to duplicate the importlib.resources.files(...) part? Why not use self.dockerfile?
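A minimal sketch of what this suggestion could look like: resolve the path once in the constructor and read the attribute afterwards. CodeExecutionMetric and its package parameter are hypothetical, the method signature is simplified, and the body is a placeholder rather than the real sandbox logic.

```python
import importlib.resources


class CodeExecutionMetric:
    """Hypothetical stand-in for the metric class shown in the diff."""

    def __init__(self, package: str = "eval_framework.tasks") -> None:
        # Resolve the Dockerfile path once, at construction time.
        self.dockerfile = str(
            importlib.resources.files(package) / "Dockerfile_codebench"
        )

    def _count_correct_samples(self, completion: str) -> tuple[int, str]:
        # Reuse the stored path instead of resolving it a second time.
        dockerfile = self.dockerfile
        # Placeholder result: the real method would execute the completion
        # in a sandbox built from this Dockerfile.
        return 0, dockerfile
```

The package parameter exists only so the sketch can be tried without eval_framework installed, for example with CodeExecutionMetric(package="json").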

tfburns and others added 5 commits February 26, 2026 08:28

feat: add OLMES variant of BigCodeBench

d9772ec

docs: update readme and BigCodeBench_OLMES docs

4afb0af

feat: cleanup unit tests

ba979e4

fix: prompt hashes for BigCodeBench are non-deterministic

127288b

Merge branch 'main' into big_code_bench

61bb27f

tfburns marked this pull request as ready for review February 26, 2026 14:41

prabhuteja12 reviewed Feb 26, 2026

src/eval_framework/tasks/benchmarks/bigcodebench.py (outdated, resolved)

prabhuteja12 reviewed Feb 26, 2026

src/eval_framework/tasks/benchmarks/bigcodebench.py (outdated, resolved)

prabhuteja12 reviewed Feb 26, 2026

tests/tests_eval_framework/tasks/task-prompts-hashes.json (resolved)

prabhuteja12 reviewed Feb 26, 2026

tests/tests_eval_framework/tasks/test_utils.py (resolved)

tfburns and others added 20 commits February 26, 2026 16:01

docs: improved error messaging/logic and test names and docstrings for BigCodeBench_OLMES task

4546aa3

Merge branch 'main' into big_code_bench

6e0c19c

docs: update readme with newly-added task

54dc0d7

Merge branch 'main' into big_code_bench

0362de1

convert bytes to string

ef28f49

make mypy happy!

1c43b79

Merge branch 'main' into big_code_bench

6ab7084

removing cue text

9ae6d78

Merge branch 'main' into big_code_bench

011716b

storing

1b7853b

storing

5e9f4d4

more cleanups

d2e513c

feat: adding pool to sandbox

28fe64d

handling bytes conversion

31a341e

readme update

be35eaa

adding dockerfile arguments

fc454af

working code with container reuse

e98c4e6

documentation update

5ab9a62

formatter hashes

b9bb6f8

fixing the timeout exceptions

26e54e0

prabhuteja12 added 6 commits March 12, 2026 21:41

error handling i

ab656d3

Merge branch 'sandboxupgades' into big_code_bench

6d8ac2f

additions to stop sequences, parsing

8c6fde2

Merge remote-tracking branch 'origin/main' into big_code_bench

1c5524a

Merge remote-tracking branch 'origin/main' into big_code_bench

15209b3

update readme

220b310

fsschneider approved these changes Mar 18, 2026


prabhuteja12 added 2 commits March 20, 2026 13:37

Merge branch 'main' into big_code_bench

e9cbb93

fixing comments

7086d8a

tfburns wants to merge 33 commits into main from big_code_bench


3 participants
