Conversation
fsschneider
left a comment
Just a few comments/questions.
MULTIPL_E_STOP_TOKENS: dict[str, list[str]] = {
    "cpp": ["\n}"],
    "java": ["\n }\n}"],
OLMES uses different stop tokens for Java. Is there a reason for this change?
class _BaseMPLE_OLMES(BaseTask[str]):
    """Abstract base for all MultiPL-E OLMES per-language tasks.
Just a general question regarding nomenclature:
Should we have the _OLMES here for all tasks? It seems to me like these are just the regular MultiPL-E tasks, without anything OLMES-specific.
Fixed the names to not have the _OLMES suffix.
RESPONSE_TYPE = ResponseType.COMPLETION
METRICS = [MultiPLECodeAssertion]
SUBJECTS = [NO_SUBJECT]
LANGUAGE = None
Perhaps this should be set to English, since all the comments/instructions are in English and therefore required to solve the tasks.
def _execute_via_sandbox_run(self, full_code: str, sandbox_lang: str) -> tuple[bool, str]:
    """Use llm-sandbox's native session.run() for cpp, java, js."""
    with SandboxSession(lang=sandbox_lang, keep_template=True, commit_container=False) as session:
        result: Any = session.run(full_code)
        output: str = getattr(result, "text", "") or ""
        exit_code: int = getattr(result, "exit_code", -1)
        return exit_code == 0, output

def _execute_via_custom_image(self, full_code: str, cfg: _CustomLangConfig) -> tuple[bool, str]:
    """For php/rs/sh: copy code into a language-specific Docker image and run manually.

    We open a SandboxSession with a dummy lang (SupportedLanguage.PYTHON) so that
    llm-sandbox sets up the container plumbing, but we never call session.run().
    Instead we write the code to a local temp file, copy it into the container with
    copy_to_runtime, and drive each compile/run command with execute_command.
    """
    code_file = f"/tmp/code.{cfg.file_ext}"
    output = ""
    exit_code = -1

    with tempfile.NamedTemporaryFile(suffix=f".{cfg.file_ext}", mode="w", delete=False) as tmp:
        tmp.write(full_code)
        tmp_path = tmp.name

    try:
        with SandboxSession(
I saw in the utils that we have added a timeout command. Is this something we want to employ here as well? It seems useful, as generated code might otherwise hang indefinitely.
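One lightweight way to do this is to prefix each in-container command with GNU `timeout`, which kills the process after the given duration (exit code 124). A minimal sketch of a helper that builds such a command string (the helper name and default are assumptions, not the utils implementation referenced above):

```python
def with_timeout(command: str, seconds: int = 30) -> str:
    """Prefix a shell command with GNU `timeout` so hanging generated
    code is killed after `seconds` seconds (timeout exits with 124)."""
    return f"timeout {seconds}s {command}"


# The wrapped string would then be passed to execute_command, e.g.:
# session.execute_command(with_timeout("/tmp/a.out", 10))
```

This keeps the timeout inside the container, so it also covers compile steps that loop forever.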
fsschneider
left a comment
LGTM. I don't know all the details about sandboxing, etc., so my code review might have gaps.
I only added one comment about what could perhaps be a missing difference between ours and OLMES, but I haven't checked it.
for stop_seq in self.stop_sequences:
    if stop_seq in completion_text:
        completion_text = completion_text.split(stop_seq)[0]
return completion_text
If I understand it correctly, when multiple stop sequences match, we check them sequentially: look for matches with the first stop sequence in the list, then check the next one on the (possibly) trimmed completion.
So if we have multiple matches, the returned completion depends on the order of self.stop_sequences, right? Could that be a difference from the OLMES implementation? E.g. either a different order there, or they try to find the earliest match?
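For non-overlapping stop sequences the sequential splits actually converge to the same prefix regardless of order, since each split can only shorten the text; order only matters when matches overlap (e.g. one stop sequence starting inside another's match). An order-independent alternative cuts at the earliest occurrence of any stop sequence. A minimal standalone sketch, not the actual implementation:

```python
def trim_at_earliest_stop(completion_text: str, stop_sequences: list[str]) -> str:
    """Cut the completion at the earliest match of any stop sequence,
    so the result does not depend on the order of stop_sequences."""
    cut = len(completion_text)
    for stop_seq in stop_sequences:
        idx = completion_text.find(stop_seq)
        if idx != -1:
            cut = min(cut, idx)  # keep only the earliest cut point
    return completion_text[:cut]
```

With overlapping sequences like `["b", "ab"]` on `"xaby"`, sequential splitting returns `"xa"` or `"x"` depending on order, while the earliest-match version always returns `"x"`.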
PR Checklist
What type of PR is this? (check all applicable)
Description
Adds the MultiPL-E HumanEval & MBPP tasks, as implemented in the OLMES framework.
Added/updated tests?
Tests have not been included.