Conversation
fsschneider
left a comment
Just a few comments/questions.
MULTIPL_E_STOP_TOKENS: dict[str, list[str]] = {
    "cpp": ["\n}"],
    "java": ["\n }\n}"],
OLMES uses different stop tokens for Java. Is there a reason for this change?
class _BaseMPLE_OLMES(BaseTask[str]):
    """Abstract base for all MultiPL-E OLMES per-language tasks.
Just a general question regarding nomenclature:
Should we have the _OLMES here for all tasks? It seems to me like these are just the regular MultiPL-E tasks, without anything OLMES-specific.
Fixed the names to not have the _OLMES suffix.
RESPONSE_TYPE = ResponseType.COMPLETION
METRICS = [MultiPLECodeAssertion]
SUBJECTS = [NO_SUBJECT]
LANGUAGE = None
Perhaps this should be set to English, since all the comments/instructions are in English and therefore required to solve the tasks.
def _execute_via_sandbox_run(self, full_code: str, sandbox_lang: str) -> tuple[bool, str]:
    """Use llm-sandbox's native session.run() for cpp, java, js."""
    with SandboxSession(lang=sandbox_lang, keep_template=True, commit_container=False) as session:
        result: Any = session.run(full_code)
        output: str = getattr(result, "text", "") or ""
        exit_code: int = getattr(result, "exit_code", -1)
        return exit_code == 0, output

def _execute_via_custom_image(self, full_code: str, cfg: _CustomLangConfig) -> tuple[bool, str]:
    """For php/rs/sh: copy code into a language-specific Docker image and run manually.

    We open a SandboxSession with a dummy lang (SupportedLanguage.PYTHON) so that
    llm-sandbox sets up the container plumbing, but we never call session.run().
    Instead we write the code to a local temp file, copy it into the container with
    copy_to_runtime, and drive each compile/run command with execute_command.
    """
    code_file = f"/tmp/code.{cfg.file_ext}"
    output = ""
    exit_code = -1

    with tempfile.NamedTemporaryFile(suffix=f".{cfg.file_ext}", mode="w", delete=False) as tmp:
        tmp.write(full_code)
        tmp_path = tmp.name

    try:
        with SandboxSession(
I saw in the utils that we have added a timeout command. Is this something we want to employ here as well? It seems useful, as generated code might otherwise hang indefinitely.
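One lightweight way to do this is to prefix each in-container command with GNU `timeout`, which kills the process after the given duration (exit code 124). A minimal sketch of a helper that builds such a command string (the helper name and default are assumptions, not the utils implementation referenced above):

```python
def with_timeout(command: str, seconds: int = 30) -> str:
    """Prefix a shell command with GNU `timeout` so hanging generated
    code is killed after `seconds` seconds (timeout exits with 124)."""
    return f"timeout {seconds}s {command}"


# The wrapped string would then be passed to execute_command, e.g.:
# session.execute_command(with_timeout("/tmp/a.out", 10))
```

This keeps the timeout inside the container, so it also covers compile steps that loop forever.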
fsschneider
left a comment
LGTM. I don't know all the details about sandboxing, etc., so my code review might have gaps.
I only added one comment about what could perhaps be a missing difference between ours and OLMES, but I haven't checked it.
for stop_seq in self.stop_sequences:
    if stop_seq in completion_text:
        completion_text = completion_text.split(stop_seq)[0]
return completion_text
If I understand it correctly, when multiple stop sequences match, we check them sequentially: look for matches with the first stop sequence in the list, then check the next one on the (possibly) trimmed completion.
So if we have multiple matches, the returned completion depends on the order of self.stop_sequences, right? Could that be a difference from the OLMES implementation? E.g. either a different order there, or they try to find the earliest match?
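For non-overlapping stop sequences the sequential splits actually converge to the same prefix regardless of order, since each split can only shorten the text; order only matters when matches overlap (e.g. one stop sequence starting inside another's match). An order-independent alternative cuts at the earliest occurrence of any stop sequence. A minimal standalone sketch, not the actual implementation:

```python
def trim_at_earliest_stop(completion_text: str, stop_sequences: list[str]) -> str:
    """Cut the completion at the earliest match of any stop sequence,
    so the result does not depend on the order of stop_sequences."""
    cut = len(completion_text)
    for stop_seq in stop_sequences:
        idx = completion_text.find(stop_seq)
        if idx != -1:
            cut = min(cut, idx)  # keep only the earliest cut point
    return completion_text[:cut]
```

With overlapping sequences like `["b", "ab"]` on `"xaby"`, sequential splitting returns `"xa"` or `"x"` depending on order, while the earliest-match version always returns `"x"`.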
PR Checklist
What type of PR is this? (check all applicable)
Description
Adds the MultiPL-E HumanEval & MBPP tasks, as implemented in the OLMES framework.
Added/updated tests?
Tests have not been included.