How to evaluate my own generated code?

Hi RepoExec team, 
I'm confused about how to use the RepoExec dataset to evaluate my own generated code.
Take **load_dataset("Fsoft-AIC/RepoExec")["full_context"][0]** as an example.
Suppose I generated the **reverse** function below:
```
def reverse(input_string: str) -> str:
    """
    Returns the string with its chars reversed.
    """
    if not is_string(input_string):
        raise InvalidInputError(input_string)
    return input_string[::-1]
```
(The task for **load_dataset("Fsoft-AIC/RepoExec")["full_context"][0]** is to generate the **reverse** function, right?)
I thought it would be evaluated by "process_results", which is defined in lm_eval/tasks/repoexec.py.
In process_results, the generations and references are passed to the metric like this:
```
        code_metric = load("code_eval")
        results, _ = code_metric.compute(
            references=references,
            predictions=generations,
        )
```
In the [compute function](https://huggingface.co/spaces/evaluate-metric/code_eval/blob/main/code_eval.py), each prediction is concatenated with its reference:
```
            for task_id, (candidates, test_case) in enumerate(zip(predictions, references)):
                for candidate in candidates:
                    test_program = candidate + "\n" + test_case
                    args = (test_program, timeout, task_id, completion_id[task_id])
                    future = executor.submit(check_correctness, *args)
```
However, a **reverse** function is already defined in load_dataset("Fsoft-AIC/RepoExec")["full_context"][0]["check"]. I've printed the content of check to **test_case.log** and attached it.

Won't the **reverse** function defined in the dataset overwrite the one I generated during evaluation? If so, my code would never be tested.
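
A minimal sketch of the concern, assuming the test program is built as `candidate + "\n" + test_case` as in the code_eval snippet above. The `candidate` and `test_case` strings here are hypothetical stand-ins, not the actual RepoExec data; the candidate deliberately returns a marker string so it is visible which definition ends up being called:

```python
# Hypothetical stand-ins for a generated solution and a "check" string
# that redefines the same function. Not the real dataset content.
candidate = (
    "def reverse(input_string):\n"
    "    return 'from candidate'\n"
)
test_case = (
    "def reverse(input_string):\n"
    "    return input_string[::-1]\n"
    "\n"
    "result = reverse('abc')\n"
)

# Same concatenation as in code_eval's compute function.
test_program = candidate + "\n" + test_case

# Run the combined program in a fresh namespace.
namespace = {}
exec(test_program, namespace)

# The later `def` rebinds the name, so the candidate is never called.
print(namespace["result"])  # prints "cba"
```

In plain Python, a second `def` with the same name replaces the first, so in this sketch only the definition from the test string is exercised.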

Could you please help me with this question?

Thanks a lot

Lin 

[test_case.log](https://github.com/user-attachments/files/19794277/test_case.log)

