Adds inspectai #1022
Conversation
Pull Request Overview
This PR adds integration with the Inspect AI framework to enable evaluation using Inspect AI's data model and scorers. It introduces support for new task configuration fields (solver, scorer, sample_fields, sample_to_fewshot, filter) and implements Inspect AI-compatible scorers for math and multiple-choice evaluations.
Key Changes
- Added Inspect AI-compatible scorers (math_scorer, multichoice_scorer) and custom task scorers for IFEval and IFBench
- Introduced new task configuration fields to support Inspect AI's Sample-based evaluation flow (sketched below)
- Modified the get_extraction_regexes function signature to accept len_choices as a parameter instead of extracting it from a Doc object
- Added an InspectAIModelConfig class to support Inspect AI model configuration
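For readers new to the framework, here is a minimal sketch of Inspect AI's Sample-based flow using the upstream inspect_ai API (not this PR's code; the dataset path, config name, and field names are illustrative):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate


def record_to_sample(record):
    # map one raw dataset row onto Inspect AI's Sample data model
    return Sample(input=record["question"], target=record["answer"])


@task
def demo_task():
    return Task(
        # sample_fields plugs the record converter into dataset loading
        dataset=hf_dataset("openai/gsm8k", name="main", split="test", sample_fields=record_to_sample),
        solver=[generate()],  # how the model is prompted
        scorer=match(),       # how completions are graded
    )
```

The PR's new LightevalTaskConfig fields (solver, scorer, sample_fields, sample_to_fewshot, filter) map onto these Task arguments.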
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/lighteval/tasks/tasks/ifeval/main.py | Adds Inspect AI scorer and sample conversion for IFEval task evaluation |
| src/lighteval/tasks/tasks/ifbench/main.py | Adds Inspect AI scorer and sample conversion for IFBench task evaluation |
| src/lighteval/tasks/tasks/hle/main.py | Adds Inspect AI sample conversion and model-graded fact checker for HLE task |
| src/lighteval/tasks/tasks/gsm_plus.py | Adds math scorer with prompt template and sample conversion for GSM Plus task |
| src/lighteval/tasks/tasks/gsm8k.py | Adds math scorer with prompt template and sample conversion for GSM8K task |
| src/lighteval/tasks/tasks/gpqa.py | Adds multiple-choice solver and choice scorer with random answer shuffling for GPQA task |
| src/lighteval/tasks/tasks/aime.py | Adds math scorer with prompt template and sample conversion for AIME task |
| src/lighteval/tasks/tasks/agieval.py | Adds multiple-choice solver and choice scorer with sample conversion for AGIEval task |
| src/lighteval/tasks/lighteval_task.py | Adds Inspect AI compatible configuration fields to LightevalTaskConfig |
| src/lighteval/models/abstract_model.py | Adds InspectAIModelConfig class for Inspect AI model configuration |
| src/lighteval/metrics/utils/extractive_match_utils.py | Refactors get_extraction_regexes to accept a len_choices parameter instead of a Doc object |
| src/lighteval/metrics/metrics.py | Implements math_scorer and multichoice_scorer for Inspect AI integration |
| src/lighteval/main.py | Adds Inspect AI evaluation backend command |
Comments suppressed due to low confidence (1)
src/lighteval/metrics/metrics.py:108: This statement is unreachable.

```python
        return Score(value=1)
```
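For illustration, a hypothetical shape of code that triggers this warning (not the PR's actual function): every path above the flagged line already returns, so the trailing return can never execute.

```python
from inspect_ai.scorer import Score


# hypothetical illustration only; not the code under review
def to_score(matched: bool) -> Score:
    if matched:
        return Score(value=1)
    else:
        return Score(value=0)
    return Score(value=1)  # unreachable: both branches above already return
```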
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
do you need a review at this stage?
```python
def record_to_sample(record):
    return Sample(
        input=record["question"],
```
you're not using the same prompt as in the base eval (with "Question: ... " etc appended)
Added the system prompt they are using:
https://github.com/centerforaisafety/hle/blob/main/hle_eval/run_model_predictions.py
then you need to revert ours for consistency
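For what it's worth, Inspect AI ships a system_message solver that can carry that prompt. A sketch against the upstream API; the prompt text here is a placeholder for the one in the linked script:

```python
from inspect_ai.solver import generate, system_message

# placeholder standing in for the HLE system prompt from the linked script
HLE_SYSTEM_PROMPT = "Your response should be in the following format: ..."

# prepends the system prompt before generation in the solver chain
solver = [system_message(HLE_SYSTEM_PROMPT), generate()]
```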
lgtm, a couple nits atm, but cool work, looking forward to how the code base will simplify with inspect!
You're also missing a doc update for the whole new feature
```python
            target.text, gold_extraction_regexes, fallback_mode, extraction_mode, timeout_seconds
        )
        return Score(
            value="C" if extracted_predictions == extracted_gold else "I",
```
explain this line
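Context that may answer the question: Inspect AI's scorer module defines the constants CORRECT == "C" and INCORRECT == "I", and those strings are what the Score value encodes. Using the named constants would make the line self-documenting:

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score


def to_score(extracted_predictions, extracted_gold) -> Score:
    # CORRECT == "C" and INCORRECT == "I" in inspect_ai.scorer, so this is
    # equivalent to the flagged line but spells out what the strings mean
    return Score(value=CORRECT if extracted_predictions == extracted_gold else INCORRECT)
```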
```python
def multichoice_scorer():
    language = Language.ENGLISH
    gold_extraction_target = (
        IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
    )
    pred_extraction_target = (
        IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
    )
    fallback_mode = "first_match"
    extraction_mode = "first_match"
    timeout_seconds = 5

    gold_extraction_regexes = get_extraction_regexes_inspect(gold_extraction_target, language)
    pred_extraction_regexes = get_extraction_regexes_inspect(pred_extraction_target, language)
```
this behavior of nested functions behaving as classes is really meh for legibility, customizability and maintainability
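One possible refactor in the direction of this comment (a sketch only; ExtractionSettings and the import paths are assumptions, not PR code): lift the closed-over configuration into a dataclass so it is visible and overridable.

```python
from dataclasses import dataclass

# import paths assumed from the reviewed module; may differ in the PR
from lighteval.metrics.utils.extractive_match_utils import IndicesExtractionConfig
from lighteval.utils.language import Language


@dataclass(frozen=True)
class ExtractionSettings:
    # hypothetical container for what the nested function currently hard-codes
    language: Language = Language.ENGLISH
    fallback_mode: str = "first_match"
    extraction_mode: str = "first_match"
    timeout_seconds: int = 5


def multichoice_scorer(settings: ExtractionSettings = ExtractionSettings()):
    target = (
        IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
    )
    # the rest of the scorer reads its knobs from `settings` instead of locals
    ...
```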
```python
    max_tokens: int | None = None
    system_message: str | None = None
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    frequency_penalty: float | None = None
    presence_penalty: float | None = None
    seed: int | None = None
    stop_seqs: list[str] | None = None
    num_choices: int | None = None
    best_of: int | None = None
    log_probs: bool | None = None
    top_logprobs: int | None = None
    cache_prompt: bool | None = None
    reasoning_effort: int | None = None
    reasoning_tokens: int | None = None
    reasoning_history: bool | None = None
    response_format: str | None = None
    parallel_tool_calls: bool | None = None
    max_tool_output: int | None = None
    internal_tools: bool | None = None
```
could we factorize with the other model classes?
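In the spirit of this suggestion, a hypothetical factorization (all names invented for illustration): shared generation parameters move into a base dataclass that every backend config extends.

```python
from dataclasses import dataclass


@dataclass
class GenerationParams:
    # knobs common to every backend (hypothetical base class)
    max_tokens: int | None = None
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    seed: int | None = None
    stop_seqs: list[str] | None = None


@dataclass
class InspectAIModelConfig(GenerationParams):
    # only the Inspect-AI-specific fields remain here
    reasoning_effort: int | None = None
    reasoning_tokens: int | None = None
    cache_prompt: bool | None = None
```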
```python
)
def ifeval_scorer():
    async def score(state: TaskState, target: Target):
        response = state.output.completion
```
would probably be simpler if you put the preprocessing in its own function, so you can easily read the score
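A sketch of what that might look like (the helper name is invented; only the first line of the real preprocessing is visible in the diff):

```python
from inspect_ai.scorer import Target
from inspect_ai.solver import TaskState


def _preprocess_response(state: TaskState) -> str:
    # hypothetical extraction point: the instruction-following checks that
    # currently sit inline in score() would move here
    return state.output.completion.strip()


def ifeval_scorer():
    async def score(state: TaskState, target: Target):
        response = _preprocess_response(state)
        # scoring logic now reads at a single level of abstraction
        ...
    return score
```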
```python
def record_to_sample(record):
    # we need to remove prepended (A), (B), (C), (D) from the choices
    choices = [
        c.replace("(A)", "").replace("(B)", "").replace("(C)", "").replace("(D)", "").strip()
```
hm did you check if there are also spaces after the letters? "(A) Sentence" for example?
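One way to make the stripping robust to that case regardless (a suggested pattern, not the PR's code): a single regex that eats the letter prefix plus any whitespace that follows it.

```python
import re


def strip_choice_prefix(choice: str) -> str:
    # removes "(A)", "(B)", ... plus any whitespace after the prefix
    return re.sub(r"^\([A-D]\)\s*", "", choice).strip()


assert strip_choice_prefix("(A) Sentence") == "Sentence"
assert strip_choice_prefix("(B)Sentence") == "Sentence"
```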
```python
    return Sample(
        input=record["Question"].strip(),
        choices=choices,
        target=ascii_uppercase[gold_index],
```
we used to use LETTERS in place of this, LETTERS can probably be removed from the tasks then
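For reference, the stdlib constant already behaves like a hand-rolled LETTERS table:

```python
from string import ascii_uppercase

# ascii_uppercase is "ABC...Z", so indexing by gold_index yields the letter
assert ascii_uppercase[:4] == "ABCD"
assert ascii_uppercase[2] == "C"
```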
```python
    # Inspect AI compatible parameters
    solver: None = None
    scorer: None = None
```
Would make sense to factorize to avoid having 2 different ways to launch evals, it will mess up the source of truth
factorize with current metrics? it would be quite messy i think. having it separate for now seems better as we will in the long term only use the scorer mechanic!
```python
def results_to_markdown_table(
    results_per_model_per_task,
    metric: str = "accuracy",
    stderr_metric: str = "stderr",
    max_total_columns: int | None = None,
    means_only_task_threshold: int = 10,
) -> str:
    cols = _collect_columns(results_per_model_per_task, means_only_task_threshold, max_total_columns)

    writer = MarkdownTableWriter()
    writer.headers = ["Model"] + cols

    rows = []
    for model in sorted(results_per_model_per_task.keys()):
        row = [model]
        data = results_per_model_per_task[model]
        for col in cols:
            row.append(_format_metric_cell(data, col, metric, stderr_metric))
        rows.append(row)

    writer.value_matrix = rows
    return writer.dumps()
```
could you reuse the output functions we already have?
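For illustration, the call shape this function expects (the input layout is inferred from the loop over results_per_model_per_task, so treat it as an assumption):

```python
# hypothetical results layout: {model: {task: {metric: value, ...}}}
results = {
    "llama-3.1-8b": {
        "gsm8k": {"accuracy": 0.84, "stderr": 0.01},
        "aime25": {"accuracy": 0.12, "stderr": 0.02},
    },
}
print(results_to_markdown_table(results, metric="accuracy", stderr_metric="stderr"))
```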
adds inspect-ai as a backend for lighteval! Offloading backend implementation and maintenance.
tasks compatible with inspect ai (eventually, all tasks will be compatible):
run llama3.1-8b using all providers on hf-inference-providers on gpqa, agieval and aime25. result:
compare few-shot diffs on gsm8k
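For context, a task expressed in Inspect AI terms can be launched through the upstream eval() API; a sketch reusing the demo_task from the overview above (the provider prefix and model id are illustrative, and the PR's own CLI entry point may differ):

```python
from inspect_ai import eval

# runs the sketched task against a model; limit caps the number of samples
logs = eval(demo_task(), model="hf/meta-llama/Llama-3.1-8B-Instruct", limit=100)
```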