
Conversation

@NathanHB
Member

@NathanHB NathanHB commented Oct 20, 2025

Adds inspect-ai as a backend for lighteval, offloading backend implementation and maintenance.

This allows for:

  • better logs
  • better parallelization
  • easier task additions

Tasks currently compatible with Inspect AI (eventually, all tasks will be compatible):

  • gpqa (fewshot compatible)
  • ifeval
  • hle
  • gsm8k (fewshot compatible)
  • agieval
  • aime24, aime25

Run Llama-3.1-8B-Instruct with all available providers on hf-inference-providers, on gpqa, agieval and aime25:

lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1

Results:

|                                Model                                 |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |   0.53|     0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|   0.71|     1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |   0.53|     0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |   0.65|     0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |   0.35|     0|0.25|

Compare the few-shot difference on gsm8k (0-shot vs 3-shot):

lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gsm8k|0,lighteval|gsm8k|3" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1

|                                Model                                 |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |  0.7|          0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |  0.5|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |  0.4|          0.8|

@NathanHB NathanHB changed the base branch from main to nathan-reorg-tasks October 20, 2025 19:15
@NathanHB NathanHB marked this pull request as draft October 20, 2025 19:16
@NathanHB NathanHB marked this pull request as ready for review October 29, 2025 11:05
@NathanHB NathanHB requested a review from Copilot October 29, 2025 11:08
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds integration with the Inspect AI framework to enable evaluation using Inspect AI's data model and scorers. It introduces support for new task configuration fields (solver, scorer, sample_fields, sample_to_fewshot, filter) and implements Inspect AI-compatible scorers for math and multiple-choice evaluations.

Key Changes

  • Added Inspect AI-compatible scorers (math_scorer, multichoice_scorer) and custom task scorers for IFEval and IFBench
  • Introduced new task configuration fields to support Inspect AI's Sample-based evaluation flow
  • Modified the get_extraction_regexes function signature to accept len_choices as a parameter instead of extracting it from a Doc object
  • Added InspectAIModelConfig class to support Inspect AI model configuration
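
For readers unfamiliar with Inspect AI, here is a minimal standalone Inspect AI task showing what the new configuration fields listed above (sample_fields, solver, scorer) map onto; the dataset path, record keys and task name are placeholders, not code from this PR.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


def record_to_sample(record: dict) -> Sample:
    # Map a raw dataset record onto Inspect AI's Sample data model.
    return Sample(
        input=record["question"],
        choices=record["choices"],
        target=record["answer"],  # e.g. "A"
    )


@task
def demo_mcq():
    # Illustrative only: dataset path and split are placeholders.
    return Task(
        dataset=hf_dataset("org/dataset", split="test", sample_fields=record_to_sample),
        solver=multiple_choice(),
        scorer=choice(),
    )
```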

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Summary per file:

|File|Description|
|----|-----------|
|src/lighteval/tasks/tasks/ifeval/main.py|Adds Inspect AI scorer and sample conversion for IFEval task evaluation|
|src/lighteval/tasks/tasks/ifbench/main.py|Adds Inspect AI scorer and sample conversion for IFBench task evaluation|
|src/lighteval/tasks/tasks/hle/main.py|Adds Inspect AI sample conversion and model-graded fact checker for HLE task|
|src/lighteval/tasks/tasks/gsm_plus.py|Adds math scorer with prompt template and sample conversion for GSM Plus task|
|src/lighteval/tasks/tasks/gsm8k.py|Adds math scorer with prompt template and sample conversion for GSM8K task|
|src/lighteval/tasks/tasks/gpqa.py|Adds multiple-choice solver and choice scorer with random answer shuffling for GPQA task|
|src/lighteval/tasks/tasks/aime.py|Adds math scorer with prompt template and sample conversion for AIME task|
|src/lighteval/tasks/tasks/agieval.py|Adds multiple-choice solver and choice scorer with sample conversion for AGIEval task|
|src/lighteval/tasks/lighteval_task.py|Adds Inspect AI compatible configuration fields to LightevalTaskConfig|
|src/lighteval/models/abstract_model.py|Adds InspectAIModelConfig class for Inspect AI model configuration|
|src/lighteval/metrics/utils/extractive_match_utils.py|Refactors get_extraction_regexes to accept len_choices parameter instead of Doc object|
|src/lighteval/metrics/metrics.py|Implements math_scorer and multichoice_scorer for Inspect AI integration|
|src/lighteval/main.py|Adds Inspect AI evaluation backend command|

Comments suppressed due to low confidence (1)

src/lighteval/metrics/metrics.py:108

  • This statement is unreachable.
        return Score(value=1)


@HuggingFaceDocBuilderDev
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB NathanHB changed the base branch from nathan-reorg-tasks to main October 30, 2025 12:37
Member

@clefourrier clefourrier left a comment

do you need a review at this stage?


def record_to_sample(record):
    return Sample(
        input=record["question"],
Member

you're not using the same prompt as in the base eval (with "Question: ... " etc appended)
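
For illustration, one way to keep the base eval's prompt would be to apply the template inside the sample conversion (template wording and record keys here are assumptions, not the PR's actual code):

```python
from inspect_ai.dataset import Sample

QUERY_TEMPLATE = "Question: {question}\nAnswer:"  # assumed wording, for illustration only


def record_to_sample(record):
    return Sample(
        input=QUERY_TEMPLATE.format(question=record["question"]),
        target=record["answer"],
    )
```

Alternatively, Inspect AI's prompt_template() solver can apply the same template at solve time and leave the Sample untouched.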


Member

then you need to revert ours for consistency

@NathanHB
Member Author

> do you need a review at this stage?

yes! it's running, and even though we don't yet have every task and all the bells and whistles, it should be good to review

Member

@clefourrier clefourrier left a comment

lgtm, a couple nits atm, but cool work, looking forward to how the code base will simplify with inspect!

You're also missing a doc update for the whole new feature

    target.text, gold_extraction_regexes, fallback_mode, extraction_mode, timeout_seconds
)
return Score(
    value="C" if extracted_predictions == extracted_gold else "I",
Member

explain this line
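
For context: "C" and "I" are Inspect AI's conventional correct/incorrect score values; a small sketch using the library's named constants would make the line self-documenting:

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score


def to_score(extracted_predictions, extracted_gold) -> Score:
    # CORRECT == "C" and INCORRECT == "I" in Inspect AI's scoring conventions.
    return Score(value=CORRECT if extracted_predictions == extracted_gold else INCORRECT)
```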

Comment on lines +113 to +126
def multichoice_scorer():
    language = Language.ENGLISH
    gold_extraction_target = (
        IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
    )
    pred_extraction_target = (
        IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
    )
    fallback_mode = "first_match"
    extraction_mode = "first_match"
    timeout_seconds = 5

    gold_extraction_regexes = get_extraction_regexes_inspect(gold_extraction_target, language)
    pred_extraction_regexes = get_extraction_regexes_inspect(pred_extraction_target, language)
Member

this behavior of nested functions behaving as classes is really meh for legibility, customizability and maintainability

Comment on lines +159 to +179
max_tokens: int | None = None
system_message: str | None = None
temperature: float | None = None
top_p: float | None = None
top_k: int | None = None
frequence_penalty: float | None = None
presence_penalty: float | None = None
seed: int | None = None
stop_seqs: list[str] | None = None
num_choices: int | None = None
best_of: int | None = None
log_probs: bool | None = None
top_logprobs: int | None = None
cache_prompt: bool | None = None
reasoning_effort: int | None = None
reasoning_tokens: int | None = None
reasoning_history: bool | None = None
response_format: str | None = None
parallel_tool_calls: bool | None = None
max_tool_output: int | None = None
internal_tools: bool | None = None
Member

could we factorize with the other model classes?
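
Purely as a sketch of the factorization idea (names are assumptions, not lighteval's actual classes): a shared dataclass of common generation parameters that InspectAIModelConfig and the existing model configs could inherit from, keeping only backend-specific knobs in each subclass.

```python
from dataclasses import dataclass


@dataclass
class SharedGenerationParams:
    # Parameters common to all backends (illustrative subset).
    max_tokens: int | None = None
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    seed: int | None = None
    stop_seqs: list[str] | None = None


@dataclass
class InspectAIModelConfig(SharedGenerationParams):
    # Only Inspect-AI-specific options would live here.
    cache_prompt: bool | None = None
    reasoning_effort: int | None = None
    reasoning_tokens: int | None = None
```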

)
def ifeval_scorer():
    async def score(state: TaskState, target: Target):
        response = state.output.completion
Member

would probably be simpler if you put the preprocessing in its own function, so you can easily read the score
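
A structural sketch of that suggestion (placeholder preprocessing and scoring, not the PR's actual IFEval checks):

```python
from inspect_ai.scorer import Score, Target
from inspect_ai.solver import TaskState


def _preprocess_response(state: TaskState) -> str:
    # Placeholder: the PR's actual response cleanup would go here.
    return state.output.completion.strip()


def ifeval_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        response = _preprocess_response(state)
        # ... run the instruction-following checks on `response` ...
        return Score(value=1.0 if response else 0.0)  # stand-in scoring logic

    return score
```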

def record_to_sample(record):
    # we need to remove prepended (A), (B), (C), (D) from the choices
    choices = [
        c.replace("(A)", "").replace("(B)", "").replace("(C)", "").replace("(D)", "").strip()
Member

hm did you check if there are also spaces after the letters? "(A) Sentence" for ex ?
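
For illustration, a single regex would cover both "(A)Sentence" and "(A) Sentence" (and letters beyond D) instead of chaining replace calls:

```python
import re

_LETTER_PREFIX = re.compile(r"^\s*\([A-Z]\)\s*")


def strip_letter_prefix(choice: str) -> str:
    # Removes a leading "(A)"-style marker plus any surrounding whitespace.
    return _LETTER_PREFIX.sub("", choice).strip()


assert strip_letter_prefix("(A) Sentence") == "Sentence"
assert strip_letter_prefix("(B)Sentence") == "Sentence"
```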

    return Sample(
        input=record["Question"].strip(),
        choices=choices,
        target=ascii_uppercase[gold_index],
Member

we used to use LETTERS in place of this, LETTERS can probably be removed from the tasks then


# Inspect AI compatible parameters
solver: None = None
scorer: None = None
Member

Would make sense to factorize to avoid having 2 different ways to launch evals; otherwise it will mess up the source of truth

Member Author

factorize with current metrics? it would be quite messy i think. having it separate for now seems better, as we will eventually only use the scorer mechanic!

Comment on lines +110 to +131
def results_to_markdown_table(
results_per_model_per_task,
metric: str = "accuracy",
stderr_metric: str = "stderr",
max_total_columns: int | None = None,
means_only_task_threshold: int = 10,
) -> str:
cols = _collect_columns(results_per_model_per_task, means_only_task_threshold, max_total_columns)

writer = MarkdownTableWriter()
writer.headers = ["Model"] + cols

rows = []
for model in sorted(results_per_model_per_task.keys()):
row = [model]
data = results_per_model_per_task[model]
for col in cols:
row.append(_format_metric_cell(data, col, metric, stderr_metric))
rows.append(row)

writer.value_matrix = rows
return writer.dumps()
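
Hypothetical usage of the helper above, assuming results_per_model_per_task maps model → task → metric dict (the data shape and numbers are assumptions for illustration):

```python
results = {
    "hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras": {
        "gsm8k": {"accuracy": 0.6, "stderr": 0.05},
        "gpqa": {"accuracy": 0.33, "stderr": 0.04},
    },
}
print(results_to_markdown_table(results))
```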
Member

could you reuse the output functions we already have?
