
Conversation

@NathanHB
Member

@NathanHB NathanHB commented Oct 20, 2025

Adds inspect-ai as a backend for lighteval, offloading backend implementation and maintenance.

This allows for:

  • better logs
  • better parallelization
  • easier task additions

Tasks currently compatible with Inspect AI (eventually, all tasks will be compatible):

  • gpqa (fewshot compatible)
  • ifeval
  • hle
  • gsm8k (fewshot compatible)
  • agieval
  • aime24, aime25

Run Llama-3.1-8B-Instruct with all available providers on hf-inference-providers, on gpqa, agieval and aime25:

lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1

Results:

|                                Model                                 |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |   0.53|     0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|   0.71|     1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |   0.53|     0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |   0.65|     0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |   0.35|     0|0.25|

Compare the few-shot difference on gsm8k (0-shot vs 3-shot):

lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gsm8k|0,lighteval|gsm8k|3" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1

|                                Model                                 |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |  0.7|          0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |  0.5|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |  0.4|          0.8|

@NathanHB NathanHB changed the base branch from main to nathan-reorg-tasks October 20, 2025 19:15
@NathanHB NathanHB marked this pull request as draft October 20, 2025 19:16
@NathanHB NathanHB marked this pull request as ready for review October 29, 2025 11:05
@NathanHB NathanHB requested a review from Copilot October 29, 2025 11:08
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds integration with the Inspect AI framework to enable evaluation using Inspect AI's data model and scorers. It introduces support for new task configuration fields (solver, scorer, sample_fields, sample_to_fewshot, filter) and implements Inspect AI-compatible scorers for math and multiple-choice evaluations.

Key Changes

  • Added Inspect AI-compatible scorers (math_scorer, multichoice_scorer) and custom task scorers for IFEval and IFBench
  • Introduced new task configuration fields to support Inspect AI's Sample-based evaluation flow
  • Modified the get_extraction_regexes function signature to accept len_choices as a parameter instead of extracting it from a Doc object
  • Added InspectAIModelConfig class to support Inspect AI model configuration
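
For readers unfamiliar with Inspect AI, here is a minimal standalone Inspect AI task showing what the new configuration fields listed above (sample_fields, solver, scorer) map onto; the dataset path, record keys and task name are placeholders, not code from this PR.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


def record_to_sample(record: dict) -> Sample:
    # Map a raw dataset record onto Inspect AI's Sample data model.
    return Sample(
        input=record["question"],
        choices=record["choices"],
        target=record["answer"],  # e.g. "A"
    )


@task
def demo_mcq():
    # Illustrative only: dataset path and split are placeholders.
    return Task(
        dataset=hf_dataset("org/dataset", split="test", sample_fields=record_to_sample),
        solver=multiple_choice(),
        scorer=choice(),
    )
```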

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Summary per file:

|File|Description|
|----|-----------|
|src/lighteval/tasks/tasks/ifeval/main.py|Adds Inspect AI scorer and sample conversion for IFEval task evaluation|
|src/lighteval/tasks/tasks/ifbench/main.py|Adds Inspect AI scorer and sample conversion for IFBench task evaluation|
|src/lighteval/tasks/tasks/hle/main.py|Adds Inspect AI sample conversion and model-graded fact checker for HLE task|
|src/lighteval/tasks/tasks/gsm_plus.py|Adds math scorer with prompt template and sample conversion for GSM Plus task|
|src/lighteval/tasks/tasks/gsm8k.py|Adds math scorer with prompt template and sample conversion for GSM8K task|
|src/lighteval/tasks/tasks/gpqa.py|Adds multiple-choice solver and choice scorer with random answer shuffling for GPQA task|
|src/lighteval/tasks/tasks/aime.py|Adds math scorer with prompt template and sample conversion for AIME task|
|src/lighteval/tasks/tasks/agieval.py|Adds multiple-choice solver and choice scorer with sample conversion for AGIEval task|
|src/lighteval/tasks/lighteval_task.py|Adds Inspect AI compatible configuration fields to LightevalTaskConfig|
|src/lighteval/models/abstract_model.py|Adds InspectAIModelConfig class for Inspect AI model configuration|
|src/lighteval/metrics/utils/extractive_match_utils.py|Refactors get_extraction_regexes to accept len_choices parameter instead of Doc object|
|src/lighteval/metrics/metrics.py|Implements math_scorer and multichoice_scorer for Inspect AI integration|
|src/lighteval/main.py|Adds Inspect AI evaluation backend command|

Comments suppressed due to low confidence (1)

src/lighteval/metrics/metrics.py:108

  • This statement is unreachable.
        return Score(value=1)


@HuggingFaceDocBuilderDev
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB NathanHB changed the base branch from nathan-reorg-tasks to main October 30, 2025 12:37
Member

@clefourrier clefourrier left a comment

do you need a review at this stage?


def record_to_sample(record):
    return Sample(
        input=record["question"],
Member

you're not using the same prompt as in the base eval (with "Question: ... " etc appended)
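
For illustration, one way to keep the base eval's prompt would be to apply the template inside the sample conversion (template wording and record keys here are assumptions, not the PR's actual code):

```python
from inspect_ai.dataset import Sample

QUERY_TEMPLATE = "Question: {question}\nAnswer:"  # assumed wording, for illustration only


def record_to_sample(record):
    return Sample(
        input=QUERY_TEMPLATE.format(question=record["question"]),
        target=record["answer"],
    )
```

Alternatively, Inspect AI's prompt_template() solver can apply the same template at solve time and leave the Sample untouched.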


Member

then you need to revert ours for consistency

@NathanHB
Member Author

> do you need a review at this stage?

yes! it's running, and even though we don't yet have every task and all the bells and whistles, it should be good to review

Member

@clefourrier clefourrier left a comment

lgtm, a couple nits atm, but cool work, looking forward to how the code base will simplify with inspect!

You're also missing a doc update for the whole new feature

    target.text, gold_extraction_regexes, fallback_mode, extraction_mode, timeout_seconds
)
return Score(
    value="C" if extracted_predictions == extracted_gold else "I",
Member

explain this line
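
For context: "C" and "I" are Inspect AI's conventional correct/incorrect score values; a small sketch using the library's named constants would make the line self-documenting:

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score


def to_score(extracted_predictions, extracted_gold) -> Score:
    # CORRECT == "C" and INCORRECT == "I" in Inspect AI's scoring conventions.
    return Score(value=CORRECT if extracted_predictions == extracted_gold else INCORRECT)
```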

Comment on lines +113 to +126
def multichoice_scorer():
    language = Language.ENGLISH
    gold_extraction_target = (
        IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
    )
    pred_extraction_target = (
        IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
    )
    fallback_mode = "first_match"
    extraction_mode = "first_match"
    timeout_seconds = 5

    gold_extraction_regexes = get_extraction_regexes_inspect(gold_extraction_target, language)
    pred_extraction_regexes = get_extraction_regexes_inspect(pred_extraction_target, language)
Member

this behavior of nested functions behaving as classes is really meh for legibility, customizability and maintainability

Comment on lines +159 to +179
max_tokens: int | None = None
system_message: str | None = None
temperature: float | None = None
top_p: float | None = None
top_k: int | None = None
frequence_penalty: float | None = None
presence_penalty: float | None = None
seed: int | None = None
stop_seqs: list[str] | None = None
num_choices: int | None = None
best_of: int | None = None
log_probs: bool | None = None
top_logprobs: int | None = None
cache_prompt: bool | None = None
reasoning_effort: int | None = None
reasoning_tokens: int | None = None
reasoning_history: bool | None = None
response_format: str | None = None
parallel_tool_calls: bool | None = None
max_tool_output: int | None = None
internal_tools: bool | None = None
Member

could we factorize with the other model classes?
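
Purely as a sketch of the factorization idea (names are assumptions, not lighteval's actual classes): a shared dataclass of common generation parameters that InspectAIModelConfig and the existing model configs could inherit from, keeping only backend-specific knobs in each subclass.

```python
from dataclasses import dataclass


@dataclass
class SharedGenerationParams:
    # Parameters common to all backends (illustrative subset).
    max_tokens: int | None = None
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    seed: int | None = None
    stop_seqs: list[str] | None = None


@dataclass
class InspectAIModelConfig(SharedGenerationParams):
    # Only Inspect-AI-specific options would live here.
    cache_prompt: bool | None = None
    reasoning_effort: int | None = None
    reasoning_tokens: int | None = None
```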

)
def ifeval_scorer():
    async def score(state: TaskState, target: Target):
        response = state.output.completion
Member

would probably be simpler if you put the preprocessing in its own function, so you can easily read the score
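
A structural sketch of that suggestion (placeholder preprocessing and scoring, not the PR's actual IFEval checks):

```python
from inspect_ai.scorer import Score, Target
from inspect_ai.solver import TaskState


def _preprocess_response(state: TaskState) -> str:
    # Placeholder: the PR's actual response cleanup would go here.
    return state.output.completion.strip()


def ifeval_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        response = _preprocess_response(state)
        # ... run the instruction-following checks on `response` ...
        return Score(value=1.0 if response else 0.0)  # stand-in scoring logic

    return score
```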

def record_to_sample(record):
    # we need to remove prepended (A), (B), (C), (D) from the choices
    choices = [
        c.replace("(A)", "").replace("(B)", "").replace("(C)", "").replace("(D)", "").strip()
Member

hm did you check if there are also spaces after the letters? "(A) Sentence" for ex ?
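
For illustration, a single regex would cover both "(A)Sentence" and "(A) Sentence" (and letters beyond D) instead of chaining replace calls:

```python
import re

_LETTER_PREFIX = re.compile(r"^\s*\([A-Z]\)\s*")


def strip_letter_prefix(choice: str) -> str:
    # Removes a leading "(A)"-style marker plus any surrounding whitespace.
    return _LETTER_PREFIX.sub("", choice).strip()


assert strip_letter_prefix("(A) Sentence") == "Sentence"
assert strip_letter_prefix("(B)Sentence") == "Sentence"
```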

    return Sample(
        input=record["Question"].strip(),
        choices=choices,
        target=ascii_uppercase[gold_index],
Member

we used to use LETTERS in place of this, LETTERS can probably be removed from the tasks then


# Inspect AI compatible parameters
solver: None = None
scorer: None = None
Member

Would make sense to factorize to avoid having 2 different ways to launch evals; otherwise it will mess up the source of truth

Member Author

factorize with current metrics? it would be quite messy i think. having it separate for now seems better, as we will eventually only use the scorer mechanic!

Comment on lines +110 to +131
def results_to_markdown_table(
results_per_model_per_task,
metric: str = "accuracy",
stderr_metric: str = "stderr",
max_total_columns: int | None = None,
means_only_task_threshold: int = 10,
) -> str:
cols = _collect_columns(results_per_model_per_task, means_only_task_threshold, max_total_columns)

writer = MarkdownTableWriter()
writer.headers = ["Model"] + cols

rows = []
for model in sorted(results_per_model_per_task.keys()):
row = [model]
data = results_per_model_per_task[model]
for col in cols:
row.append(_format_metric_cell(data, col, metric, stderr_metric))
rows.append(row)

writer.value_matrix = rows
return writer.dumps()
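
Hypothetical usage of the helper above, assuming results_per_model_per_task maps model → task → metric dict (the data shape and numbers are assumptions for illustration):

```python
results = {
    "hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras": {
        "gsm8k": {"accuracy": 0.6, "stderr": 0.05},
        "gpqa": {"accuracy": 0.33, "stderr": 0.04},
    },
}
print(results_to_markdown_table(results))
```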
Member

could you reuse the output functions we already have?
