Save and output number of samples of each task #851

itsmejul · 2025-07-03T18:26:18Z

This PR closes #804.

What does this PR do?

This PR adds a num_samples field to both the results_dict that is saved as JSON and the final_dict that is passed to make_results_table(), as requested in the issue. All existing elements in these dicts are left unchanged.

  "results": {
    "lighteval|gsm8k|0": {
      "extractive_match": 0.6,
      "extractive_match_stderr": 0.1632993161855452
    },
    "all": {
      "extractive_match": 0.6,
      "extractive_match_stderr": 0.1632993161855452
    }
  },
  "num_samples": {
    "lighteval|gsm8k|0": 10,
    "all": 10
  }

The keys in num_samples are exactly the same as the keys in results: the number of samples is calculated for each individual task, for each group of tasks (by summing its subtasks), and for the "all" entry. This lets make_results_table() add the number of samples to the markdown table it creates.

To guarantee backwards compatibility in make_results_table(), the "Number of Samples" fields are left empty when the result_dict does not contain num_samples.
The samples are counted via the length of each entry in details_logger.details.
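
The counting described above can be sketched roughly as follows. This is an illustrative stand-alone function, not the code from this PR: it takes a plain dict shaped like details_logger.details, and the grouping rule (subtasks named "suite|group:subtask|fewshot" rolling up into "suite|group:_average|fewshot") and the definition of "all" as the sum over individual tasks are assumptions made for the example.

```python
from collections import defaultdict


def calculate_num_samples(details):
    """Count samples per task, per task group, and overall (illustrative sketch)."""
    # Per-task count: one logged detail entry per evaluated sample.
    per_task = {task: len(entries) for task, entries in details.items()}

    # Assumed grouping rule: "suite|group:subtask|fewshot" rolls up into
    # "suite|group:_average|fewshot" by summing the subtask counts.
    grouped = defaultdict(int)
    for task, count in per_task.items():
        suite, name, fewshot = task.split("|")
        if ":" in name:
            group = name.split(":")[0]
            grouped[f"{suite}|{group}:_average|{fewshot}"] += count

    num_samples = {**per_task, **grouped}
    # Assumption: "all" counts every individual sample exactly once.
    num_samples["all"] = sum(per_task.values())
    return num_samples


# Hypothetical details mapping: task name -> list of per-sample detail entries.
details = {
    "lighteval|gsm8k|0": [object()] * 10,
    "lighteval|mmlu:anatomy|5": [object()] * 3,
    "lighteval|mmlu:astronomy|5": [object()] * 4,
}
counts = calculate_num_samples(details)
```

With only the gsm8k entry present, this reproduces the num_samples block from the JSON example above: 10 for the task and 10 for "all".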

Changes

Added calculate_num_samples() method in EvaluationTracker
Added num_samples field to results_dict in EvaluationTracker.save()
Added num_samples field to final_dict in EvaluationTracker.generate_final_dict()
Added "Number of Samples" field to markdown table generated in make_results_table()
Modified example results.json in docs to include the new entry
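
For reviewers, the backwards-compatible "Number of Samples" column from the list above could look roughly like this. The function name render_results_table and the column layout are hypothetical and simplified; the actual make_results_table() in the repository may build its table differently.

```python
def render_results_table(result_dict):
    """Render a markdown table; the samples cell stays empty for old result files."""
    # Older result files have no "num_samples" key, so fall back to an empty mapping.
    num_samples = result_dict.get("num_samples", {})

    lines = [
        "|Task|Metric|Value|Stderr|Number of Samples|",
        "|---|---|---|---|---|",
    ]
    for task, metrics in result_dict["results"].items():
        samples = num_samples.get(task, "")
        for metric, value in metrics.items():
            # Stderr entries are shown next to their metric, not as separate rows.
            if metric.endswith("_stderr"):
                continue
            stderr = metrics.get(f"{metric}_stderr", "")
            lines.append(f"|{task}|{metric}|{value}|{stderr}|{samples}|")
    return "\n".join(lines)


# Hypothetical inputs: one with the new field, one in the old format.
new_format = {"results": {"t": {"acc": 0.6, "acc_stderr": 0.1}}, "num_samples": {"t": 10}}
old_format = {"results": {"t": {"acc": 0.6, "acc_stderr": 0.1}}}
new_table = render_results_table(new_format)
old_table = render_results_table(old_format)
```

Looking up num_samples with a default keeps old result files renderable without a separate code path.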

Tests

All tests passed locally.


itsmejul added 5 commits July 3, 2025 19:04

Create calculate_num_samples method in evaluation_tracker to count number of samples per task

d2064d2

add num_samples to final_dict of evaluation_tracker

9d951c8

add num_samples to results_dict in EvaluationTracker.save()

f461f75

Add num_samples to the markdown table printed by make_results_table()

6c860f2

Add num_samples entry in example results.json in docs

dcd760d
