Commit 65f5897: [nanoeval] update readme (#45)
Authored Feb 26, 2025 · 1 parent 25ce1b8
1 file changed: project/nanoeval/README.md (+43, -18 lines)

# nanoeval

Simple, ergonomic, and high performance evals. We use it at OpenAI as part of our infrastructure to run Preparedness evaluations.

# Installation

```bash
# Using https://github.com/astral-sh/uv (recommended)
uv add "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"

# Using pip
pip install "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
```

nanoeval is pre-release software and may have breaking changes, so it's recommended that you pin your installation to a specific commit. The uv command above will do this for you.

# Principles

[…]

- `Eval` - A [chz](https://github.com/openai/chz) class. Enumerates a set of tasks, and (typically) uses a "Solver" to solve them and then records the results. Can be configured in code or on the CLI using a chz entrypoint.
- `EvalSpec` - An eval to run, plus the runtime characteristics of how to run it (e.g. concurrency, recording, other administrivia).
- `Task` - A single scoreable unit of work.
- `Solver` - A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (e.g. looking at logits, few-shot prompting, etc.).

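Putting these together, here is a rough wiring sketch of how the four concepts relate. Everything beyond the concept names is an assumption (the import paths, the `EvalSpec` fields, and the `run()` entrypoint are not confirmed API); see "Running your first eval" below for the real flow.

```python
# Hypothetical wiring; import paths and signatures are assumptions.
import asyncio

from nanoeval import Eval, EvalSpec, run  # assumed imports

async def main() -> None:
    my_eval: Eval = ...            # an Eval enumerates Tasks and applies a Solver
    spec = EvalSpec(eval=my_eval)  # adds runtime settings (concurrency, recording)
    await run(spec)                # executes the tasks and records the results

asyncio.run(main())
```
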
# Running your first eval

[…]

The executors can operate in two modes:

1. **In-process:** The executor is just an async task running in the same process as the main eval script. This is the default.
2. **Multiprocessing:** Starts a pool of executor processes that all poll the db. Use this via `spec.runner.experimental_use_multiprocessing=True`.

## Performance

nanoeval has been tested up to ~5,000 concurrent rollouts. It is likely that it can go higher.

For highest performance, use multiprocessing with as many processes as your system's memory and core count allow. See `RunnerArgs` for documentation.

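As a concrete sketch, the configuration might look like the following. Only `experimental_use_multiprocessing` comes from this README; the other names (import path, `n_processes`) are assumptions, so check `RunnerArgs` for the real fields.

```python
# Hypothetical runner configuration; field names other than
# experimental_use_multiprocessing are assumptions.
import os

from nanoeval import EvalSpec, RunnerArgs  # assumed import path

spec = EvalSpec(
    eval=...,  # your Eval instance
    runner=RunnerArgs(
        experimental_use_multiprocessing=True,  # pool of executor processes
        n_processes=os.cpu_count() or 1,        # assumed field: scale to cores
    ),
)
```
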
## Monitoring

nanoeval has a tiny built-in monitor to track ongoing evals. It's a Streamlit app that visualizes the state of the internal run state database, which can be helpful for diagnosing hangs on specific tasks. To use it:

```bash
# either set spec.runner.use_monitor=True OR run this command:
python3 -m nanoeval.bin.mon
```

## Resumption

Because nanoeval uses a persistent database to track the state of individual tasks in a run, you can restart an in-progress eval if it crashes. (In-progress rollouts will be restarted from scratch, but completed rollouts will be saved.) To do this:

```bash
# Restarts the eval in a new process
python3 -m nanoeval.extras.resume run_set_id=...
```

You can list all run sets (databases) using the following command:

```bash
ls -lh "$(python3 -c "from nanoeval.fs_paths import database_dir; print(database_dir())")"
```

The run set ID for each database is simply the filename, without the `.db*` suffix.

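If you prefer doing this from Python, here is a small sketch. Only `database_dir()` is documented above; the suffix-stripping is plain stdlib code applying the naming rule from the previous sentence.

```python
# Sketch: derive run set IDs from the database directory.
from pathlib import Path

from nanoeval.fs_paths import database_dir  # documented in this README

for path in Path(database_dir()).iterdir():
    if ".db" in path.name:
        # run set ID = filename minus the .db* suffix
        print(path.name.split(".db")[0])
```
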
# Writing your first eval
An eval is just a `chz` class that defines `get_name()`, `get_tasks()`, `evaluate()` and `get_summary()`. Start with `gpqa_simple.py`; copy it and modify it to suit your needs. If necessary, drop down to the base `nanoeval.Eval` class instead of using `MCQEval`.
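As a rough skeleton, the shape is shown below. The four method names come from the sentence above; the decorator, signatures, and sync/async split are guesses, so prefer copying `gpqa_simple.py` as a known-good starting point.

```python
# Skeleton only; signatures and the decorator are assumptions.
from typing import Any

import chz
from nanoeval import Eval  # assumed import path

@chz.chz  # assumed chz class decorator
class MyEval(Eval):
    def get_name(self) -> str:
        return "my_eval"

    async def get_tasks(self) -> list[Any]:
        # Enumerate the Tasks to run (return type is a guess).
        raise NotImplementedError

    async def evaluate(self, task: Any) -> Any:
        # Solve and score a single task, typically via a Solver.
        raise NotImplementedError

    def get_summary(self, results: list[Any]) -> dict[str, Any]:
        # Aggregate per-task results into summary metrics.
        raise NotImplementedError
```
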
The following sections describe common use cases and how to achieve them.
## Public API
You may import code from any `nanoeval.*` package that does not start with an underscore. Functions and classes that start with an underscore are considered private.
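
For example (the private module name below is invented purely to illustrate the convention):

```python
from nanoeval.fs_paths import database_dir  # public: no leading underscore

# from nanoeval._internal import helper     # private: leading underscore, off-limits
```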

[…]

# Debugging
## Kill dangling executors

nanoeval uses `multiprocessing` to execute rollouts in parallel. Sometimes, if you Ctrl-C the main job, the multiprocessing executors don't have time to exit. A quick fix:

```bash
pkill -f multiprocessing.spawn
```

## Debugging stuck runs
`py-spy` is an excellent tool to figure out where processes are stuck if progress isn’t happening. You can check the monitor to find the PIDs of all the executors and py-spy them one by one. The executors also run `aiomonitor`, so you can connect to them via `python3 -m aiomonitor.cli ...` to inspect async tasks.
## Diagnosing main thread stalls
nanoeval relies heavily on Python asyncio for concurrency within each executor process, so blocking the main thread harms performance and stalls every other task in that process. A common footgun is making a synchronous LLM or HTTP call, which can stall the main thread for dozens of seconds.
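To make the footgun concrete, here is the anti-pattern and one standard fix. This is generic asyncio guidance, not nanoeval-specific API; `requests` stands in for any synchronous client.

```python
import asyncio

import requests  # any synchronous HTTP client, used as the example footgun

async def fetch_status_bad(url: str) -> int:
    # BAD: a synchronous call blocks the executor's event loop,
    # stalling every other rollout in this process.
    return requests.get(url).status_code

async def fetch_status_good(url: str) -> int:
    # Better: run the blocking call on a worker thread so the event
    # loop keeps servicing other tasks. (A natively async client works too.)
    response = await asyncio.to_thread(requests.get, url)
    return response.status_code
```
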
Tracking down blocking calls can be annoying, so nanoeval comes with some built-in features to diagnose these.
1. Blocking synchronous calls will trigger a stacktrace dump to a temporary directory. You can see them by running `open "$(python3 -c "from nanoeval.fs_paths import stacktrace_root_dir; print(stacktrace_root_dir())")"`.
2. Blocking synchronous calls will also trigger a console warning.
