nanoeval is pre-release software and may have breaking changes, so it's recommended that you pin your installation to a specific commit. The uv command above will do this for you.

# Principles

Simple, ergonomic, and high performance evals.

- `Eval` - A [chz](https://github.com/openai/chz) class. Enumerates a set of tasks, and (typically) uses a "Solver" to solve them and then records the results. Can be configured in code or on the CLI using a chz entrypoint.
- `EvalSpec` - An eval to run, plus the runtime characteristics of how to run it (e.g. concurrency, recording, other administrivia).
- `Task` - A single scoreable unit of work.
- `Solver` - A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (e.g. looking at logits, few-shot prompting, etc.).
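
To make these concepts concrete, here is a purely illustrative sketch; the names below are hypothetical stand-ins rather than nanoeval's actual classes (which are chz classes with their own signatures):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdditionTask:
    """A Task: one scoreable unit of work."""
    question_id: str
    a: int
    b: int


def addition_solver(task: AdditionTask) -> int:
    """A Solver: turns a task into a result that can be scored.

    A real solver would usually sample a model here; different solvers
    (logits, few-shot prompting, ...) can answer the same tasks.
    """
    return task.a + task.b


def score(task: AdditionTask, result: int) -> bool:
    """The Eval enumerates tasks, runs a solver over them, and scores each result."""
    return result == task.a + task.b
```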

# Running your first eval

The executors can operate in two modes:

1. **In-process:** The executor is just an async task running in the same process as the main eval script. The default.
2. **Multiprocessing:** Starts a pool of executor processes that all poll the db. Use this via `spec.runner.experimental_use_multiprocessing=True` (see the sketch below).
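
For example, a minimal sketch of enabling multiprocessing in code; the import path and the `eval=` field are assumptions here, while the `experimental_use_multiprocessing` flag is the one documented above:

```python
# Sketch only: assumes EvalSpec and RunnerArgs are importable from
# nanoeval.evaluation and accept these keyword arguments.
from nanoeval.evaluation import EvalSpec, RunnerArgs


def make_spec(eval_to_run) -> EvalSpec:
    # Equivalent to passing spec.runner.experimental_use_multiprocessing=True
    # on the chz command line.
    return EvalSpec(
        eval=eval_to_run,
        runner=RunnerArgs(experimental_use_multiprocessing=True),
    )
```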

## Performance

nanoeval has been tested up to ~5,000 concurrent rollouts. It is likely that it can go higher.

For highest performance, use multiprocessing with as many processes as your system memory + core count allows. See `RunnerArgs` for documentation.
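
As a rough starting point for sizing the pool (the numbers below are placeholders to measure against your own workload, and the exact `RunnerArgs` field to set depends on your version):

```python
# Back-of-the-envelope sizing: bounded by CPU cores and by a memory budget.
import os

cores = os.cpu_count() or 1
memory_budget_gib = 64       # RAM you are willing to dedicate to executors (placeholder)
est_gib_per_process = 2      # rough per-executor footprint (placeholder; measure yours)

n_processes = max(1, min(cores, memory_budget_gib // est_gib_per_process))
print(f"suggested number of executor processes: {n_processes}")
```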

## Monitoring

nanoeval has a tiny built-in monitor to track ongoing evals. It's a Streamlit app that visualizes the internal run state database. This can be helpful to diagnose hangs on specific tasks. To use it:

```bash
# either set spec.runner.use_monitor=True OR run this command:
python3 -m nanoeval.bin.mon
```

## Resumption

Because nanoeval uses a persistent database to track the state of individual tasks in a run, you can restart an in-progress eval if it crashes. (In-progress rollouts will be restarted from scratch, but completed rollouts will be saved.) To do this:

```bash
# Restarts the eval in a new process
python3 -m nanoeval.extras.resume run_set_id=...
```

You can list all run sets (databases) using the following command:

```bash
ls -lh "$(python3 -c "from nanoeval.fs_paths import database_dir; print(database_dir())")"
```
The run set ID for each database is simply the filename, without the `.db*` suffix.
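
For example, a small sketch that prints the available run set IDs using the same `database_dir()` helper as the shell command above (exact file naming may vary):

```python
from pathlib import Path

from nanoeval.fs_paths import database_dir

# Strip the .db / .db-wal / .db-shm suffixes to recover the run set IDs.
run_set_ids = sorted({p.name.split(".db")[0] for p in Path(database_dir()).iterdir()})
print("\n".join(run_set_ids))
```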
# Writing your first eval
An eval is just a `chz` class that defines `get_name()`, `get_tasks()`, `evaluate()` and `get_summary()`. Start with `gpqa_simple.py`; copy it and modify it to suit your needs. If necessary, drop down to the base `nanoeval.Eval` class instead of using `MCQEval`.
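
A rough skeleton is sketched below; the method names come from the docs above, but the exact signatures, async-ness, and chz usage in your version of nanoeval may differ:

```python
import chz
from nanoeval import Eval  # the base class; MCQEval is the higher-level alternative


@chz.chz
class MyEval(Eval):
    def get_name(self) -> str:
        return "my_eval"

    async def get_tasks(self):
        # Enumerate the Tasks to run.
        ...

    async def evaluate(self, task):
        # Solve and score a single Task (typically by calling a Solver).
        ...

    async def get_summary(self, results):
        # Aggregate per-task results into summary metrics.
        ...
```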

The following sections describe common use cases and how to achieve them.
## Public API
You may import code from any `nanoeval.*` package that does not start with an underscore. Functions and classes that start with an underscore are considered private.
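
For example (the private module name below is hypothetical, shown only to illustrate the convention):

```python
# Public: no underscore-prefixed component in the dotted path.
from nanoeval.fs_paths import database_dir

# Private: underscore-prefixed modules and names may change without notice,
# e.g. `from nanoeval._db import ...` (hypothetical name) is off limits.

print(database_dir())
```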
# Debugging
## Kill dangling executors

nanoeval uses `multiprocessing` to execute rollouts in parallel. Sometimes, if you ctrl-c the main job, the multiprocessing executors don’t have time to exit. A quick fix:

```bash
pkill -f multiprocessing.spawn
```
## Debugging stuck runs
`py-spy` is an excellent tool to figure out where processes are stuck if progress isn’t happening. You can check the monitor to find the PIDs of all the executors and py-spy them one by one. The executors also run `aiomonitor`, so you can connect to them via `python3 -m aiomonitor.cli ...` to inspect async tasks.
## Diagnosing main thread stalls

nanoeval relies heavily on Python asyncio for concurrency within each executor process, so blocking the main thread hurts performance and stalls every rollout in that process. A common footgun is making a synchronous LLM or HTTP call, which can stall the main thread for dozens of seconds.
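
For example, a sketch of the usual fix, pushing the synchronous call onto a worker thread so the event loop keeps running (`requests` stands in for whatever blocking client you use):

```python
import asyncio

import requests  # any synchronous HTTP client


def fetch_sync(url: str) -> str:
    # Blocking call: fine on a worker thread, but it would stall every
    # coroutine in the executor if called directly from an async task.
    return requests.get(url, timeout=30).text


async def fetch(url: str) -> str:
    # Offload the blocking call so other rollouts keep making progress.
    return await asyncio.to_thread(fetch_sync, url)
```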

Tracking down blocking calls can be annoying, so nanoeval comes with some built-in features to diagnose them:
1. Blocking synchronous calls will trigger a stacktrace dump to a temporary directory. You can see them by running `open "$(python3 -c "from nanoeval.fs_paths import stacktrace_root_dir; print(stacktrace_root_dir())")"`.
2. Blocking synchronous calls will also trigger a console warning.