Add run-eval skill for evaluation guidance #1809

neubig · 2026-01-24T03:03:21Z

Adds a concise skill documenting how to trigger and monitor evaluation runs.

Covers:

- GitHub API trigger with key parameters
- PR label triggers
- Datadog and kubectl monitoring
- Common errors and limits


Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | `eclipse-temurin:17-jdk` | Link |
| python | amd64, arm64 | `nikolaik/python-nodejs:python3.12-nodejs22` | Link |
| golang | amd64, arm64 | `golang:1.21-bookworm` | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:ff82409-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-ff82409-python \
  ghcr.io/openhands/agent-server:ff82409-python

All tags pushed for this build

ghcr.io/openhands/agent-server:ff82409-golang-amd64
ghcr.io/openhands/agent-server:ff82409-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:ff82409-golang-arm64
ghcr.io/openhands/agent-server:ff82409-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:ff82409-java-amd64
ghcr.io/openhands/agent-server:ff82409-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:ff82409-java-arm64
ghcr.io/openhands/agent-server:ff82409-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:ff82409-python-amd64
ghcr.io/openhands/agent-server:ff82409-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:ff82409-python-arm64
ghcr.io/openhands/agent-server:ff82409-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:ff82409-golang
ghcr.io/openhands/agent-server:ff82409-java
ghcr.io/openhands/agent-server:ff82409-python

About Multi-Architecture Support

- Each variant tag (e.g., ff82409-python) is a multi-arch manifest supporting both amd64 and arm64.
- Docker automatically pulls the correct architecture for your platform.
- Individual architecture tags (e.g., ff82409-python-amd64) are also available if needed.
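
The tag list above appears to follow a regular naming scheme. The sketch below reproduces it in Python. The mapping rules (`/` becomes `_s_`, `:` becomes `_tag_`) are inferred from the listed tags, not taken from the build scripts, and the function names are hypothetical.

```python
# Registry path as it appears in the pushed tags above.
REPO = "ghcr.io/openhands/agent-server"


def base_image_slug(base_image: str) -> str:
    """Convert a base image reference into a tag-safe slug.

    Assumption inferred from the tag list: "/" -> "_s_" and ":" -> "_tag_".
    """
    return base_image.replace("/", "_s_").replace(":", "_tag_")


def tags_for(short_sha: str, variant: str, base_image: str,
             arches=("amd64", "arm64")) -> list[str]:
    """List the tags one variant would receive for a given build."""
    slug = base_image_slug(base_image)
    tags = []
    for arch in arches:
        # Per-architecture tags, named by variant and by base image.
        tags.append(f"{REPO}:{short_sha}-{variant}-{arch}")
        tags.append(f"{REPO}:{short_sha}-{slug}-{arch}")
    # Multi-arch manifest tag covering every architecture.
    tags.append(f"{REPO}:{short_sha}-{variant}")
    return tags


if __name__ == "__main__":
    for tag in tags_for("ff82409", "golang", "golang:1.21-bookworm"):
        print(tag)
```

Running it for the golang variant prints the five golang tags from the list above.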

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-01-26T12:21:54Z

@OpenHands this is in the old format. Clone https://github.com/OpenHands/skills and follow the format of the skills there.

openhands-ai · 2026-01-26T12:22:04Z

I'm on it! neubig can track my progress at all-hands.dev

- Remove deprecated fields (type, version, agent) from frontmatter
- Add description field to all skill files
- Add name field to code-review.md

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-26T12:24:59Z

I've updated all skill files in the .openhands/skills/ directory to match the new format used in the OpenHands/skills repository. The changes have been pushed to the add-run-eval-skill branch.

Changes Made

Updated the frontmatter format for all 3 skill files:

- run-eval.md (the main file for this PR):
  - Removed deprecated fields: type, version, agent
  - Added description field
- write-behavior-test.md:
  - Removed deprecated fields: type, version, agent
  - Added description field
- code-review.md:
  - Added name field
  - Added description field

New Format

The skill files now follow the format from OpenHands/skills:

---
name: <skill-name>
description: <description of what the skill does and when to use it>
triggers:
- <trigger keywords>
---

This replaces the old format which had type, version, and agent fields instead of description.
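
As a rough illustration of the format change, the following sketch checks a skill file's frontmatter for the fields named above. It is not part of the PR. Treating `name` and `description` as required is an assumption, and the helper names are hypothetical. It reads only top-level keys with a regular expression rather than a full YAML parser.

```python
import re

# Fields the comment above calls deprecated in the new format.
DEPRECATED = {"type", "version", "agent"}
# Assumption: the fields added in this PR are treated as required.
REQUIRED = {"name", "description"}


def frontmatter_keys(text: str) -> set[str]:
    """Return the top-level keys found between the leading '---' delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        match = re.match(r"^([A-Za-z_][\w-]*):", line)
        if match:
            keys.add(match.group(1))
    return set()  # no closing delimiter, so treat as having no frontmatter


def check_skill(text: str) -> list[str]:
    """List the format problems found in one skill file's text."""
    keys = frontmatter_keys(text)
    problems = [f"missing field: {k}" for k in sorted(REQUIRED - keys)]
    problems += [f"deprecated field: {k}" for k in sorted(DEPRECATED & keys)]
    return problems


if __name__ == "__main__":
    new_style = "---\nname: run-eval\ndescription: How to run evals\ntriggers:\n- run eval\n---\nBody"
    print(check_skill(new_style))
```

A file in the new format yields an empty list. A file that still carries `type`, `version`, or `agent` is reported field by field.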

The PR has been updated: #1809


all-hands-bot

Good addition of evaluation guidance. Found several areas where clarity could be improved, particularly around placeholders and prerequisites.

.openhands/skills/run-eval.md

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

neubig · 2026-01-26T13:04:52Z

@OpenHands check the comments from all-hands-bot and my responses, and fix. Also, add a note that when running a full eval you must select a number of instances that is greater than or equal to the actual number of instances.

openhands-ai · 2026-01-26T13:05:02Z

I'm on it! neubig can track my progress at all-hands.dev

- Remove alternative PR labels section
- Fix code block formatting for Datadog script
- Clarify kubectl section is for users with cluster access (agent doesn't have kubectl)
- Update 503 error fix to ask user to stop evals using kubectl
- Remove hardcoded swebenchmultimodal instance count
- Add note about eval_limit needing to be >= actual instances for full evals

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-26T13:07:01Z

I've addressed all the PR review comments from all-hands-bot and neubig's responses. The changes have been pushed to the add-run-eval-skill branch.

Changes Made

| Requested Change | Status |
| --- | --- |
| Remove alternative PR labels section (line 39) | ✅ Done |
| Fix code block formatting for Datadog script (missing closing ```) | ✅ Done |
| Clarify kubectl section is for users with cluster access (agent doesn't have kubectl) | ✅ Done |
| Update 503 error fix to "Ask user to stop some evaluation runs using kubectl" | ✅ Done |
| Remove hardcoded swebenchmultimodal instance count (line 69) | ✅ Done |
| Add note about eval_limit needing to be >= actual instances for full evals | ✅ Done |

Summary of Changes

- Replaced the "Alternative: Add labels..." line with a Note about full eval requirements: "When running a full eval, you must select an eval_limit that is greater than or equal to the actual number of instances in the benchmark."
- Fixed the Datadog script code block by adding the missing closing ``` delimiter.
- Updated kubectl section header to clarify: "(for users with cluster access - the agent does not have kubectl access)"
- Changed the 503 error fix from "Reduce concurrent evals to 2-3" to "Ask user to stop some evaluation runs using kubectl"
- Removed the hardcoded "swebenchmultimodal: 102 instances" line from the Limits section.
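
The eval_limit note can be read as a simple inequality. The sketch below assumes that eval_limit acts as a cap on how many instances run, which the thread implies but does not state. The function names and the benchmark size of 500 are hypothetical.

```python
def covers_full_benchmark(eval_limit: int, benchmark_size: int) -> bool:
    """True when eval_limit is at least the benchmark's instance count."""
    return eval_limit >= benchmark_size


def instances_evaluated(eval_limit: int, benchmark_size: int) -> int:
    """Instances that would run, assuming eval_limit behaves as a cap."""
    return min(eval_limit, benchmark_size)


if __name__ == "__main__":
    size = 500  # hypothetical benchmark size, for illustration only
    for limit in (50, 500, 1000):
        print(limit, covers_full_benchmark(limit, size), instances_evaluated(limit, size))
```

Under that assumption, any limit below the benchmark size yields a partial run, while a limit at or above it covers every instance.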

PR #1809 has been updated: #1809


all-hands-bot

Overall useful addition for evaluation workflows. Found several clarity and consistency issues that should be addressed before merging.

.openhands/skills/run-eval.md

openhands-ai · 2026-01-26T13:12:34Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Agent Server

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1809 at branch `add-run-eval-skill`

Feel free to include any additional details that might help me get this PR into a better state.


Co-authored-by: OpenHands Bot <contact@all-hands.dev>

neubig · 2026-01-26T13:23:09Z

@OpenHands Check the unresolved comments from all-hands-bot and my responses. Fix according to my responses. Then mark the comments as "resolved" using the GitHub API.

openhands-ai · 2026-01-26T13:23:19Z

I'm on it! neubig can track my progress at all-hands.dev

… kubectl references
- Change 'Set' to 'are set' for Datadog environment variables (they are typically set automatically)
- Remove unnecessary clone/cd evaluation repo comments
- Remove kubectl mention from 503 error fix

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-26T13:26:53Z

I've addressed all the unresolved PR review comments from all-hands-bot based on neubig's responses and marked them as resolved.

Changes Made

| neubig's Feedback | Action Taken |
| --- | --- |
| "It should not be 'Set', but 'are set' (since they will typically be set automatically)" | Updated Datadog script description to say "DD_API_KEY, DD_APP_KEY, and DD_SITE environment variables are set" |
| "Unnecessary" (about clone/cd evaluation repo comments) | Removed the 3 lines with clone/cd instructions from the code block |
| "Just remove the line 53 mention of kubectl" (about 503 error fix) | Removed "using kubectl" from the 503 error fix in the Common Errors table |

Resolved Review Threads

All 4 unresolved review threads have been marked as resolved via the GitHub GraphQL API:

- PRRT_kwDOPjFrIs5rB2z6 - DD_API_KEY syntax (neubig approved it)
- PRRT_kwDOPjFrIs5rB20E - Datadog credentials note (addressed with "are set")
- PRRT_kwDOPjFrIs5rB20K - kubectl placeholders (addressed by removing kubectl mention)
- PRRT_kwDOPjFrIs5rB20O - 503 error kubectl contradiction (addressed by removing kubectl mention)
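
For readers unfamiliar with resolving review threads programmatically, here is a minimal sketch of how such a request could be built with GitHub's GraphQL `resolveReviewThread` mutation. It is not the code the agent ran. The helper name `build_resolve_request` is hypothetical, the token is a placeholder, and the request is only constructed, not sent.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.github.com/graphql"

# Mutation that marks one pull request review thread as resolved.
RESOLVE_MUTATION = """
mutation($threadId: ID!) {
  resolveReviewThread(input: {threadId: $threadId}) {
    thread { id isResolved }
  }
}
"""


def build_resolve_request(thread_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that resolves one thread."""
    body = json.dumps(
        {"query": RESOLVE_MUTATION, "variables": {"threadId": thread_id}}
    ).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Thread ID taken from the list above; the token is a placeholder.
    request = build_resolve_request("PRRT_kwDOPjFrIs5rB2z6", "<token>")
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request)
```

Sending the request requires a token with write access to the pull request. Each thread ID from the list above would be passed in turn.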

The changes have been pushed to the add-run-eval-skill branch.


juanmichelini

LGTM

Add run-eval skill for evaluation guidance

93c68d1

Co-authored-by: openhands <openhands@all-hands.dev>

Update skill files to new format with name and description

8842dee

- Remove deprecated fields (type, version, agent) from frontmatter
- Add description field to all skill files
- Add name field to code-review.md

Co-authored-by: openhands <openhands@all-hands.dev>

neubig marked this pull request as ready for review January 26, 2026 12:49

neubig requested a review from xingyaoww January 26, 2026 12:49

all-hands-bot reviewed Jan 26, 2026


neubig requested review from juanmichelini and removed request for xingyaoww January 26, 2026 12:59

neubig and others added 2 commits January 26, 2026 07:59

Update .openhands/skills/run-eval.md

2bc9126

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

Update .openhands/skills/run-eval.md

f7e0d56

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

neubig marked this pull request as draft January 26, 2026 13:05

neubig changed the title ~~DRAFT: Add run-eval skill for evaluation guidance~~ Add run-eval skill for evaluation guidance Jan 26, 2026

neubig marked this pull request as ready for review January 26, 2026 13:09

Merge branch 'main' into add-run-eval-skill

18edb8f

all-hands-bot reviewed Jan 26, 2026


neubig and others added 2 commits January 26, 2026 08:22

Update .openhands/skills/run-eval.md

185d765

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

Update .openhands/skills/run-eval.md

c1ddab4

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

juanmichelini approved these changes Jan 27, 2026


neubig merged commit 004b444 into main Jan 27, 2026
25 checks passed

neubig deleted the add-run-eval-skill branch January 27, 2026 19:58

4 participants
