@neubig
Contributor

@neubig neubig commented Jan 24, 2026

Adds a concise skill documenting how to trigger and monitor evaluation runs.

Covers:

  • GitHub API trigger with key parameters
  • PR label triggers
  • Datadog and kubectl monitoring
  • Common errors and limits


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:ff82409-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-ff82409-python \
  ghcr.io/openhands/agent-server:ff82409-python

All tags pushed for this build

ghcr.io/openhands/agent-server:ff82409-golang-amd64
ghcr.io/openhands/agent-server:ff82409-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:ff82409-golang-arm64
ghcr.io/openhands/agent-server:ff82409-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:ff82409-java-amd64
ghcr.io/openhands/agent-server:ff82409-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:ff82409-java-arm64
ghcr.io/openhands/agent-server:ff82409-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:ff82409-python-amd64
ghcr.io/openhands/agent-server:ff82409-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:ff82409-python-arm64
ghcr.io/openhands/agent-server:ff82409-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:ff82409-golang
ghcr.io/openhands/agent-server:ff82409-java
ghcr.io/openhands/agent-server:ff82409-python

About Multi-Architecture Support

  • Each variant tag (e.g., ff82409-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., ff82409-python-amd64) are also available if needed
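
The tag list above appears to follow a regular naming scheme, which the following Python sketch reproduces. The escaping rule in `base_image_slug` ("/" becomes "_s_", ":" becomes "_tag_") is inferred from the pushed tags rather than taken from the build scripts, and the names `build_tags` and `base_image_slug` are hypothetical.

```python
REGISTRY = "ghcr.io/openhands/agent-server"
ARCHES = ("amd64", "arm64")

# Variant -> base image, copied from the "Variants & Base Images" table.
VARIANTS = {
    "golang": "golang:1.21-bookworm",
    "java": "eclipse-temurin:17-jdk",
    "python": "nikolaik/python-nodejs:python3.12-nodejs22",
}


def base_image_slug(base_image: str) -> str:
    # Assumed escaping, inferred from the pushed tags: "/" -> "_s_", ":" -> "_tag_".
    return base_image.replace("/", "_s_").replace(":", "_tag_")


def build_tags(short_sha: str) -> list[str]:
    tags = []
    for variant, base_image in VARIANTS.items():
        for arch in ARCHES:
            # Per-architecture tag named after the variant ...
            tags.append(f"{REGISTRY}:{short_sha}-{variant}-{arch}")
            # ... and the same image tagged after its escaped base image.
            tags.append(f"{REGISTRY}:{short_sha}-{base_image_slug(base_image)}-{arch}")
    # Multi-arch manifest tags, one per variant.
    for variant in VARIANTS:
        tags.append(f"{REGISTRY}:{short_sha}-{variant}")
    return tags


for tag in build_tags("ff82409"):
    print(tag)
```

Under these assumptions, three variants and two architectures give twelve per-architecture tags plus three manifest tags, matching the fifteen tags listed for this build.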

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig
Contributor Author

neubig commented Jan 26, 2026

@OpenHands this is in the old format. Clone https://github.com/OpenHands/skills and follow the format of the skills there.

@openhands-ai

openhands-ai bot commented Jan 26, 2026

I'm on it! neubig can track my progress at all-hands.dev

- Remove deprecated fields (type, version, agent) from frontmatter
- Add description field to all skill files
- Add name field to code-review.md

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Jan 26, 2026

I've updated all skill files in the .openhands/skills/ directory to match the new format used in the OpenHands/skills repository. The changes have been pushed to the add-run-eval-skill branch.

Changes Made

Updated the frontmatter format for all 3 skill files:

  1. run-eval.md (the main file for this PR):

    • Removed deprecated fields: type, version, agent
    • Added description field
  2. write-behavior-test.md:

    • Removed deprecated fields: type, version, agent
    • Added description field
  3. code-review.md:

    • Added name field
    • Added description field

New Format

The skill files now follow the format from OpenHands/skills:

---
name: <skill-name>
description: <description of what the skill does and when to use it>
triggers:
- <trigger keywords>
---

This replaces the old format which had type, version, and agent fields instead of description.
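
A minimal Python sketch of how a skill file could be checked against this format. It assumes that `name` and `description` are required and that `type`, `version`, and `agent` should no longer appear; the helper names, the sample description, and the sample trigger are hypothetical, and the parser only reads top-level keys rather than full YAML.

```python
REQUIRED_FIELDS = {"name", "description"}
DEPRECATED_FIELDS = {"type", "version", "agent"}


def frontmatter_keys(text: str) -> set[str]:
    """Return the top-level keys found between the opening and closing '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Top-level keys start in column 0; list items and indented lines are skipped.
        if line and not line[0].isspace() and not line.startswith("-") and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


def check_skill(text: str) -> list[str]:
    """List format problems; an empty list means the frontmatter looks current."""
    keys = frontmatter_keys(text)
    problems = [f"missing field: {k}" for k in sorted(REQUIRED_FIELDS - keys)]
    problems += [f"deprecated field: {k}" for k in sorted(DEPRECATED_FIELDS & keys)]
    return problems


# Hypothetical skill file in the new format.
NEW_STYLE = """---
name: run-eval
description: Trigger and monitor evaluation runs.
triggers:
- run eval
---
Skill body goes here.
"""

print(check_skill(NEW_STYLE))
```

A file still carrying `type`, `version`, or `agent` would be reported with one "deprecated field" entry per leftover key.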

The PR has been updated: #1809

@neubig neubig marked this pull request as ready for review January 26, 2026 12:49
@neubig neubig requested a review from xingyaoww January 26, 2026 12:49
Collaborator

@all-hands-bot all-hands-bot left a comment

Good addition of evaluation guidance. Found several areas where clarity could be improved, particularly around placeholders and prerequisites.

@neubig neubig requested review from juanmichelini and removed request for xingyaoww January 26, 2026 12:59
neubig and others added 2 commits January 26, 2026 07:59
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
@neubig
Contributor Author

neubig commented Jan 26, 2026

@OpenHands check the comments from all-hands-bot and my responses, and fix. Also, add a note that when running a full eval you must select a number of instances that is greater than or equal to the actual number of instances.

@neubig neubig marked this pull request as draft January 26, 2026 13:05
@openhands-ai

openhands-ai bot commented Jan 26, 2026

I'm on it! neubig can track my progress at all-hands.dev

- Remove alternative PR labels section
- Fix code block formatting for Datadog script
- Clarify kubectl section is for users with cluster access (agent doesn't have kubectl)
- Update 503 error fix to ask user to stop evals using kubectl
- Remove hardcoded swebenchmultimodal instance count
- Add note about eval_limit needing to be >= actual instances for full evals

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Jan 26, 2026

I've addressed all the PR review comments from all-hands-bot and neubig's responses. The changes have been pushed to the add-run-eval-skill branch.

Changes Made

| Requested Change | Status |
|---|---|
| Remove alternative PR labels section (line 39) | ✅ Done |
| Fix code block formatting for Datadog script (missing closing ```) | ✅ Done |
| Clarify kubectl section is for users with cluster access (agent doesn't have kubectl) | ✅ Done |
| Update 503 error fix to "Ask user to stop some evaluation runs using kubectl" | ✅ Done |
| Remove hardcoded swebenchmultimodal instance count (line 69) | ✅ Done |
| Add note about eval_limit needing to be >= actual instances for full evals | ✅ Done |

Summary of Changes

  1. Replaced the "Alternative: Add labels..." line with a Note about full eval requirements: "When running a full eval, you must select an eval_limit that is greater than or equal to the actual number of instances in the benchmark."

  2. Fixed the Datadog script code block by adding the missing closing ``` delimiter.

  3. Updated kubectl section header to clarify: "(for users with cluster access - the agent does not have kubectl access)"

  4. Changed the 503 error fix from "Reduce concurrent evals to 2-3" to "Ask user to stop some evaluation runs using kubectl"

  5. Removed the hardcoded "swebenchmultimodal: 102 instances" line from the Limits section.
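
The full-eval note in item 1 comes down to a single comparison, sketched below in Python. The function names and the benchmark size of 500 are hypothetical, and capping an oversized limit at the benchmark size is an assumption about how the limit is applied, not something the skill states.

```python
def covers_full_benchmark(eval_limit: int, instance_count: int) -> bool:
    """True when eval_limit is large enough that no instance is left out."""
    return eval_limit >= instance_count


def instances_evaluated(eval_limit: int, instance_count: int) -> int:
    # Assumption: a limit larger than the benchmark is capped at the benchmark size.
    return min(eval_limit, instance_count)


# Hypothetical benchmark size, for illustration only.
INSTANCE_COUNT = 500

for limit in (50, 500, 1000):
    print(limit, covers_full_benchmark(limit, INSTANCE_COUNT),
          instances_evaluated(limit, INSTANCE_COUNT))
```

Choosing a limit comfortably above the benchmark size avoids depending on an exact instance count, which fits the decision to drop the hardcoded count from the Limits section.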

PR #1809 has been updated: #1809

@neubig neubig changed the title DRAFT: Add run-eval skill for evaluation guidance Add run-eval skill for evaluation guidance Jan 26, 2026
@neubig neubig marked this pull request as ready for review January 26, 2026 13:09
Collaborator

@all-hands-bot all-hands-bot left a comment

Overall useful addition for evaluation workflows. Found several clarity and consistency issues that should be addressed before merging.

@openhands-ai

openhands-ai bot commented Jan 26, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Agent Server

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1809 at branch `add-run-eval-skill`

Feel free to include any additional details that might help me get this PR into a better state.

neubig and others added 2 commits January 26, 2026 08:22
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
@neubig
Contributor Author

neubig commented Jan 26, 2026

@OpenHands Check the unresolved comments from all-hands-bot and my responses, and fix them according to my responses. Then mark the comments as "resolved" using the GitHub API.

@openhands-ai

openhands-ai bot commented Jan 26, 2026

I'm on it! neubig can track my progress at all-hands.dev

… kubectl references

- Change 'Set' to 'are set' for Datadog environment variables (they are typically set automatically)
- Remove unnecessary clone/cd evaluation repo comments
- Remove kubectl mention from 503 error fix

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Jan 26, 2026

I've addressed all the unresolved PR review comments from all-hands-bot based on neubig's responses and marked them as resolved.

Changes Made

| neubig's Feedback | Action Taken |
|---|---|
| "It should not be 'Set', but 'are set' (since they will typically be set automatically)" | Updated Datadog script description to say "DD_API_KEY, DD_APP_KEY, and DD_SITE environment variables are set" |
| "Unnecessary" (about clone/cd evaluation repo comments) | Removed the 3 lines with clone/cd instructions from the code block |
| "Just remove the line 53 mention of kubectl" (about 503 error fix) | Removed "using kubectl" from the 503 error fix in the Common Errors table |

Resolved Review Threads

All 4 unresolved review threads have been marked as resolved via the GitHub GraphQL API:

  • PRRT_kwDOPjFrIs5rB2z6 - DD_API_KEY syntax (neubig approved it)
  • PRRT_kwDOPjFrIs5rB20E - Datadog credentials note (addressed with "are set")
  • PRRT_kwDOPjFrIs5rB20K - kubectl placeholders (addressed by removing kubectl mention)
  • PRRT_kwDOPjFrIs5rB20O - 503 error kubectl contradiction (addressed by removing kubectl mention)
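
For reference, a review thread can be resolved with GitHub's `resolveReviewThread` GraphQL mutation. The Python sketch below only builds the requests and does not send them; the helper name `build_resolve_request` and the `<token>` placeholder are hypothetical, and this is not necessarily the exact call the bot made.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.github.com/graphql"

# GitHub GraphQL mutation that marks one review thread as resolved.
RESOLVE_THREAD_MUTATION = """
mutation($threadId: ID!) {
  resolveReviewThread(input: {threadId: $threadId}) {
    thread { id isResolved }
  }
}
"""


def build_resolve_request(thread_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that resolves one thread."""
    body = json.dumps(
        {"query": RESOLVE_THREAD_MUTATION, "variables": {"threadId": thread_id}}
    )
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Thread IDs as listed in this comment.
THREAD_IDS = [
    "PRRT_kwDOPjFrIs5rB2z6",
    "PRRT_kwDOPjFrIs5rB20E",
    "PRRT_kwDOPjFrIs5rB20K",
    "PRRT_kwDOPjFrIs5rB20O",
]

pending = [build_resolve_request(tid, "<token>") for tid in THREAD_IDS]
print(len(pending), "requests prepared")
```

Passing each request to `urllib.request.urlopen` with a real token would perform the mutation; the response's `isResolved` field reports the outcome.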

The changes have been pushed to the add-run-eval-skill branch.

Collaborator

@juanmichelini juanmichelini left a comment

LGTM

@neubig neubig merged commit 004b444 into main Jan 27, 2026
25 checks passed
@neubig neubig deleted the add-run-eval-skill branch January 27, 2026 19:58