Add run-eval skill for evaluation guidance #1809
Conversation
@OpenHands this is in the old format. Clone https://github.com/OpenHands/skills and follow the format of the skills there.
I'm on it! neubig can track my progress at all-hands.dev |
- Remove deprecated fields (type, version, agent) from frontmatter
- Add description field to all skill files
- Add name field to code-review.md

Co-authored-by: openhands <openhands@all-hands.dev>
I've updated all skill files.

**Changes Made**

Updated the frontmatter format for all 3 skill files.

**New Format**

The skill files now follow the format from OpenHands/skills:

```yaml
---
name: <skill-name>
description: <description of what the skill does and when to use it>
triggers:
  - <trigger keywords>
---
```

This replaces the old format, which had `type`, `version`, and `agent` fields. The PR has been updated: #1809
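As an illustration, a frontmatter block for this PR's run-eval skill in the new format might look like the following; the description wording and trigger keywords here are hypothetical, not copied from the PR:

```yaml
---
name: run-eval
# Hypothetical description; the actual skill file's wording may differ
description: Guidance for triggering and monitoring evaluation runs
triggers:
  # Hypothetical trigger keywords
  - run eval
  - evaluation run
---
```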
all-hands-bot left a comment:
Good addition of evaluation guidance. Found several areas where clarity could be improved, particularly around placeholders and prerequisites.
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
@OpenHands check the comments from all-hands-bot and my responses, and fix. Also, add a note that when running a full eval you must select a number of instances that is greater than or equal to the actual number of instances.
I'm on it! neubig can track my progress at all-hands.dev |
- Remove alternative PR labels section
- Fix code block formatting for Datadog script
- Clarify kubectl section is for users with cluster access (agent doesn't have kubectl)
- Update 503 error fix to ask user to stop evals using kubectl
- Remove hardcoded swebenchmultimodal instance count
- Add note about eval_limit needing to be >= actual instances for full evals

Co-authored-by: openhands <openhands@all-hands.dev>
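The eval_limit rule requested above can be sketched as a quick sanity check. This is illustrative only: the variable names and numbers are hypothetical, not flags or values from the actual evaluation CLI.

```shell
# Hypothetical names and values, not the real eval runner's interface.
# For a full eval, the limit must be >= the benchmark's actual instance
# count, otherwise the run silently covers only a subset.
ACTUAL_INSTANCES=500   # e.g. the benchmark's reported size
EVAL_LIMIT=600         # must satisfy EVAL_LIMIT >= ACTUAL_INSTANCES
if [ "$EVAL_LIMIT" -ge "$ACTUAL_INSTANCES" ]; then
  echo "ok: eval_limit covers all instances"
else
  echo "error: eval_limit would truncate the run" >&2
fi
```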
I've addressed all the PR review comments from all-hands-bot and neubig's responses. The changes have been pushed.
all-hands-bot left a comment:
Overall useful addition for evaluation workflows. Found several clarity and consistency issues that should be addressed before merging.
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
@OpenHands Check the unresolved comments from all-hands-bot and my responses. Fix according to my responses. And then mark the comments as "resolved" using the GitHub API.
I'm on it! neubig can track my progress at all-hands.dev |
- … kubectl references
- Change 'Set' to 'are set' for Datadog environment variables (they are typically set automatically)
- Remove unnecessary clone/cd evaluation repo comments
- Remove kubectl mention from 503 error fix

Co-authored-by: openhands <openhands@all-hands.dev>
I've addressed all the unresolved PR review comments from all-hands-bot based on neubig's responses and marked them as resolved.

**Resolved Review Threads**

All 4 unresolved review threads have been marked as resolved via the GitHub GraphQL API. The changes have been pushed.
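For reference, resolving a review thread through the GraphQL API can be sketched with the `gh` CLI as below. The thread ID is a placeholder, and the final command is echoed rather than executed, since the real call needs an authenticated session:

```shell
# Sketch of resolving a PR review thread via GitHub's GraphQL API.
# THREAD_ID is a placeholder; real IDs come from first querying the
# pull request's reviewThreads connection.
THREAD_ID="PRRT_placeholder"
QUERY='mutation($id: ID!) {
  resolveReviewThread(input: {threadId: $id}) {
    thread { isResolved }
  }
}'
# Shown, not run: requires an authenticated gh session.
echo gh api graphql -f query="$QUERY" -F id="$THREAD_ID"
```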
juanmichelini left a comment:
LGTM
Adds a concise skill documenting how to trigger and monitor evaluation runs.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:ff82409-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
The versioned tag (e.g. `ff82409-python`) is a multi-arch manifest supporting both amd64 and arm64. Arch-specific tags (e.g. `ff82409-python-amd64`) are also available if needed.
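Assuming the tag scheme above, choosing between the multi-arch manifest and an arch-specific image can be sketched as follows; the commands are echoed rather than run, since actually pulling requires Docker and registry access:

```shell
# Multi-arch manifest tag: Docker selects the right architecture automatically.
IMAGE="ghcr.io/openhands/agent-server"
TAG="ff82409-python"
echo docker pull "$IMAGE:$TAG"
# Arch-specific fallback (amd64 only), per the note above:
echo docker pull "$IMAGE:$TAG-amd64"
```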