
feat: Leaderboard Submission Script #1


Open
wants to merge 8 commits into base: base-sha/5350f947594f1393f0a46bccec214ffd94ca5dc1

Conversation


@Hellebore Hellebore commented May 2, 2025

This PR introduces a new utility script (devops/add_to_leaderboard.sh) that streamlines the process of adding policies to our navigation evaluation leaderboard. The script handles the end-to-end workflow of submitting a policy, running the navigation simulation, and updating the dashboard.

Changes:

  • Add new shell script add_to_leaderboard.sh with clear documentation and error handling
  • Update selector type in sim_job.yaml from "top" to "latest" for consistent evaluation
  • Fix bug in policy_store.py to properly handle missing eval scores in metadata
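
As a rough illustration of the policy_store.py fix (a sketch only; the record structure, metric name, and helper names below are assumptions rather than the actual implementation):

def _score_for(metadata: dict, metric: str):
    # Look in the nested eval_scores dict first, then fall back to the top level of the metadata.
    eval_scores = metadata.get("eval_scores") or {}
    if metric in eval_scores:
        return eval_scores[metric]
    return metadata.get(metric)  # None when the score was never recorded

def _select_top(records: list, metric: str):
    # Prefer the record with the best score; fall back to the latest record
    # if no candidate has the metric at all.
    scored = [(record, _score_for(record.metadata, metric)) for record in records]
    valid = [(record, score) for record, score in scored if score is not None]
    if not valid:
        return records[-1]  # assumes records are ordered oldest to newest
    return max(valid, key=lambda pair: pair[1])[0]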

Usage:

devops/add_to_leaderboard.sh -r <run_name> [+hardware=macbook] [other Hydra overrides]

Example:

devops/add_to_leaderboard.sh -r b.rwalters.64r +hardware=macbook

Dashboard URL

Summary by Sourcery

Introduce a new leaderboard submission script and improve policy selection and build processes for navigation evaluation

New Features:

  • Add new utility script add_to_leaderboard.sh for streamlined policy submission and evaluation

Bug Fixes:

  • Fix handling of missing evaluation scores in policy metadata
  • Add proper error handling for policy record local path

Enhancements:

  • Improve policy record selection logic in policy_store.py
  • Add dedicated build script for mettagrid
  • Update navigation evaluation configs to include self-defaults

Deployment:

  • Create script to automate leaderboard policy submission
  • Update dashboard generation process

Chores:

  • Refactor dependency build process
  • Improve logging in policy selection

@Hellebore

This is a benchmark review for experiment bakeoff.
Run ID: bakeoff/benchmark_2025-05-02T11-38-04_v1-36-0-dirty.

This pull request was cloned from https://github.com/Metta-AI/metta/pull/277. (Note: the URL is not a link to avoid triggering a notification on the original pull request.)

Experiment configuration
review_config:
  # User configuration for the review
  # - benchmark - use the user config from the benchmark reviews
  # - <value> - use the value directly
  user_review_config:
    enable_ai_review: true
    enable_rule_comments: false

    enable_complexity_comments: benchmark
    enable_security_comments: benchmark
    enable_tests_comments: benchmark
    enable_comment_suggestions: benchmark

    enable_pull_request_summary: benchmark
    enable_review_guide: benchmark

    enable_approvals: false
    base_branches: [base-sha.*]

  ai_review_config:
    # The model responses to use for the experiment
    # - benchmark - use the model responses from the benchmark reviews
    # - llm - call the language model to generate responses
    model_responses:
      comments_model: benchmark
      comment_validation_model: benchmark
      comment_suggestion_model: benchmark
      complexity_model: benchmark
      security_model: benchmark
      tests_model: benchmark
      pull_request_summary_model: benchmark
      review_guide_model: benchmark
      overall_comments_model: benchmark

# The pull request dataset to run the experiment on
pull_request_dataset:
# CodeRabbit
- https://github.com/neerajkumar161/node-coveralls-integration/pull/5
- https://github.com/gunner95/vertx-rest/pull/1
- https://github.com/Altinn/altinn-access-management-frontend/pull/1427
- https://github.com/theMr17/github-notifier/pull/14
- https://github.com/bearycool11/AI_memory_Loops/pull/142

# Greptile
- https://github.com/gumloop/guMCP/pull/119
- https://github.com/autoblocksai/python-sdk/pull/335
- https://github.com/grepdemos/ImageSharp/pull/6
- https://github.com/grepdemos/server/pull/61
- https://github.com/websentry-ai/pipelines/pull/25

# Graphite
- https://github.com/KittyCAD/modeling-app/pull/6648
- https://github.com/KittyCAD/modeling-app/pull/6628
- https://github.com/Varedis-Org/AI-Test-Repo/pull/2
- https://github.com/deeep-network/bedrock/pull/198
- https://github.com/Metta-AI/metta/pull/277

# Copilot
- https://github.com/hmcts/rpx-xui-webapp/pull/4438
- https://github.com/ganchdev/quez/pull/104
- https://github.com/xbcsmith/ymlfxr/pull/13
- https://github.com/tinapayy/B-1N1T/pull/36
- https://github.com/coder/devcontainer-features/pull/6

# Questions to ask to label the review comments
review_comment_labels: []
# - label: correct
#   question: Is this comment correct?

# Benchmark reviews generated by running
#   python -m scripts.experiment benchmark <experiment_name>
benchmark_reviews: []

@Hellebore

Reviewer's Guide

This pull request introduces a new shell script (devops/add_to_leaderboard.sh) that automates the process of evaluating a policy and adding it to the leaderboard by orchestrating existing simulation (tools.sim) and analysis (tools.analyze) tools. Supporting changes include refactoring the policy selection logic in policy_store.py for robustness against missing evaluation scores, extracting the mettagrid build process into a dedicated script, and making minor updates to Hydra configurations and trainer initialization.

File-Level Changes

Added a new shell script to automate the leaderboard submission workflow.
  • Parses command-line arguments for run name and optional wandb path.
  • Executes tools.sim to run the navigation simulation for the specified policy.
  • Executes tools.analyze to process simulation results and update the dashboard.
  • Includes basic error handling for simulation and analysis steps.
  Files: devops/add_to_leaderboard.sh

Refactored policy selection logic to improve robustness and logging.
  • Enhanced the top selector logic to check for metrics in both eval_scores and top-level metadata.
  • Implemented fallback to the latest policy if the metric is missing or too many scores are invalid.
  • Added detailed logging throughout the policy selection process.
  • Added checks for None local paths before adding to wandb artifacts.
  Files: metta/agent/policy_store.py

Extracted the mettagrid build process into a separate script.
  • Created devops/build_mettagrid.sh to handle mettagrid compilation and installation.
  • Updated devops/checkout_and_build.sh to call the new dedicated build script.
  Files: devops/build_mettagrid.sh, devops/checkout_and_build.sh

Updated Hydra evaluation configurations and trainer initialization.
  • Added _self_ default to multiple Hydra evaluation config files.
  • Initialized eval_scores attribute in Trainer to an empty dictionary.
  Files:
    configs/env/mettagrid/navigation/evals/cylinder_easy.yaml
    configs/env/mettagrid/navigation/evals/honeypot.yaml
    configs/env/mettagrid/navigation/evals/knotty.yaml
    configs/env/mettagrid/navigation/evals/memory_palace.yaml
    configs/env/mettagrid/navigation/evals/radial_large.yaml
    configs/env/mettagrid/navigation/evals/radial_mini.yaml
    configs/env/mettagrid/navigation/evals/radial_small.yaml
    configs/env/mettagrid/navigation/evals/swirls.yaml
    configs/env/mettagrid/navigation/evals/thecube.yaml
    configs/env/mettagrid/navigation/evals/walkaround.yaml
    configs/env/mettagrid/navigation/evals/wanderout.yaml
    metta/rl/pufferlib/trainer.py
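
One policy_store.py detail listed above is the check for None local paths before adding files to a wandb artifact. A minimal sketch of that kind of guard, assuming a policy record object with a local_path attribute (the attribute and function names are illustrative; only wandb.Artifact.add_file is the standard API call):

import logging

import wandb

logger = logging.getLogger(__name__)

def add_policy_file(artifact: wandb.Artifact, policy_record) -> None:
    # Records that were never materialized locally have nothing to upload; skip them
    # rather than passing None to add_file and failing inside wandb.
    local_path = getattr(policy_record, "local_path", None)
    if local_path is None:
        logger.warning("Skipping %s: no local path available", policy_record)
        return
    artifact.add_file(local_path)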



@Hellebore Hellebore left a comment


Hey @Hellebore - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Using eval for command execution may introduce security concerns. (link)

General comments:

  • The logic within the _policy_records method, particularly for the "top" selector, has grown complex; consider refactoring parts of this logic into smaller helper functions.

  • The add_to_leaderboard.sh script uses eval to run constructed commands; consider executing the commands directly to avoid potential quoting issues or security risks associated with eval.

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🔴 Security: 1 blocking issue
  • 🟢 Testing: all looks good
  • 🟢 Documentation: all looks good

LangSmith trace


@@ -74,7 +74,7 @@ def __init__(
        self.eval_stats_logger = EvalStatsLogger(self.sim_suite_config, wandb_run)
        self.average_reward = 0.0 # Initialize average reward estimate
        self._current_eval_score = None
        self.eval_scores = None

suggestion (bug_risk): Changed initialization of eval_scores to an empty dictionary.

Verify downstream code treats eval_scores as a dict—this avoids null checks but requires consistent usage.

Suggested implementation:

if not self.eval_scores:
    self.eval_scores["latest"] = current_score

Depending on your downstream code usage you may have to:

  1. Change all conditional checks that compare self.eval_scores to None (e.g., “if self.eval_scores is None:”) into checks for emptiness (e.g., “if not self.eval_scores:”).
  2. Replace any method calls (like append, extend, etc.) on self.eval_scores with dictionary updates that use keys.
  3. Ensure that downstream code which reads from eval_scores uses the correct dictionary key(s) rather than assuming a list.
    Adjust the key names (“latest” in the example) to match the intended logic of your evaluation scoring.
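
For instance, the shift from an Optional value to a dict might look like this in isolation (the "navigation" key and the literal score are purely illustrative, not taken from the codebase):

eval_scores: dict[str, float] = {}      # initialized as an empty dict rather than None

eval_scores["navigation"] = 0.87        # write a score under a suite key
latest = eval_scores.get("navigation")  # read; returns None if that suite has not run
if not eval_scores:                     # emptiness check replaces the old "is None" check
    print("No evaluation scores recorded yet")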

echo "Step 2: Running simulation..."
SIM_CMD="python3 -m tools.sim sim=navigation run=\"$RUN_NAME\" policy_uri=\"$WANDB_PATH\" +eval_db_uri=wandb://artifacts/navigation_db $ADDITIONAL_ARGS"
echo "Executing: $SIM_CMD"
eval $SIM_CMD

🚨 suggestion (security): Using eval for command execution may introduce security concerns.

Sanitize all arguments to prevent shell injection, or use an array-based command invocation instead of eval.

Suggested implementation:

# Step 2: Run the simulation using array-based command execution
echo "Step 2: Running simulation..."
# Split ADDITIONAL_ARGS into an array
read -r -a additional_args <<< "$ADDITIONAL_ARGS"
SIM_CMD_ARRAY=(python3 -m tools.sim sim=navigation "run=${RUN_NAME}" "policy_uri=${WANDB_PATH}" "+eval_db_uri=wandb://artifacts/navigation_db")
SIM_CMD_ARRAY+=( "${additional_args[@]}" )
echo "Executing: ${SIM_CMD_ARRAY[*]}"
"${SIM_CMD_ARRAY[@]}"

# Step 3: Analyze and update dashboard using array-based command execution
echo "Step 3: Analyzing results and updating dashboard..."
# Split ADDITIONAL_ARGS into an array for analyze (if needed)
read -r -a additional_args_analyze <<< "$ADDITIONAL_ARGS"
ANALYZE_CMD_ARRAY=(python3 -m tools.analyze "run=analyze" "+eval_db_uri=wandb://artifacts/navigation_db" "analyzer.output_path=s3://softmax-public/policydash/dashboard.html" "+analyzer.num_output_policies=all")
ANALYZE_CMD_ARRAY+=( "${additional_args_analyze[@]}" )
echo "Executing: ${ANALYZE_CMD_ARRAY[*]}"
"${ANALYZE_CMD_ARRAY[@]}"

Note: Make sure the variable ADDITIONAL_ARGS is properly defined and does not contain unintended extra characters, and review the rest of the script so that the read -r -a additional_args usage does not interfere with other parts of the code.

Comment on lines +26 to +27
# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"

suggestion (bug_risk): Usage of 'readlink -f' may have cross-platform issues.

macOS lacks 'readlink -f'. Consider using a portable method (e.g., realpath or a shell function) to determine the script directory.

Suggested change
# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
# Get the directory where this script is located using a portable method
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd -P)"
