
feat: Leaderboard Submission Script #1


Open
wants to merge 8 commits into base: base-sha/5350f947594f1393f0a46bccec214ffd94ca5dc1

Conversation


@Hellebore Hellebore commented May 2, 2025

This PR introduces a new utility script (devops/add_to_leaderboard.sh) that streamlines the process of adding policies to our navigation evaluation leaderboard. The script handles the end-to-end workflow of submitting a policy, running the navigation simulation, and updating the dashboard.

Changes:

  • Add new shell script add_to_leaderboard.sh with clear documentation and error handling
  • Update selector type in sim_job.yaml from "top" to "latest" for consistent evaluation
  • Fix bug in policy_store.py to properly handle missing eval scores in metadata
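
As a rough illustration of the policy_store.py fix (a sketch only; the record structure, metric name, and helper names below are assumptions rather than the actual implementation):

def _score_for(metadata: dict, metric: str):
    # Look in the nested eval_scores dict first, then fall back to the top level of the metadata.
    eval_scores = metadata.get("eval_scores") or {}
    if metric in eval_scores:
        return eval_scores[metric]
    return metadata.get(metric)  # None when the score was never recorded

def _select_top(records: list, metric: str):
    # Prefer the record with the best score; fall back to the latest record
    # if no candidate has the metric at all.
    scored = [(record, _score_for(record.metadata, metric)) for record in records]
    valid = [(record, score) for record, score in scored if score is not None]
    if not valid:
        return records[-1]  # assumes records are ordered oldest to newest
    return max(valid, key=lambda pair: pair[1])[0]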

Usage:

devops/add_to_leaderboard.sh -r <run_name> [+hardware=macbook] [other Hydra overrides]

Example:

devops/add_to_leaderboard.sh -r b.rwalters.64r +hardware=macbook

Dashboard URL

Summary by Sourcery

Introduce a new leaderboard submission script and improve policy selection and build processes for navigation evaluation

New Features:

  • Add new utility script add_to_leaderboard.sh for streamlined policy submission and evaluation

Bug Fixes:

  • Fix handling of missing evaluation scores in policy metadata
  • Add proper error handling for policy record local path

Enhancements:

  • Improve policy record selection logic in policy_store.py
  • Add dedicated build script for mettagrid
  • Update navigation evaluation configs to include self-defaults

Deployment:

  • Create script to automate leaderboard policy submission
  • Update dashboard generation process

Chores:

  • Refactor dependency build process
  • Improve logging in policy selection

@Hellebore

This is a benchmark review for experiment bakeoff.
Run ID: bakeoff/benchmark_2025-05-02T11-38-04_v1-36-0-dirty.

This pull request was cloned from https://github.com/Metta-AI/metta/pull/277. (Note: the URL is not a link to avoid triggering a notification on the original pull request.)

Experiment configuration
review_config:
  # User configuration for the review
  # - benchmark - use the user config from the benchmark reviews
  # - <value> - use the value directly
  user_review_config:
    enable_ai_review: true
    enable_rule_comments: false

    enable_complexity_comments: benchmark
    enable_security_comments: benchmark
    enable_tests_comments: benchmark
    enable_comment_suggestions: benchmark

    enable_pull_request_summary: benchmark
    enable_review_guide: benchmark

    enable_approvals: false
    base_branches: [base-sha.*]

  ai_review_config:
    # The model responses to use for the experiment
    # - benchmark - use the model responses from the benchmark reviews
    # - llm - call the language model to generate responses
    model_responses:
      comments_model: benchmark
      comment_validation_model: benchmark
      comment_suggestion_model: benchmark
      complexity_model: benchmark
      security_model: benchmark
      tests_model: benchmark
      pull_request_summary_model: benchmark
      review_guide_model: benchmark
      overall_comments_model: benchmark

# The pull request dataset to run the experiment on
pull_request_dataset:
# CodeRabbit
- https://github.com/neerajkumar161/node-coveralls-integration/pull/5
- https://github.com/gunner95/vertx-rest/pull/1
- https://github.com/Altinn/altinn-access-management-frontend/pull/1427
- https://github.com/theMr17/github-notifier/pull/14
- https://github.com/bearycool11/AI_memory_Loops/pull/142

# Greptile
- https://github.com/gumloop/guMCP/pull/119
- https://github.com/autoblocksai/python-sdk/pull/335
- https://github.com/grepdemos/ImageSharp/pull/6
- https://github.com/grepdemos/server/pull/61
- https://github.com/websentry-ai/pipelines/pull/25

# Graphite
- https://github.com/KittyCAD/modeling-app/pull/6648
- https://github.com/KittyCAD/modeling-app/pull/6628
- https://github.com/Varedis-Org/AI-Test-Repo/pull/2
- https://github.com/deeep-network/bedrock/pull/198
- https://github.com/Metta-AI/metta/pull/277

# Copilot
- https://github.com/hmcts/rpx-xui-webapp/pull/4438
- https://github.com/ganchdev/quez/pull/104
- https://github.com/xbcsmith/ymlfxr/pull/13
- https://github.com/tinapayy/B-1N1T/pull/36
- https://github.com/coder/devcontainer-features/pull/6

# Questions to ask to label the review comments
review_comment_labels: []
# - label: correct
#   question: Is this comment correct?

# Benchmark reviews generated by running
#   python -m scripts.experiment benchmark <experiment_name>
benchmark_reviews: []

@Hellebore

Reviewer's Guide

This pull request introduces a new shell script (devops/add_to_leaderboard.sh) that automates the process of evaluating a policy and adding it to the leaderboard by orchestrating existing simulation (tools.sim) and analysis (tools.analyze) tools. Supporting changes include refactoring the policy selection logic in policy_store.py for robustness against missing evaluation scores, extracting the mettagrid build process into a dedicated script, and making minor updates to Hydra configurations and trainer initialization.

File-Level Changes

Added a new shell script to automate the leaderboard submission workflow.
  • Parses command-line arguments for run name and optional wandb path.
  • Executes tools.sim to run the navigation simulation for the specified policy.
  • Executes tools.analyze to process simulation results and update the dashboard.
  • Includes basic error handling for simulation and analysis steps.
  Files: devops/add_to_leaderboard.sh

Refactored policy selection logic to improve robustness and logging.
  • Enhanced the top selector logic to check for metrics in both eval_scores and top-level metadata.
  • Implemented fallback to the latest policy if the metric is missing or too many scores are invalid.
  • Added detailed logging throughout the policy selection process.
  • Added checks for None local paths before adding to wandb artifacts.
  Files: metta/agent/policy_store.py

Extracted the mettagrid build process into a separate script.
  • Created devops/build_mettagrid.sh to handle mettagrid compilation and installation.
  • Updated devops/checkout_and_build.sh to call the new dedicated build script.
  Files: devops/build_mettagrid.sh, devops/checkout_and_build.sh

Updated Hydra evaluation configurations and trainer initialization.
  • Added _self_ default to multiple Hydra evaluation config files.
  • Initialized eval_scores attribute in Trainer to an empty dictionary.
  Files:
    configs/env/mettagrid/navigation/evals/cylinder_easy.yaml
    configs/env/mettagrid/navigation/evals/honeypot.yaml
    configs/env/mettagrid/navigation/evals/knotty.yaml
    configs/env/mettagrid/navigation/evals/memory_palace.yaml
    configs/env/mettagrid/navigation/evals/radial_large.yaml
    configs/env/mettagrid/navigation/evals/radial_mini.yaml
    configs/env/mettagrid/navigation/evals/radial_small.yaml
    configs/env/mettagrid/navigation/evals/swirls.yaml
    configs/env/mettagrid/navigation/evals/thecube.yaml
    configs/env/mettagrid/navigation/evals/walkaround.yaml
    configs/env/mettagrid/navigation/evals/wanderout.yaml
    metta/rl/pufferlib/trainer.py
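
One policy_store.py detail listed above is the check for None local paths before adding files to a wandb artifact. A minimal sketch of that kind of guard, assuming a policy record object with a local_path attribute (the attribute and function names are illustrative; only wandb.Artifact.add_file is the standard API call):

import logging

import wandb

logger = logging.getLogger(__name__)

def add_policy_file(artifact: wandb.Artifact, policy_record) -> None:
    # Records that were never materialized locally have nothing to upload; skip them
    # rather than passing None to add_file and failing inside wandb.
    local_path = getattr(policy_record, "local_path", None)
    if local_path is None:
        logger.warning("Skipping %s: no local path available", policy_record)
        return
    artifact.add_file(local_path)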



@Hellebore Hellebore left a comment


Hey @Hellebore - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Using eval for command execution may introduce security concerns. (link)

General comments:

  • The logic within the _policy_records method, particularly for the "top" selector, has grown complex; consider refactoring parts of this logic into smaller helper functions.

  • The add_to_leaderboard.sh script uses eval to run constructed commands; consider executing the commands directly to avoid potential quoting issues or security risks associated with eval.

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🔴 Security: 1 blocking issue
  • 🟢 Testing: all looks good
  • 🟢 Documentation: all looks good

LangSmith trace


@@ -74,7 +74,7 @@ def __init__(
        self.eval_stats_logger = EvalStatsLogger(self.sim_suite_config, wandb_run)
        self.average_reward = 0.0 # Initialize average reward estimate
        self._current_eval_score = None
        self.eval_scores = None

suggestion (bug_risk): Changed initialization of eval_scores to an empty dictionary.

Verify downstream code treats eval_scores as a dict—this avoids null checks but requires consistent usage.

Suggested implementation:

if not self.eval_scores:
    self.eval_scores["latest"] = current_score

Depending on your downstream code usage you may have to:

  1. Change all conditional checks that compare self.eval_scores to None (e.g., “if self.eval_scores is None:”) into checks for emptiness (e.g., “if not self.eval_scores:”).
  2. Replace any method calls (like append, extend, etc.) on self.eval_scores with dictionary updates that use keys.
  3. Ensure that downstream code which reads from eval_scores uses the correct dictionary key(s) rather than assuming a list.
    Adjust the key names (“latest” in the example) to match the intended logic of your evaluation scoring.
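
For instance, the shift from an Optional value to a dict might look like this in isolation (the "navigation" key and the literal score are purely illustrative, not taken from the codebase):

eval_scores: dict[str, float] = {}      # initialized as an empty dict rather than None

eval_scores["navigation"] = 0.87        # write a score under a suite key
latest = eval_scores.get("navigation")  # read; returns None if that suite has not run
if not eval_scores:                     # emptiness check replaces the old "is None" check
    print("No evaluation scores recorded yet")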

echo "Step 2: Running simulation..."
SIM_CMD="python3 -m tools.sim sim=navigation run=\"$RUN_NAME\" policy_uri=\"$WANDB_PATH\" +eval_db_uri=wandb://artifacts/navigation_db $ADDITIONAL_ARGS"
echo "Executing: $SIM_CMD"
eval $SIM_CMD

🚨 suggestion (security): Using eval for command execution may introduce security concerns.

Sanitize all arguments to prevent shell injection, or use an array-based command invocation instead of eval.

Suggested implementation:

# Step 2: Run the simulation using array-based command execution
echo "Step 2: Running simulation..."
# Split ADDITIONAL_ARGS into an array
read -r -a additional_args <<< "$ADDITIONAL_ARGS"
SIM_CMD_ARRAY=(python3 -m tools.sim sim=navigation "run=${RUN_NAME}" "policy_uri=${WANDB_PATH}" "+eval_db_uri=wandb://artifacts/navigation_db")
SIM_CMD_ARRAY+=( "${additional_args[@]}" )
echo "Executing: ${SIM_CMD_ARRAY[*]}"
"${SIM_CMD_ARRAY[@]}"

# Step 3: Analyze and update dashboard using array-based command execution
echo "Step 3: Analyzing results and updating dashboard..."
# Split ADDITIONAL_ARGS into an array for analyze (if needed)
read -r -a additional_args_analyze <<< "$ADDITIONAL_ARGS"
ANALYZE_CMD_ARRAY=(python3 -m tools.analyze "run=analyze" "+eval_db_uri=wandb://artifacts/navigation_db" "analyzer.output_path=s3://softmax-public/policydash/dashboard.html" "+analyzer.num_output_policies=all")
ANALYZE_CMD_ARRAY+=( "${additional_args_analyze[@]}" )
echo "Executing: ${ANALYZE_CMD_ARRAY[*]}"
"${ANALYZE_CMD_ARRAY[@]}"

Note: Make sure the variable ADDITIONAL_ARGS is properly defined and does not contain unintended extra characters, and review the rest of the script so that the read -r -a additional_args usage does not interfere with other parts of the code.

Comment on lines +26 to +27
# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"

suggestion (bug_risk): Usage of 'readlink -f' may have cross-platform issues.

macOS lacks 'readlink -f'. Consider using a portable method (e.g., realpath or a shell function) to determine the script directory.

Suggested change
# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
# Get the directory where this script is located using a portable method
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd -P)"
