feat: Leaderboard Submission Script #1
base: base-sha/5350f947594f1393f0a46bccec214ffd94ca5dc1
Conversation
This is a benchmark review for an experiment. This pull request was cloned from the original pull request.

Experiment configuration (review_config):

review_config:
  # User configuration for the review
  # - benchmark - use the user config from the benchmark reviews
  # - <value> - use the value directly
  user_review_config:
    enable_ai_review: true
    enable_rule_comments: false
    enable_complexity_comments: benchmark
    enable_security_comments: benchmark
    enable_tests_comments: benchmark
    enable_comment_suggestions: benchmark
    enable_pull_request_summary: benchmark
    enable_review_guide: benchmark
    enable_approvals: false
    base_branches: [base-sha.*]
  ai_review_config:
    # The model responses to use for the experiment
    # - benchmark - use the model responses from the benchmark reviews
    # - llm - call the language model to generate responses
    model_responses:
      comments_model: benchmark
      comment_validation_model: benchmark
      comment_suggestion_model: benchmark
      complexity_model: benchmark
      security_model: benchmark
      tests_model: benchmark
      pull_request_summary_model: benchmark
      review_guide_model: benchmark
      overall_comments_model: benchmark
# The pull request dataset to run the experiment on
pull_request_dataset:
# CodeRabbit
- https://github.com/neerajkumar161/node-coveralls-integration/pull/5
- https://github.com/gunner95/vertx-rest/pull/1
- https://github.com/Altinn/altinn-access-management-frontend/pull/1427
- https://github.com/theMr17/github-notifier/pull/14
- https://github.com/bearycool11/AI_memory_Loops/pull/142
# Greptile
- https://github.com/gumloop/guMCP/pull/119
- https://github.com/autoblocksai/python-sdk/pull/335
- https://github.com/grepdemos/ImageSharp/pull/6
- https://github.com/grepdemos/server/pull/61
- https://github.com/websentry-ai/pipelines/pull/25
# Graphite
- https://github.com/KittyCAD/modeling-app/pull/6648
- https://github.com/KittyCAD/modeling-app/pull/6628
- https://github.com/Varedis-Org/AI-Test-Repo/pull/2
- https://github.com/deeep-network/bedrock/pull/198
- https://github.com/Metta-AI/metta/pull/277
# Copilot
- https://github.com/hmcts/rpx-xui-webapp/pull/4438
- https://github.com/ganchdev/quez/pull/104
- https://github.com/xbcsmith/ymlfxr/pull/13
- https://github.com/tinapayy/B-1N1T/pull/36
- https://github.com/coder/devcontainer-features/pull/6
# Questions to ask to label the review comments
review_comment_labels: []
# - label: correct
#   question: Is this comment correct?
# Benchmark reviews generated by running
#   python -m scripts.experiment benchmark <experiment_name>
benchmark_reviews: []
Reviewer's Guide
This pull request introduces a new shell script (devops/add_to_leaderboard.sh) that automates adding a policy to the navigation evaluation leaderboard.
File-Level Changes
Hey @Hellebore - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Using eval for command execution may introduce security concerns. (link)

General comments:
- The logic within the _policy_records method, particularly for the "top" selector, has grown complex; consider refactoring parts of this logic into smaller helper functions.
- The add_to_leaderboard.sh script uses eval to run constructed commands; consider executing the commands directly to avoid potential quoting issues or security risks associated with eval.
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🔴 Security: 1 blocking issue
- 🟢 Testing: all looks good
- 🟢 Documentation: all looks good
@@ -74,7 +74,7 @@ def __init__(
        self.eval_stats_logger = EvalStatsLogger(self.sim_suite_config, wandb_run)
        self.average_reward = 0.0  # Initialize average reward estimate
        self._current_eval_score = None
-       self.eval_scores = None
+       self.eval_scores = {}
suggestion (bug_risk): Changed initialization of eval_scores to an empty dictionary.
Verify downstream code treats eval_scores as a dict—this avoids null checks but requires consistent usage.
Suggested implementation:

# With eval_scores now a dict, seed a default entry when it is empty
if not self.eval_scores:
    self.eval_scores["latest"] = current_score
Depending on your downstream code usage you may have to:
- Change all conditional checks that compare self.eval_scores to None (e.g., “if self.eval_scores is None:”) into checks for emptiness (e.g., “if not self.eval_scores:”).
- Replace any method calls (like append, extend, etc.) on self.eval_scores with dictionary updates that use keys.
- Ensure that downstream code which reads from eval_scores uses the correct dictionary key(s) rather than assuming a list.
Adjust the key names ("latest" in the example) to match the intended logic of your evaluation scoring; see the sketch below for what this migration looks like end to end.
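To make that concrete, here is a minimal, self-contained sketch of the before/after pattern. EvalTracker, record, and latest are hypothetical names used only for illustration; they do not appear in the PR.

from typing import Optional

class EvalTracker:
    """Illustrative stand-in for the class that owns eval_scores."""

    def __init__(self) -> None:
        # After the change: always a dict (was: self.eval_scores = None)
        self.eval_scores: dict[str, float] = {}

    def record(self, key: str, score: float) -> None:
        # Dictionary update replaces any previous list-style append
        self.eval_scores[key] = score

    def latest(self) -> Optional[float]:
        # Emptiness check replaces "if self.eval_scores is None"
        if not self.eval_scores:
            return None
        return self.eval_scores.get("latest")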
echo "Step 2: Running simulation..." | ||
SIM_CMD="python3 -m tools.sim sim=navigation run=\"$RUN_NAME\" policy_uri=\"$WANDB_PATH\" +eval_db_uri=wandb://artifacts/navigation_db $ADDITIONAL_ARGS" | ||
echo "Executing: $SIM_CMD" | ||
eval $SIM_CMD |
🚨 suggestion (security): Using eval for command execution may introduce security concerns.
Sanitize all arguments to prevent shell injection, or use an array-based command invocation instead of eval.
Suggested implementation:
# Step 2: Run the simulation using array-based command execution
echo "Step 2: Running simulation..."
# Split ADDITIONAL_ARGS into an array
read -r -a additional_args <<< "$ADDITIONAL_ARGS"
SIM_CMD_ARRAY=(python3 -m tools.sim sim=navigation "run=${RUN_NAME}" "policy_uri=${WANDB_PATH}" "+eval_db_uri=wandb://artifacts/navigation_db")
SIM_CMD_ARRAY+=( "${additional_args[@]}" )
echo "Executing: ${SIM_CMD_ARRAY[*]}"
"${SIM_CMD_ARRAY[@]}"
# Step 3: Analyze and update dashboard using array-based command execution
echo "Step 3: Analyzing results and updating dashboard..."
# Split ADDITIONAL_ARGS into an array for analyze (if needed)
read -r -a additional_args_analyze <<< "$ADDITIONAL_ARGS"
ANALYZE_CMD_ARRAY=(python3 -m tools.analyze "run=analyze" "+eval_db_uri=wandb://artifacts/navigation_db" "analyzer.output_path=s3://softmax-public/policydash/dashboard.html" "+analyzer.num_output_policies=all")
ANALYZE_CMD_ARRAY+=( "${additional_args_analyze[@]}" )
echo "Executing: ${ANALYZE_CMD_ARRAY[*]}"
"${ANALYZE_CMD_ARRAY[@]}"
Note: Ensure that the variable ADDITIONAL_ARGS does not contain unintended extra characters and is properly defined. Review the rest of the script so the usage of read -r -a additional_args does not interfere with other parts of your code.
# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
suggestion (bug_risk): Usage of 'readlink -f' may have cross-platform issues.
macOS lacks 'readlink -f'. Consider using a portable method (e.g., realpath or a shell function) to determine the script directory.
# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
# Get the directory where this script is located using a portable method
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd -P)"
This PR introduces a new utility script (devops/add_to_leaderboard.sh) that streamlines the process of adding policies to our navigation evaluation leaderboard. The script handles the end-to-end workflow of submitting a policy, running the navigation simulation, and updating the dashboard.

Changes:
- Added add_to_leaderboard.sh with clear documentation and error handling
- Changed the policy selector in sim_job.yaml from "top" to "latest" for consistent evaluation
- Updated policy_store.py to properly handle missing eval scores in metadata

Usage:
Example:
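A plausible invocation is sketched below; the positional arguments are assumptions inferred from the RUN_NAME, WANDB_PATH, and ADDITIONAL_ARGS variables the script uses, not taken from its documentation.

# Hypothetical usage: run name, W&B policy path, then optional extra args
# that are forwarded to the simulation via ADDITIONAL_ARGS
./devops/add_to_leaderboard.sh my_nav_run wandb://run/my_policy +sim.device=cpu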
Dashboard URL
Summary by Sourcery
Introduce a new leaderboard submission script and improve policy selection and build processes for navigation evaluation
New Features:
- add_to_leaderboard.sh for streamlined policy submission and evaluation

Bug Fixes:

Enhancements:
- policy_store.py: handle missing eval scores in metadata

Deployment:

Chores: