feat: Leaderboard Submission Script #1

Status: Open
Wants to merge 8 commits into base: base-sha/5350f947594f1393f0a46bccec214ffd94ca5dc1

3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/cylinder_easy.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 500
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/honeypot.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 300
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/knotty.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 800
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/memory_palace.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 300
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/radial_large.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 1000
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/radial_mini.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 300
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/radial_small.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 200
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/swirls.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 500
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/thecube.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 500
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/walkaround.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 400
3 changes: 2 additions & 1 deletion configs/env/mettagrid/navigation/evals/wanderout.yaml
@@ -1,6 +1,7 @@
defaults:
- /env/mettagrid/mettagrid@

- _self_

game:
num_agents: 20 #how many agents are in the map x2
max_steps: 800
86 changes: 86 additions & 0 deletions devops/add_to_leaderboard.sh
@@ -0,0 +1,86 @@
#!/bin/bash

# Usage function for better help messages
usage() {
echo "Usage: $0 -r RUN_NAME [-w WANDB_PATH] [additional Hydra overrides]"
echo " -r RUN_NAME Your run name (e.g., b.$USER.test_run)"
echo " -w WANDB_PATH Optional: Full wandb path if different from auto-generated"
echo ""
echo " Any additional arguments will be passed directly to the Python commands"
echo " Example: $0 -r b.$USER.test_run +hardware=macbook"
exit 1
}

# Initialize variables
RUN_NAME=""
WANDB_PATH=""
ADDITIONAL_ARGS=""

# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
-r|--run)
RUN_NAME="$2"
shift 2
;;
-w|--wandb)
WANDB_PATH="$2"
shift 2
;;
-h|--help)
usage
;;
*)
# Collect additional arguments
ADDITIONAL_ARGS="$ADDITIONAL_ARGS $1"
shift
;;
esac
done

# Check if run name is provided
if [ -z "$RUN_NAME" ]; then
echo "Error: Run name is required"
usage
fi

# Auto-generate wandb path if not provided
if [ -z "$WANDB_PATH" ]; then
WANDB_PATH="wandb://run/$RUN_NAME"
fi

echo "Adding policy to eval leaderboard with run name: $RUN_NAME"
echo "Using policy URI: $WANDB_PATH"
if [ ! -z "$ADDITIONAL_ARGS" ]; then
echo "Additional arguments: $ADDITIONAL_ARGS"
fi

# Step 1: Verifying policy exists on wandb
echo "Step 1: Verifying policy exists on wandb..."
# Add a check here if needed to verify the policy exists on wandb

# Step 2: Run the simulation
echo "Step 2: Running simulation..."
SIM_CMD="python3 -m tools.sim sim=navigation run=\"$RUN_NAME\" policy_uri=\"$WANDB_PATH\" +eval_db_uri=wandb://artifacts/navigation_db $ADDITIONAL_ARGS"
echo "Executing: $SIM_CMD"
eval $SIM_CMD

Review comment (Author):

🚨 suggestion (security): Using eval for command execution may introduce security concerns.

Sanitize all arguments to prevent shell injection, or use an array-based command invocation instead of eval.

Suggested implementation:

# Step 2: Run the simulation using array-based command execution
echo "Step 2: Running simulation..."
# Split ADDITIONAL_ARGS into an array
read -r -a additional_args <<< "$ADDITIONAL_ARGS"
SIM_CMD_ARRAY=(python3 -m tools.sim sim=navigation "run=${RUN_NAME}" "policy_uri=${WANDB_PATH}" "+eval_db_uri=wandb://artifacts/navigation_db")
SIM_CMD_ARRAY+=( "${additional_args[@]}" )
echo "Executing: ${SIM_CMD_ARRAY[*]}"
"${SIM_CMD_ARRAY[@]}"

# Step 3: Analyze and update dashboard using array-based command execution
echo "Step 3: Analyzing results and updating dashboard..."
# Split ADDITIONAL_ARGS into an array for analyze (if needed)
read -r -a additional_args_analyze <<< "$ADDITIONAL_ARGS"
ANALYZE_CMD_ARRAY=(python3 -m tools.analyze "run=analyze" "+eval_db_uri=wandb://artifacts/navigation_db" "analyzer.output_path=s3://softmax-public/policydash/dashboard.html" "+analyzer.num_output_policies=all")
ANALYZE_CMD_ARRAY+=( "${additional_args_analyze[@]}" )
echo "Executing: ${ANALYZE_CMD_ARRAY[*]}"
"${ANALYZE_CMD_ARRAY[@]}"

Note: Ensure that ADDITIONAL_ARGS is properly defined and does not contain unintended extra characters, and check that the read -r -a additional_args usage does not interfere with other parts of the script.
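
For illustration, a minimal sketch of the difference between the two forms (the EXTRA variable and its value are hypothetical; the module invocation mirrors the one above):

# With eval, the value is re-parsed by the shell, so metacharacters take effect:
EXTRA='+note=foo; echo oops'
eval "python3 -m tools.sim $EXTRA"   # the semicolon starts a second command, so "echo oops" runs

# With an array, the same value reaches the program as a single argv entry:
CMD=(python3 -m tools.sim "$EXTRA")
"${CMD[@]}"                          # '+note=foo; echo oops' is passed verbatim as one argument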


# Check if the sim was successful
if [ $? -ne 0 ]; then
echo "Error: Simulation failed. Exiting."
exit 1
fi

# Step 3: Analyze and update dashboard
echo "Step 3: Analyzing results and updating dashboard..."
ANALYZE_CMD="python3 -m tools.analyze run=analyze +eval_db_uri=wandb://artifacts/navigation_db analyzer.output_path=s3://softmax-public/policydash/dashboard.html +analyzer.num_output_policies=\"all\" $ADDITIONAL_ARGS"
echo "Executing: $ANALYZE_CMD"
eval $ANALYZE_CMD

if [ $? -ne 0 ]; then
echo "Error: Analysis failed. Exiting."
exit 1
fi

echo "Successfully added policy to leaderboard and updated dashboard!"
echo "Dashboard URL: https://softmax-public.s3.amazonaws.com/policydash/dashboard.html"
62 changes: 62 additions & 0 deletions devops/build_mettagrid.sh
@@ -0,0 +1,62 @@
#!/bin/bash

# This script rebuilds mettagrid without rebuilding other dependencies

# Exit immediately if a command exits with a non-zero status
set -e

# Parse command line arguments
CLEAN=0
for arg in "$@"; do
case $arg in
--clean)
CLEAN=1
shift
;;
esac
done

# Display appropriate header based on clean flag
if [ "$CLEAN" -eq 1 ]; then
echo "========== Rebuilding mettagrid (clean) =========="
else
echo "========== Rebuilding mettagrid =========="
fi

# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"

Review comment (Author) on lines +26 to +27:

suggestion (bug_risk): Usage of 'readlink -f' may have cross-platform issues.

macOS lacks 'readlink -f'. Consider using a portable method (e.g., realpath or a shell function) to determine the script directory.

Suggested change
# Get the directory where this script is located
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
# Get the directory where this script is located using a portable method
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd -P)"


# Go to the project root directory
cd "$SCRIPT_DIR/.."

# Check if deps/mettagrid exists
if [ ! -d "deps/mettagrid" ]; then
echo "Error: mettagrid directory not found at deps/mettagrid"
echo "Make sure you have run the full dependency installation script first."
exit 1
fi

# Navigate to mettagrid directory
cd deps/mettagrid

echo "Building mettagrid in $(pwd)"

# Clean build artifacts only if --clean flag is specified
if [ "$CLEAN" -eq 1 ]; then
echo "Cleaning previous build artifacts..."
rm -rf build
find . -name "*.so" -delete
echo "Clean completed."
else
echo "Skipping clean (use --clean to remove previous build artifacts)"
fi

# Rebuild mettagrid
echo "Rebuilding mettagrid..."
python setup.py build_ext --inplace

# Reinstall in development mode
echo "Reinstalling mettagrid in development mode..."
pip install -e .

echo "========== mettagrid rebuild complete =========="
11 changes: 5 additions & 6 deletions devops/checkout_and_build.sh
@@ -118,12 +118,11 @@ mkdir -p deps
cd deps

# ========== METTAGRID ==========
# Note that version control for the mettagrid package has been brought into our monorepo
cd mettagrid
echo "Building mettagrid into $(pwd)"
python setup.py build_ext --inplace
pip install -e .
cd ..
# Call the dedicated build_mettagrid.sh script instead of building directly
echo "Building mettagrid using devops/build_mettagrid.sh"
cd .. # Go back to project root
devops/build_mettagrid.sh
cd deps # Return to deps directory for remaining dependencies

# Install dependencies using the function
install_repo "fast_gae" $FAST_GAE_REPO "main" "python setup.py build_ext --inplace && pip install -e ."
80 changes: 59 additions & 21 deletions metta/agent/policy_store.py
@@ -93,7 +93,6 @@ def _policy_records(self, uri, selector_type="top", n=1, metric: str = "score"):
prs = self._prs_from_wandb_sweep(sweep_name, version)
else:
prs = self._prs_from_wandb_artifact(wandb_uri, version)

elif uri.startswith("file://"):
prs = self._prs_from_path(uri[len("file://") :])
elif uri.startswith("puffer://"):
@@ -104,42 +103,75 @@
if len(prs) == 0:
raise ValueError(f"No policies found at {uri}")

logger.info(f"Found {len(prs)} policies at {uri}")

if selector_type == "all":
logger.info(f"Returning all {len(prs)} policies")
return prs

elif selector_type == "latest":
return [prs[0]]

selected = [prs[0]]
logger.info(f"Selected latest policy: {selected[0].name}")
return selected
elif selector_type == "rand":
return [random.choice(prs)]

selected = [random.choice(prs)]
logger.info(f"Selected random policy: {selected[0].name}")
return selected
elif selector_type == "top":
if metric not in prs[0].metadata:
# check if the metric is in eval_scores
if "eval_scores" in prs[0].metadata and metric in prs[0].metadata["eval_scores"]:
policy_scores = {p: p.metadata["eval_scores"].get(metric, None) for p in prs}
else:
logger.warning(f"Metric {metric} not found in policy metadata, returning latest policy")
return [prs[0]] #
else:
if (
"eval_scores" in prs[0].metadata
and prs[0].metadata["eval_scores"] is not None
and metric in prs[0].metadata["eval_scores"]
):
# Metric is in eval_scores
logger.info(f"Found metric '{metric}' in metadata['eval_scores']")
policy_scores = {p: p.metadata.get("eval_scores", {}).get(metric, None) for p in prs}
elif metric in prs[0].metadata:
# Metric is directly in metadata
logger.info(f"Found metric '{metric}' directly in metadata")
policy_scores = {p: p.metadata.get(metric, None) for p in prs}
else:
# Metric not found anywhere
logger.warning(
f"Metric '{metric}' not found in policy metadata or eval_scores, returning latest policy"
)
selected = [prs[0]]
logger.info(f"Selected latest policy (due to missing metric): {selected[0].name}")
return selected

policies_with_scores = [p for p, s in policy_scores.items() if s is not None]

# If more than 20% of the policies have no score, return the latest policy
if len(policies_with_scores) < len(prs) * 0.8:
logger.warning("Too many invalid scores, returning latest policy")
return [prs[0]] # return latest if metric not found
top = sorted(policies_with_scores, key=lambda p: policy_scores[p])[-n:]
selected = [prs[0]] # return latest if metric not found
logger.info(f"Selected latest policy (due to too many invalid scores): {selected[0].name}")
return selected

# Sort by metric score (assuming higher is better)
def get_policy_score(policy: PolicyRecord) -> float: # Explicitly return a comparable type
score = policy_scores.get(policy)
if score is None:
return float("-inf") # Or another appropriate default
return score

top = sorted(policies_with_scores, key=get_policy_score)[-n:]

if len(top) < n:
logger.warning(f"Only found {len(top)} policies matching criteria, requested {n}")

logger.info(f"Top {n} policies by {metric}:")
logger.info(f"Top {len(top)} policies by {metric}:")
logger.info(f"{'Policy':<40} | {metric:<20}")
logger.info("-" * 62)
for pr in top:
logger.info(f"{pr.name:<40} | {pr.metadata.get(metric, 0):<20.4f}")
score = policy_scores[pr]
logger.info(f"{pr.name:<40} | {score:<20.4f}")

selected = top[-n:]
logger.info(f"Selected {len(selected)} top policies by {metric}")
for i, pr in enumerate(selected):
logger.info(f" {i + 1}. {pr.name} (score: {policy_scores[pr]:.4f})")

return top[-n:]
return selected
else:
raise ValueError(f"Invalid selector type {selector_type}")

@@ -180,10 +212,16 @@ def save(self, name: str, path: str, policy: nn.Module, metadata: dict):
return pr

def add_to_wandb_run(self, run_id: str, pr: PolicyRecord, additional_files=None):
return self.add_to_wandb_artifact(run_id, "model", pr.metadata, pr.local_path(), additional_files)
local_path = pr.local_path()
if local_path is None:
raise ValueError("PolicyRecord has no local path")
return self.add_to_wandb_artifact(run_id, "model", pr.metadata, local_path, additional_files)

def add_to_wandb_sweep(self, sweep_name: str, pr: PolicyRecord, additional_files=None):
return self.add_to_wandb_artifact(sweep_name, "sweep_model", pr.metadata, pr.local_path(), additional_files)
local_path = pr.local_path()
if local_path is None:
raise ValueError("PolicyRecord has no local path")
return self.add_to_wandb_artifact(sweep_name, "sweep_model", pr.metadata, local_path, additional_files)

def add_to_wandb_artifact(self, name: str, type: str, metadata: dict, local_path: str, additional_files=None):
if self._wandb_run is None:
2 changes: 1 addition & 1 deletion metta/rl/pufferlib/trainer.py
@@ -74,7 +74,7 @@ def __init__(
self.eval_stats_logger = EvalStatsLogger(self.sim_suite_config, wandb_run)
self.average_reward = 0.0 # Initialize average reward estimate
self._current_eval_score = None
self.eval_scores = None

Review comment (Author):

suggestion (bug_risk): Changed initialization of eval_scores to an empty dictionary.

Verify downstream code treats eval_scores as a dict—this avoids null checks but requires consistent usage.

Suggested implementation:

    if not self.eval_scores:                        # emptiness check instead of an "is None" check
        ...
    self.eval_scores["latest"] = current_score      # dict-style update ("latest" is the example key)

Depending on your downstream code usage you may have to:

  1. Change all conditional checks that compare self.eval_scores to None (e.g., “if self.eval_scores is None:”) into checks for emptiness (e.g., “if not self.eval_scores:”).
  2. Replace any method calls (like append, extend, etc.) on self.eval_scores with dictionary updates that use keys.
  3. Ensure that downstream code which reads from eval_scores uses the correct dictionary key(s) rather than assuming a list.
    Adjust the key names (“latest” in the example) to match the intended logic of your evaluation scoring.

self.eval_scores = {}
self._eval_results = []
self._weights_helper = WeightsMetricsHelper(cfg)
self._make_vecenv()