
feat: benchmarks - results gathering and visualization #542


Draft
wants to merge 17 commits into base: development from jm/feat/rai_bench_results_gathering_and_visualization

Conversation


@jmatejcz jmatejcz commented Apr 24, 2025

Purpose

Update what is saved as benchmark results and how it is saved.
Create a script for visualizing these results in a meaningful way.

Proposed Changes

  • Base benchmark class for all benchmarks, unifying results gathering and saving
  • Testing models via src/rai_bench/rai_bench/examples/test_models.py will now produce a separate folder for every run, structured like:
    run_2023-04-30_12-36-49/
    ├── <bench_name_1>/
    │   ├── <model_name_1>/
    │   │   ├── benchmark.log
    │   │   ├── results_summary.csv
    │   │   └── results.csv
    │   └── <model_name_2>/
    │       ├── benchmark.log
    │       ├── results_summary.csv
    │       └── results.csv
    └── <bench_name_2>/
        ├── <model_name_1>/
        │   ├── benchmark.log
        │   ├── results_summary.csv
        │   └── results.csv
        └── <model_name_2>/
            ├── benchmark.log
            ├── results_summary.csv
            └── results.csv
  • The result summary file (results_summary.csv by default) will be the same for all benchmarks
  • Detailed results (results.csv by default) will differ per benchmark
  • An extra tool calls argument was added to benchmark runs (not yet implemented for manipulation o3de)
  • Results processing logic added in rai_bench/results_processing (a minimal loading sketch is shown below)
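
For a quick look at a run outside the Streamlit app, the per-model summary files can be collected with pandas. This is only a sketch: it assumes the directory layout shown above, and the columns inside results_summary.csv are whatever the benchmark writes (not spelled out here).

```python
from pathlib import Path

import pandas as pd


def load_run_summaries(run_dir: str) -> pd.DataFrame:
    """Concatenate every results_summary.csv found under one run_* directory."""
    frames = []
    for summary_path in Path(run_dir).glob("*/*/results_summary.csv"):
        df = pd.read_csv(summary_path)
        # Folder names carry the benchmark and model, per the tree above.
        df["benchmark"] = summary_path.parts[-3]
        df["model"] = summary_path.parts[-2]
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# e.g. summaries = load_run_summaries("run_2023-04-30_12-36-49")
```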

Issues

#530
#531

Testing

get some results:

python src/rai_bench/rai_bench/examples/test_models.py 

then run:

streamlit run src/rai_bench/rai_bench/visualise_results.py

Summary by CodeRabbit

  • New Features

    • Introduced an interactive Streamlit-based visualization interface for benchmark results, enabling users to analyze and compare model and task performance with charts and tables.
    • Added comprehensive data loading and processing utilities for structured benchmark result analysis.
    • Added new result tracking and summary features for both manipulation and tool-calling agent benchmarks.
  • Enhancements

    • Unified and clarified prompt and scoring interfaces for tasks.
    • Added support for tracking extra tool calls and detailed validation results.
    • Improved documentation with a new section on results storage and visualization.
  • Bug Fixes

    • Improved error handling and state management in task validators.
  • Refactor

    • Refactored benchmark classes for consistency and extensibility.
    • Standardized result data models and CSV handling.
  • Chores

    • Updated dependencies to include Plotly for visualization.
    • Added package markers and license headers.
  • Tests

    • Enhanced test coverage and rigor for validators and subtasks, including new helper functions and detailed state assertions.

@jmatejcz jmatejcz force-pushed the jm/feat/rai_bench_results_gathering_and_visualization branch 3 times, most recently from ba31482 to 3d74cf8 Compare April 29, 2025 09:24

coderabbitai bot commented Apr 29, 2025

Walkthrough

This update introduces a comprehensive overhaul of the benchmarking infrastructure, focusing on result tracking, analysis, and visualization. It standardizes result schemas using Pydantic models, establishes a new BaseBenchmark class for consistent benchmark management, and refactors both the manipulation and tool-calling agent benchmarks to inherit from this base. The prompt interface for tasks is unified, splitting prompts into system_prompt and task_prompt properties, and scoring methods are renamed for clarity. Result tracking is now explicit and structured, with new modules for loading, processing, and visualizing results via Streamlit and Plotly. The update also enhances validator and subtask tracking, improves test coverage, and documents result handling in the README.
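
As a rough sketch of the unified interface described above (only system_prompt, task_prompt, and calculate_score are taken from this PR; the rest is illustrative and signatures are simplified):

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Illustrative shape of the unified prompt/scoring interface."""

    @property
    @abstractmethod
    def system_prompt(self) -> str:
        """System-level instructions shared across tasks of a benchmark."""

    @property
    @abstractmethod
    def task_prompt(self) -> str:
        """Task-specific instruction (replaces the old get_prompt())."""

    @abstractmethod
    def calculate_score(self) -> float:
        """Score between 0 and 1; renamed from calculate_result().
        The real signature is not shown here."""
```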

Changes

File(s) / Path(s) Change Summary
src/rai_bench/README.md Added documentation on where benchmark results are saved and how to visualize them using Streamlit.
src/rai_bench/pyproject.toml Added plotly as a dependency.
src/rai_bench/rai_bench/results_processing/__init__.py Added new empty __init__.py with license header to mark the results_processing directory as a package.
src/rai_bench/rai_bench/results_processing/data_loading.py
src/rai_bench/rai_bench/results_processing/data_processing.py
src/rai_bench/rai_bench/results_processing/visualise.py
Introduced new modules for loading, processing, and visualizing benchmark results. These modules provide structured loading of CSV results into Python objects, aggregation and analysis utilities, and an interactive Streamlit interface with Plotly charts for exploring model and task performance.
src/rai_bench/rai_bench/base_benchmark.py Added a new abstract BaseBenchmark class with CSV utilities and a standard summary schema (BenchmarkSummary). All benchmarks now inherit from this class.
src/rai_bench/rai_bench/manipulation_o3de/benchmark.py
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py
Refactored benchmarks to inherit from BaseBenchmark, use structured result tracking, and support summary computation and saving after each task or scenario. Updated method signatures and result handling accordingly.
src/rai_bench/rai_bench/manipulation_o3de/results_tracking.py
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py
Introduced new Pydantic data models for scenario and task results, validator and subtask results, with typed fields and descriptions.
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py
Unified prompt interface for tasks: replaced get_prompt() with task_prompt property and added system_prompt property. Renamed calculate_result to calculate_score.
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py
src/rai_bench/rai_bench/examples/manipulation_o3de/scenarios.py
Updated imports and usage to use ManipulationO3DEBenchmark instead of Benchmark. Refactored argument parsing and output directory handling.
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py
src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py
Refactored task list construction to use functions with extra_tool_calls parameter. Updated argument parsing and benchmark invocation to support extra tool calls and output directory configuration.
src/rai_bench/rai_bench/examples/test_models.py Refactored benchmark execution to use timestamped output directories and support extra tool calls parameterization.
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py
src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py
src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py
Refactored subtask and validator interfaces: replaced dump() with info property, added type properties to tasks and validators, improved error and pass state tracking, and enforced explicit state resets. Validators now track extra tool calls and pass/fail states per subtask.
src/rai_bench/rai_bench/tool_calling_agent/validators.py Added type properties to validators, improved error handling and pass/fail state management, enforced minimum subtask count, and refactored validation logic to aggregate errors and track extra calls.
tests/rai_bench/tool_calling_agent/test_subtasks.py
tests/rai_bench/tool_calling_agent/test_validators.py
tests/rai_bench/manipulation_o3de/tasks/test_task.py
Updated tests for new interfaces: replaced dump() tests with info property checks, enhanced validator tests to assert detailed internal state and result dumps, and updated dummy task to use new properties and method names.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant StreamlitUI
    participant DataLoader
    participant DataProcessor
    participant Plotly

    User->>StreamlitUI: Select run/benchmark/model
    StreamlitUI->>DataLoader: Load results from disk
    DataLoader-->>StreamlitUI: Structured RunResults/BenchmarkResults
    StreamlitUI->>DataProcessor: Aggregate/analyze results
    DataProcessor-->>StreamlitUI: DataFrames/statistics
    StreamlitUI->>Plotly: Generate charts
    Plotly-->>StreamlitUI: Rendered plots
    StreamlitUI-->>User: Display tables, charts, analysis
sequenceDiagram
    participant Benchmark (BaseBenchmark)
    participant Task/Scenario
    participant Validator
    participant CSV
    participant Summary

    loop For each task/scenario
        Benchmark->>Task/Scenario: Run task/scenario
        Task/Scenario->>Validator: Validate output
        Validator-->>Task/Scenario: Validation results/errors
        Task/Scenario-->>Benchmark: Score, details
        Benchmark->>CSV: Save detailed result
    end
    Benchmark->>Summary: Compute and save summary stats
    Summary-->>CSV: Write summary CSV



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🔭 Outside diff range comments (1)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)

554-572: ⚠️ Potential issue

Guard against empty validators list to avoid ZeroDivisionError

len(self.validators) can legitimately be 0 (e.g. mis-configured task).
In that situation a division by zero will raise, crashing the whole benchmark run.

-        done_properly = 0
-        for validator in self.validators:
+        if not self.validators:
+            raise ValueError(
+                f"Task '{self.__class__.__name__}' has no validators configured."
+            )
+
+        done_properly = 0
+        for validator in self.validators:
             ...
-        return done_properly / len(self.validators)
+        return done_properly / len(self.validators)
🧹 Nitpick comments (16)
src/rai_bench/README.md (1)

185-210: Documentation update provides clear instructions for results visualization

The new "Results and Visualization" section clearly explains where benchmark results are stored and how to visualize them using Streamlit. However, there are a couple of minor grammar issues that should be fixed.

Consider these minor grammar improvements:

-If you run single benchmark test like:
+If you run a single benchmark test like:

-Results will be saved to dedicated directory named `<benchmark_name>`
+Results will be saved to a dedicated directory named `<benchmark_name>`
🧰 Tools
🪛 LanguageTool

[uncategorized] ~189-~189: You might be missing the article “a” here.
Context: ...](./rai_bench/experiments/) If you run single benchmark test like: When you run a tes...

(AI_EN_LECTOR_MISSING_DETERMINER_A)


[uncategorized] ~196-~196: You might be missing the article “a” here.
Context: ...>/main.py ``` Results will be saved to dedicated directory named <benchmark_name> Whe...

(AI_EN_LECTOR_MISSING_DETERMINER_A)

src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (2)

205-207: errors already contains lists – avoid an extra level of nesting

send_score accepts List[List[str]], while you currently pass a list of (possibly empty) lists collected from every sub-task.
If a validator has no errors you still insert [], inflating the payload and producing confusing semicolons in the comment string ("; ").

-        errors = [
-            s.errors
-            for validator_info in validation_info
-            for s in validator_info.subtasks
-        ]
+        errors = [
+            s.errors
+            for validator_info in validation_info
+            for s in validator_info.subtasks
+            if s.errors  # keep only non-empty error lists
+        ]

214-220: Name shadowing with builtin type may bite downstream

TaskResult exposes a field named type. Using the same identifier as the builtin can lead to accidental shadowing when you later access result.type inside comprehensions or debug statements.

Consider renaming the attribute to task_type (and applying the same alias in the Pydantic model) to keep the public surface unambiguous.

No diff provided because this change ripples through multiple files.
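
If the rename is adopted, a Pydantic alias can keep the persisted CSV header as "type" while the Python attribute becomes task_type. A minimal sketch, assuming Pydantic v2 and omitting the model's other fields:

```python
from pydantic import BaseModel, ConfigDict, Field


class TaskResult(BaseModel):
    # Accept both TaskResult(task_type=...) and TaskResult(type=...).
    model_config = ConfigDict(populate_by_name=True)

    task_type: str = Field(..., alias="type", description="Category of the task")


row = TaskResult(task_type="manipulation")
# Dumping by alias keeps the existing "type" column name in the CSV:
print(row.model_dump(by_alias=True))  # {'type': 'manipulation'}
```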

src/rai_bench/rai_bench/tool_calling_agent/scores_tracing.py (2)

42-65: Stronger validation for numerical fields

You document score as “value between 0 and 1” and extra_tool_calls_used as non-negative, but no constraints enforce this.
Pydantic has built-ins that can prevent illegal data early and surface bugs quicker:

-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, NonNegativeInt, confloat
-    extra_tool_calls_used: int = Field(
+    extra_tool_calls_used: NonNegativeInt = Field(
         ..., description="Total number of extra tool calls used in this Task"
     )
-    score: float = Field(
-        ...,
-        description="Value between 0 and 1, describing how many validation setps passed",
-    )
+    score: confloat(ge=0.0, le=1.0) = Field(
+        ..., description="Fraction of validation steps passed (0-1)."
+    )

The same applies to the summary model.

This small addition guards against negative counts or impossible scores leaking into persisted CSVs or analytics.


66-76: Field name type conflicts with the builtin

Like in TaskResult, using type as an attribute may lead to subtle bugs when serialising/deserialising or during introspection.
Renaming it to task_type (with an alias if you need the CSV header unchanged) will avoid accidental shadowing.

task_type: str = Field(..., alias="type", description="Category of the task")
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (2)

29-33: Superfluous super().__init__()?

SubTask currently has no explicit __init__; calling super().__init__() therefore just dispatches to object.__init__() and brings no value while slightly obscuring intent.

Unless you plan to add shared initialisation in the base class, consider removing these calls to keep the constructors lean.


45-51: info property is read-only – document it

Migrating from dump() to a read-only info property is a nice cleanup.
Add a docstring to the property so consumers know it is side-effect-free and cheap to call.

-    @property
-    def info(self) -> Dict[str, Any]:
+    @property
+    def info(self) -> Dict[str, Any]:
+        """Static description of the expected tool-call arguments."""
tests/rai_bench/tool_calling_agent/test_validators.py (2)

35-46: DummySubTask.validate type annotation is misleading

The annotation says tool_call: Dict[str, Any] but the tests pass a ToolCall instance that exposes .name.
Using the correct type improves static analysis and readability.

-    def validate(self, tool_call: Dict[str, Any]) -> bool:
+    def validate(self, tool_call: "ToolCall") -> bool:

(Quotes avoid a forward-reference import at the top of the test file.)


41-42: Unnecessary super().__init__() inside a plain class

DummySubTask does not inherit from SubTask; calling super().__init__() therefore does nothing.
Removing the call eliminates a minor source of confusion without affecting behaviour.

-        super().__init__()
src/rai_bench/rai_bench/tool_calling_agent/validators.py (1)

42-45: Avoid shadowing the built-in type identifier

Although legal, naming the public property type hides Python’s built-in type, which can lead to confusing stack-traces and makes IDE auto-completion noisier. Consider a more specific name, e.g. validator_type.

-    @property
-    def type(self) -> str:
-        return "ordered"
+    @property
+    def validator_type(self) -> str:
+        return "ordered"
src/rai_bench/rai_bench/results_processing/data_loading.py (2)

214-218: Simplify boolean return – remove double negative

check_if_summaries_present can be expressed more idiomatically and avoids the Ruff SIM103 warning:

-        if not self.runs or not any(run.benchmark_summaries for run in self.runs):
-            return False
-        return True
+        return bool(self.runs) and any(run.benchmark_summaries for run in self.runs)
🧰 Tools
🪛 Ruff (0.8.2)

215-217: Return the negated condition directly

Inline condition

(SIM103)


131-145: Streamlit dependency leaks into a pure data-loading layer

Functions load_detailed_data and load_summary_data call st.error, tightly coupling this utility module to Streamlit and making it unusable in non-UI contexts (e.g., CLI batch processing, tests).

Consider raising a custom exception or returning an Either-like result and leaving presentation-layer concerns (toast / log / st.error) to the caller.
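
One possible shape for that decoupling, as a sketch (the exception name is made up, and the real loader returns structured result objects rather than a bare DataFrame):

```python
import pandas as pd


class ResultsLoadingError(Exception):
    """Raised by the data-loading layer instead of calling st.error directly."""


def load_detailed_data(path: str) -> pd.DataFrame:
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.ParserError) as exc:
        raise ResultsLoadingError(f"Could not load results from {path}") from exc


# The Streamlit layer stays in charge of presentation:
#     try:
#         df = load_detailed_data(path)
#     except ResultsLoadingError as exc:
#         st.error(str(exc))
```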

src/rai_bench/rai_bench/results_processing/data_processing.py (1)

336-347: Unused loop key – address Ruff B007 and clarify intent

Variable key isn’t used inside the loop and may confuse readers.
If the key is only needed for iteration you can replace it with _; otherwise include it in the DataFrame row (e.g. as subtask_id) for traceability.

-for key, stats in subtask_stats.items():
+for _, stats in subtask_stats.items():

or

for key, stats in subtask_stats.items():
     subtask_rows.append(
         {
+            "subtask_id": key,
🧰 Tools
🪛 Ruff (0.8.2)

336-336: Loop control variable key not used within loop body

(B007)

src/rai_bench/rai_bench/results_processing/visualise.py (3)

41-44: Avoid hard-coded experiment directory

Using a relative, hard-coded path makes the app brittle when the package is installed
elsewhere or executed from a different working directory.
Expose it via CLI flag, environment variable, or Streamlit sidebar.

-EXPERIMENT_DIR = "./src/rai_bench/rai_bench/experiments"
+EXPERIMENT_DIR = os.getenv(
+    "RAI_BENCH_EXPERIMENT_DIR",
+    os.path.join(os.getcwd(), "experiments"),  # sensible default
+)

42-44: Remove unused constants

DETAILED_FILE_NAME and SUMMARY_FILE_NAME are defined but never referenced in this
module, which is a small smell and may confuse future readers.


603-628: Duplicate sub-task rendering code – use the helper instead

The block repeats the logic already encapsulated in display_subtask_analysis().
Keeping two copies risks divergence and increases maintenance cost.

-    # Subtasks table
-    ...
-    st.dataframe(error_df, use_container_width=True)  # type:ignore
-
-    # Additional analysis: group errors by tool name
-    if not error_df.empty:
-        ...
-        st.dataframe(error_by_tool_df, use_container_width=True)  # type:ignore
+    # Re-use the common helper to avoid duplication
+    display_subtask_analysis(subtask_df, tool_name_df, error_df)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b62fee1 and 3d74cf8.

⛔ Files ignored due to path filters (1)
  • src/rai_bench/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • src/rai_bench/README.md (1 hunks)
  • src/rai_bench/pyproject.toml (1 hunks)
  • src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (4 hunks)
  • src/rai_bench/rai_bench/examples/test_models.py (1 hunks)
  • src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (4 hunks)
  • src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (2 hunks)
  • src/rai_bench/rai_bench/results_processing/__init__.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/data_loading.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/data_processing.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/visualise.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (3 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/scores_tracing.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (2 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (8 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/validators.py (3 hunks)
  • tests/rai_bench/tool_calling_agent/test_subtasks.py (1 hunks)
  • tests/rai_bench/tool_calling_agent/test_validators.py (14 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (9)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (1)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
  • Task (447-571)
tests/rai_bench/tool_calling_agent/test_subtasks.py (3)
tests/rai_bench/tool_calling_agent/test_validators.py (1)
  • info (68-69)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
  • info (56-58)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
  • info (46-51)
  • info (85-91)
  • info (125-131)
  • info (165-171)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (1)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (2)
  • parse_args (27-50)
  • run_benchmark (53-90)
src/rai_bench/rai_bench/examples/test_models.py (2)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (1)
  • run_benchmark (53-90)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (1)
  • run_benchmark (68-210)
tests/rai_bench/tool_calling_agent/test_validators.py (3)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (7)
  • SubTaskValidationError (28-29)
  • Validator (353-444)
  • validate (51-52)
  • validate (443-444)
  • validate (554-571)
  • info (56-58)
  • dump_results (421-440)
src/rai_bench/rai_bench/tool_calling_agent/validators.py (4)
  • validate (46-75)
  • validate (95-144)
  • OrderedCallsValidator (29-75)
  • NotOrderedCallsValidator (78-144)
tests/rai_bench/tool_calling_agent/test_subtasks.py (2)
  • validate (37-38)
  • info (44-45)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (3)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
  • info (44-45)
tests/rai_bench/tool_calling_agent/test_validators.py (1)
  • info (68-69)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
  • info (56-58)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (4)
src/rai_bench/rai_bench/tool_calling_agent/scores_tracing.py (2)
  • SubTaskResult (29-32)
  • ValidatorResult (35-39)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
  • info (44-45)
tests/rai_bench/tool_calling_agent/test_validators.py (1)
  • info (68-69)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
  • info (46-51)
  • info (85-91)
  • info (125-131)
  • info (165-171)
src/rai_bench/rai_bench/tool_calling_agent/validators.py (4)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
  • validate (37-38)
tests/rai_bench/tool_calling_agent/test_validators.py (2)
  • validate (47-65)
  • ToolCall (28-31)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (7)
  • validate (51-52)
  • validate (443-444)
  • validate (554-571)
  • SubTaskValidationError (28-29)
  • add_subtask_errors (403-409)
  • required_calls (400-401)
  • required_calls (525-530)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
  • validate (34-43)
  • validate (68-82)
  • validate (108-122)
  • validate (148-162)
src/rai_bench/rai_bench/results_processing/visualise.py (2)
src/rai_bench/rai_bench/results_processing/data_loading.py (4)
  • ModelResults (207-258)
  • RunResults (273-282)
  • get_available_runs (177-195)
  • load_benchmark_results (321-380)
src/rai_bench/rai_bench/results_processing/data_processing.py (7)
  • analyze_subtasks (274-392)
  • create_extra_calls_dataframe (57-106)
  • create_model_summary_dataframe (28-54)
  • create_task_details_dataframe (172-213)
  • create_task_metrics_dataframe (118-169)
  • get_available_extra_tool_calls (250-271)
  • get_unique_values_from_results (216-247)

🔇 Additional comments (32)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

498-502: Consistent Task Typing with type Property
Adding the type property aligns NavigationTask with other task classes, enabling uniform task identification across validation, reporting, and visualization pipelines. This supports filtering and grouping by task type in downstream modules.

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1)

47-50: Standardizing Task Metadata with type Property
The new type property ensures BasicTask is identified as "basic" in result structures, maintaining consistency across the tool calling agent framework.

src/rai_bench/rai_bench/results_processing/__init__.py (1)

1-14: Package Marker File Added
The __init__.py file correctly marks results_processing as a Python package, enabling imports of the new data loading, processing, and visualization modules. No further action needed here.

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)

48-52: Adding type Property to ManipulationTask
Introducing the type property standardizes task categorization as "manipulation", aligning with other task classes and facilitating consistent handling in the benchmarking and visualization modules.

src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1)

55-57: Implementation of type property aligns with task identification standards

The addition of the read-only type property returns a consistent string identifier for this task class, which aligns with the abstract property requirement from the base Task class and standardizes task type identification across the benchmark framework.

src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (3)

837-841: Good abstraction: new base class for custom interface tasks

Creating this abstract base class improves the inheritance hierarchy by providing a common implementation of the type property for all custom interface tasks. This reduces code duplication and standardizes type identification.


843-843: Class inheritance updated correctly

The CustomInterfacesTopicTask now inherits from the new CustomInterfaceTask abstract base class, ensuring it receives the standardized type property implementation.


876-876: Class inheritance updated correctly

The CustomInterfacesServiceTask now inherits from the new CustomInterfaceTask abstract base class, ensuring it receives the standardized type property implementation.

tests/rai_bench/tool_calling_agent/test_subtasks.py (1)

43-45: Test fixture updated to match implementation changes

Adding the info property to the ConcreteSubTask test class aligns with the framework's refactoring from dump() methods to info properties. This ensures the test fixtures correctly mirror the actual implementation patterns.

src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (5)

58-64: Good addition of configurable output directory parameter.

The addition of a timestamped default output directory improves results organization and makes the benchmark more flexible.


68-68: Function signature updated to support output directory parameter.

The run_benchmark function signature is correctly updated to accept the new out_dir parameter, which aligns with similar changes in other benchmark modules.


102-103: Properly creating output directory structure.

The code correctly creates the output directory and ensures parent directories exist, which is a good practice for file operations.


185-185: Results file path constructed from output directory.

The results file path is properly derived from the output directory parameter, ensuring consistent file organization.


215-215: Correctly passing output directory parameter to benchmark function.

The out_dir parameter is properly passed from the command line arguments to the benchmark function, completing the implementation.

src/rai_bench/rai_bench/examples/test_models.py (5)

14-14: Added datetime import for timestamp generation.

The datetime module is correctly imported to support generation of timestamped output directories.


22-23: Modified benchmark configuration lists.

The benchmark list is now focused on "manipulation_o3de" and a new extra_tool_calls list has been added with a value of 5. This change supports the enhanced parameterized benchmark runs.


26-29: Implemented timestamped base output directory.

Creating a base directory with timestamp ensures unique and organized storage of benchmark results for each run.


36-45: Enhanced directory structure creation for benchmark runs.

The nested loops now properly handle multiple benchmark configurations by creating a structured directory hierarchy based on benchmark type, model name, and extra tool calls. This organization will make result analysis more manageable.


46-65: Added robust error handling for benchmark execution.

The try-except block around benchmark runs is a good addition that prevents a single failed benchmark from stopping the entire test suite. The error message includes detailed information about which benchmark failed and why.

src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (6)

23-23: Switched to dynamic task generation.

Changed from importing static all_tasks to the dynamic get_all_tasks function, which supports parameterized task creation.


33-35: Made model-name and vendor parameters required.

Correctly enforced required parameters for benchmark execution, which prevents errors from missing critical configuration.


37-48: Added configurable extra-tool-calls and output directory parameters.

These new parameters enhance benchmark flexibility:

  • extra-tool-calls allows configuring tolerance for additional tool calls
  • out_dir allows specifying where results are stored

The default output directory uses a timestamp format consistent with other benchmark scripts.


53-53: Updated function signature to support new parameters.

Properly updated the run_benchmark function to accept the new out_dir and extra_tool_calls parameters, enabling the new functionality.


74-74: Implemented dynamic task generation with extra tool calls.

Now correctly uses the get_all_tasks function with the specified extra_tool_calls parameter, enabling dynamic task generation based on the configuration.


95-100: Updated main execution to pass new parameters.

The main execution block properly passes all parameters to the run_benchmark function, completing the implementation of the enhanced benchmark configuration.

src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (7)

16-16: Added Sequence to imports.

Added Sequence import to support return type annotation in the get_spatial_tasks function.


295-330: Refactored basic tasks into dynamic generation function.

Converted static basic task list into a function that generates tasks with configurable extra_tool_calls parameter. This enables more flexible benchmarking with varying tolerance levels.


333-351: Refactored navigation tasks into dynamic generation function.

Converted static navigation task list into a function that generates tasks with configurable extra_tool_calls parameter. Consistently applies the same pattern as other task types.


354-366: Refactored manipulation tasks into dynamic generation function.

Converted static manipulation task list into a function that generates tasks with configurable extra_tool_calls parameter, maintaining consistency with the other task type refactorings.


369-383: Refactored custom interface tasks into dynamic generation function.

Converted static custom interface task list into a function that generates tasks with configurable extra_tool_calls parameter, continuing the pattern established for other task types.


386-403: Implemented spatial tasks dynamic generation function.

Created a function to dynamically generate both true and false spatial tasks with configurable extra_tool_calls parameter. Note the return type is Sequence[Task] rather than List[Task] to accommodate the combined list return.


406-415: Implemented unified task generation function.

Created a comprehensive get_all_tasks function that:

  1. Collects tasks from all specialized task generators
  2. Passes the extra_tool_calls parameter to each
  3. Shuffles the tasks for randomized execution
  4. Returns the combined task list

This enables flexible benchmark configuration through a single parameter.

@jmatejcz jmatejcz force-pushed the jm/feat/rai_bench_results_gathering_and_visualization branch 4 times, most recently from a02955e to c7d7edc Compare April 30, 2025 10:35
@jmatejcz jmatejcz force-pushed the jm/feat/rai_bench_results_gathering_and_visualization branch from c7d7edc to 4f10bae Compare April 30, 2025 11:12
@jmatejcz
Contributor Author

@CodeRabbit full review


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🔭 Outside diff range comments (3)
src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (1)

295-416: 🛠️ Refactor suggestion

Good refactoring to dynamic task generation with configurable tool calls.

Replacing static task lists with dynamic task-generating functions provides several benefits:

  1. Runtime configurability of extra_tool_calls parameter
  2. Consistent parameter application across all task types
  3. More flexible benchmark configuration
  4. Better code organization with clear function boundaries

This change aligns well with the PR's goal of standardizing benchmark handling and supports the new structured results tracking.

However, having both the static task lists (lines 233-292) and the new dynamic functions creates duplication that could lead to confusion.

Consider removing the static task lists (basic_tasks, navigation_tasks, etc.) as they're now replaced by the dynamic functions. If they're kept for backward compatibility, add a deprecation comment to make it clear they'll be removed in the future.
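
If the static lists are kept for now, the deprecation can be made machine-visible as well as human-visible; a sketch (the removal note and warning helper are illustrative, not part of this PR):

```python
import warnings

# DEPRECATED: superseded by the task-factory functions that accept
# extra_tool_calls. Kept temporarily for backward compatibility.
basic_tasks = [
    # ... existing static task definitions stay unchanged ...
]


def warn_deprecated_list(name: str) -> None:
    """Optionally emit a DeprecationWarning wherever a legacy list is consumed."""
    warnings.warn(
        f"{name} is deprecated; use the task-factory functions with "
        "extra_tool_calls instead.",
        DeprecationWarning,
        stacklevel=2,
    )
```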

src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (1)

29-33: ⚠️ Potential issue

Property vs. method – interface mismatch with SubTask.info()

SubTask.info is still defined as a method (see interfaces.py) but subclasses were converted to a property.
Call-sites written against the older contract will now raise TypeError: 'dict' object is not callable.

Two options:

  1. Keep the property but update the base class + all usages:
-class SubTask(ABC):
-    def info(self) -> Dict[str, Any]:
-        ...
+class SubTask(ABC):
+    @property
+    @abstractmethod
+    def info(self) -> Dict[str, Any]:
+        ...
  2. Revert these changes and keep def info(self) -> Dict[str, Any]:.

Pick one style and apply it consistently across the codebase and tests.

Also applies to: 45-51

src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)

114-117: ⚠️ Potential issue

Using HumanMultimodalMessage for non-multimodal tasks may break some model back-ends

When the task is not a SpatialReasoningAgentTask, the code still wraps the prompt in HumanMultimodalMessage (without images).
Some model routes expect plain HumanMessage and will reject multimodal payloads.

Consider falling back to HumanMessage for non-image tasks:

-                for state in agent.stream(
-                    {"messages": [HumanMultimodalMessage(content=task.get_prompt())]},
+                for state in agent.stream(
+                    {"messages": [HumanMessage(content=task.get_prompt())]},

(or add images=[] if the backend tolerates it).

♻️ Duplicate comments (6)
src/rai_bench/pyproject.toml (1)

14-14: ⚠️ Potential issue

Missing Streamlit Dependency

The visualization script (visualise_results.py) introduced in this PR relies on Streamlit, but pyproject.toml only adds plotly. You must add streamlit (and any other new visualization dependencies) under [tool.poetry.dependencies] to avoid runtime import errors during result visualization.

src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (2)

84-91: Same issue as above for CheckTopicFieldsToolCallSubTask

Fix will be covered once SubTask contract is clarified.


124-131: Same issue as above for CheckServiceFieldsToolCallSubTask & CheckActionFieldsToolCallSubTask

See earlier comment.

src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)

63-65: Ensure results_dir exists before CSV initialization

If the directory doesn’t exist, csv_initialize will fail here. Creating it in BaseBenchmark (see earlier suggestion) solves the problem globally.

src/rai_bench/rai_bench/tool_calling_agent/validators.py (1)

106-112: ⚠️ Potential issue

Possible negative extra_calls_used in early-exit path

If the agent finishes all subtasks exactly on the last allowed tool call, i - self.required_calls yields -1.

-                self.extra_calls_used = i - self.required_calls
+                self.extra_calls_used = max(0, i - self.required_calls)

This mirrors the fix already applied in OrderedCallsValidator.

src/rai_bench/rai_bench/results_processing/visualise.py (1)

74-76: ⚠️ Potential issue

Bar width still shrinks for small bar counts (previous feedback not applied)

The scaling formula here makes bars narrower when the number of bars is below the max_full_width_bars threshold (e.g., 3 bars → width = 0.24).
In the earlier review we suggested the inverse logic so that full width is kept for small n and only shrinks when n exceeds the threshold.

-    scale = min(n_bars, max_full_width_bars) / max_full_width_bars
-    width = base_width * scale
+    # Keep full width when bar count is small; scale down only when crowded
+    scale = 1.0 if n_bars <= max_full_width_bars else max_full_width_bars / n_bars
+    width = base_width * scale

This provides a consistent, more readable chart for short bar series.

🧹 Nitpick comments (21)
src/rai_bench/rai_bench/examples/test_models.py (1)

66-66: Redundant continue statement

The continue statement at line 66 is unnecessary as it's at the end of the loop.

-                        continue
src/rai_bench/rai_bench/base_benchmark.py (3)

67-83: csv_initialize overwrites the file each time – is that intentional?

Every call opens the file with mode "w" which truncates any earlier summary.
If you intend to keep historical summaries (e.g. one row per incremental run), switch to append mode when the header already exists.

-        with open(filename, mode="w", newline="", encoding="utf-8") as file:
+        mode = "w" if not filename.exists() else "a"
+        with open(filename, mode=mode, newline="", encoding="utf-8") as file:

If truncating is deliberate, please leave a comment explaining the rationale so future readers don’t treat this as a bug.


103-107: Prefer the model class for headers

base_model_instance.__annotations__ relies on attribute look-up falling back to the class.
Being explicit is clearer and side-steps edge cases (e.g. a field called __annotations__ on the instance).

-            writer = csv.DictWriter(
-                file, fieldnames=base_model_instance.__annotations__.keys()
-            )
+            writer = csv.DictWriter(
+                file, fieldnames=type(base_model_instance).__annotations__.keys()
+            )

125-125: Open TODO or remove it

The comment "# TODO (jm) this can be probably same for all benchmark in the future" does not have an accompanying issue reference.
Either open a GitHub issue and link it here or replace the comment with a concrete plan; otherwise it will be forgotten.

src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (2)

58-65: Duplicate attribute – self.tasks_results is unused

self.task_results is declared on line 60 and actively used later.
self.tasks_results (plural) on line 63 is never referenced and risks confusion.

-        self.tasks_results: List[TaskResult] = []

Delete the stray attribute.


172-179: Guard against division / mean on empty list

Although currently called after at least one task, future refactors (early summary, test harness, etc.) could invoke compute_and_save_summary before any results exist, causing ZeroDivisionError or StatisticsError.

-        success_rate = success_count / len(self.task_results) * 100
-        avg_time = statistics.mean(r.total_time for r in self.task_results)
+        total = len(self.task_results) or 1  # prevent div/0
+        success_rate = success_count / total * 100
+        avg_time = (
+            statistics.mean(r.total_time for r in self.task_results)
+            if self.task_results
+            else 0.0
+        )
src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (2)

235-241: Exact float comparison can under-count successes

success_count = sum(1 for r in self.scenario_results if r.score == 1.0)

calculate_score() is likely to return a float. Relying on strict equality may miss values such as 0.999999 that should semantically map to success.

-success_count = sum(1 for r in self.scenario_results if r.score == 1.0)
+EPS = 1e-6
+success_count = sum(1 for r in self.scenario_results if r.score >= 1.0 - EPS)

At minimum, use math.isclose() with an absolute/relative tolerance or compare against a threshold.


166-204: last_msg is built but never used – remove or log it consistently

last_msg is assigned inside the streaming loop, yet its only usage is inside a debug line under the same branch.
If the value is useful for downstream analysis, consider logging it in a structured way; otherwise drop the variable to avoid dead code and noisy conditionals.

tests/rai_bench/tool_calling_agent/test_validators.py (1)

72-95: assert_dumped helper: valuable but missing negative-case coverage

The helper validates many fields but silently returns the result.
Adding assertions for unexpected keys (e.g. that no unknown fields appear) or validating dataclass types (if applicable) would catch accidental schema regressions early.
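
A small addition along those lines, assuming the dumped object is a Pydantic ValidatorResult (the expected field set below is illustrative, not the actual schema):

```python
EXPECTED_FIELDS = {"type", "subtasks", "extra_tool_calls_used", "passed"}  # illustrative


def assert_no_unexpected_fields(result) -> None:
    """Fail the test if the dumped result ever grows fields the schema doesn't expect."""
    dumped = result.model_dump()
    unexpected = set(dumped) - EXPECTED_FIELDS
    assert not unexpected, f"Unexpected fields in ValidatorResult: {unexpected}"
```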

src/rai_bench/rai_bench/results_processing/data_processing.py (2)

136-153: getattr(result, group_by) may raise – guard for invalid group_by

create_task_metrics_dataframe trusts the caller to pass a valid attribute name.
A typo will raise AttributeError and break the whole visualisation pipeline.

Consider:

if not hasattr(TaskResult, group_by):
    raise ValueError(f"Invalid group_by field '{group_by}'")

or fallback to None and skip such rows.


300-334: JSON key built but never used – unnecessary overhead

key = json.dumps(args, sort_keys=True) is computed, yet only key not in subtask_stats is evaluated; the string is never logged or returned.
Storing the raw dict (with frozenset(args.items()) as key) would avoid serialisation overhead and keep look-ups fast.
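
A sketch of that variant, using a sorted tuple of items as the key (equivalent to the frozenset idea; it still assumes the argument values are hashable, otherwise json.dumps remains the safer choice):

```python
from collections import defaultdict
from typing import Any, Dict, Tuple

# Aggregate pass/total counts per unique argument combination without JSON round-trips.
subtask_stats: Dict[Tuple[Tuple[str, Any], ...], Dict[str, int]] = defaultdict(
    lambda: {"total": 0, "passed": 0}
)


def record_subtask(args: Dict[str, Any], passed: bool) -> None:
    key = tuple(sorted(args.items()))  # deterministic and hashable
    subtask_stats[key]["total"] += 1
    subtask_stats[key]["passed"] += int(passed)
```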

src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5)

54-58: Doc-string still references removed dump() method

The interface now exposes an info property instead of a dump() method, but the class-level doc-string (lines 42-48) still instructs implementers to override dump(). This is confusing for new contributors and tooling that relies on doc-strings.

-        dump()
-            Abstract method that subclasses must implement to serialize the subtask configuration.
+        info (property)
+            Returns a serialisable dictionary describing the sub-task’s expected arguments.

394-398: Avoid using the built-in name type for public API

Both Validator and Task expose a type property. Shadowing the built-in type can be surprising in debugging sessions and hampers IDE code-completion. Consider renaming to something more explicit, e.g. validator_type / task_type.

-    @property
-    @abstractmethod
-    def type(self) -> str:
+    @property
+    @abstractmethod
+    def validator_type(self) -> str:

(Apply the same rename in subclasses and everywhere the property is consumed.)


411-419: Minor grammar & clarity in reset() doc-string

“resets all values refering previous validation” → “Reset all state from the previous validation run.”

(Only the comment needs updating; code logic is fine.)


421-450: dump_results() returns mutable objects that can be tampered with by callers

dump_results() hands out the live ValidatorResult object which in turn contains the same SubTaskResult instances held internally. External mutation after dumping will silently alter the validator’s history.

If immutability is important consider returning a deep copy:

-        result = ValidatorResult(
+        result = ValidatorResult(
             ...
         )
-        return result
+        # prevent external mutation
+        return result.model_copy(deep=True)

515-517: Method name suggests serialisation but only aggregates

dump_validators() does not serialise data, it merely aggregates ValidatorResult objects. A clearer name such as collect_validator_results() would make call-sites self-explanatory.

src/rai_bench/rai_bench/results_processing/data_loading.py (2)

271-289: Non-deterministic ordering of runs

os.listdir() returns entries in arbitrary order which causes run ordering (and therefore visualisation order) to vary across platforms and executions. A simple sorted() ensures deterministic output:

-    for d in os.listdir(experiment_dir)
+    for d in sorted(os.listdir(experiment_dir))

295-324: Silent discard of malformed UUIDs

uuid.UUID(row["run_id"]) will raise ValueError and abort the whole load if a single row contains an invalid UUID. Consider wrapping this in try/except so that one bad row does not kill the entire dataset load, possibly logging a warning via Streamlit instead.
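
A tolerant variant might look like this (sketch; the run_id column name comes from the snippet above, and the logger is just one way to surface the warning):

```python
import logging
import uuid
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


def parse_run_id(row: Dict[str, Any]) -> Optional[uuid.UUID]:
    """Return the row's run_id as a UUID, or None (with a warning) if malformed."""
    try:
        return uuid.UUID(row["run_id"])
    except (KeyError, ValueError) as exc:
        logger.warning("Skipping row with invalid run_id %r: %s", row.get("run_id"), exc)
        return None
```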

src/rai_bench/rai_bench/tool_calling_agent/validators.py (2)

31-33: Typo in class doc-string

subtakssubtasks.


58-63: Iterator exhaustion can lose the last error

When the final subtask fails on the last tool call, the StopIteration branch is never reached and the surrounding for loop exits normally, falling through to the generic “Validation failed” block. Variable u still refers to the last subtask index, so this is fine; nonetheless consider explicitly logging the collected self.subtasks_errors[u] here to make debugging easier.

src/rai_bench/rai_bench/results_processing/visualise.py (1)

149-154: Unused accumulator introduces dead code

model_counts is populated but never read afterwards. Removing it (or using it) will simplify the function and avoid confusion.

-    # Dictionary to store aggregated metrics for each model
-    model_metrics: Dict[str, Dict[str, Any]] = {}
-
-    # Track total counts for averaging
-    model_counts: Dict[str, int] = {}
+    # Dictionary to store aggregated metrics for each model
+    model_metrics: Dict[str, Dict[str, Any]] = {}

And delete the later model_counts[model_name] += … line.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 73785b9 and 4f10bae.

⛔ Files ignored due to path filters (1)
  • src/rai_bench/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
  • src/rai_bench/README.md (1 hunks)
  • src/rai_bench/pyproject.toml (1 hunks)
  • src/rai_bench/rai_bench/base_benchmark.py (1 hunks)
  • src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (5 hunks)
  • src/rai_bench/rai_bench/examples/manipulation_o3de/scenarios.py (14 hunks)
  • src/rai_bench/rai_bench/examples/test_models.py (2 hunks)
  • src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (4 hunks)
  • src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (2 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (5 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (5 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/results_tracking.py (1 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1 hunks)
  • src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/__init__.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/data_loading.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/data_processing.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/visualise.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (3 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (2 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (8 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/validators.py (3 hunks)
  • tests/rai_bench/manipulation_o3de/tasks/test_task.py (1 hunks)
  • tests/rai_bench/tool_calling_agent/test_subtasks.py (1 hunks)
  • tests/rai_bench/tool_calling_agent/test_validators.py (14 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (14)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
  • task_prompt (79-81)
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
  • task_prompt (64-68)
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
  • task_prompt (41-45)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
  • task_prompt (57-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
  • task_prompt (49-50)
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
  • task_prompt (53-59)
tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
  • task_prompt (26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
  • task_prompt (51-62)
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
  • task_prompt (64-68)
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
  • task_prompt (41-45)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
  • task_prompt (57-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
  • task_prompt (49-50)
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
  • task_prompt (53-59)
tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
  • task_prompt (26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
  • task_prompt (51-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
  • task_prompt (79-81)
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
  • task_prompt (64-68)
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
  • task_prompt (41-45)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
  • task_prompt (49-50)
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
  • task_prompt (53-59)
tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
  • task_prompt (26-27)
src/rai_bench/rai_bench/examples/test_models.py (2)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (1)
  • run_benchmark (64-197)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (1)
  • run_benchmark (53-92)
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
  • task_prompt (51-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
  • task_prompt (79-81)
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
  • task_prompt (64-68)
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
  • task_prompt (41-45)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
  • task_prompt (57-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
  • task_prompt (49-50)
tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
  • task_prompt (26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
  • task_prompt (51-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
  • task_prompt (79-81)
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
  • task_prompt (64-68)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
  • task_prompt (57-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
  • task_prompt (49-50)
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
  • task_prompt (53-59)
tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
  • task_prompt (26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
  • task_prompt (51-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
  • task_prompt (79-81)
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
  • task_prompt (64-68)
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
  • task_prompt (41-45)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
  • task_prompt (57-62)
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
  • task_prompt (53-59)
tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
  • task_prompt (26-27)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (4)
src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (1)
  • get_all_tasks (406-415)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (2)
  • parse_args (44-61)
  • run_benchmark (64-197)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
  • set_logger (492-495)
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (2)
  • ToolCallingAgentBenchmark (43-193)
  • run_next (66-170)
tests/rai_bench/tool_calling_agent/test_subtasks.py (3)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
  • info (56-58)
tests/rai_bench/tool_calling_agent/test_validators.py (1)
  • info (68-69)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
  • info (46-51)
  • info (85-91)
  • info (125-131)
  • info (165-171)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (4)
src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (1)
  • ManipulationO3DEBenchmark (76-265)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (2)
  • parse_args (27-50)
  • run_benchmark (53-92)
src/rai_core/rai/agents/conversational_agent.py (2)
  • agent (34-46)
  • create_conversational_agent (49-81)
src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (2)
  • system_prompt (59-60)
  • system_prompt (381-389)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (3)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
  • info (56-58)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
  • info (44-45)
tests/rai_bench/tool_calling_agent/test_validators.py (1)
  • info (68-69)
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (5)
src/rai_bench/rai_bench/base_benchmark.py (6)
  • BaseBenchmark (36-125)
  • BenchmarkSummary (24-33)
  • csv_initialize (68-82)
  • run_next (110-118)
  • csv_writerow (85-107)
  • compute_and_save_summary (121-123)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (9)
  • Task (457-581)
  • info (56-58)
  • get_prompt (554-562)
  • get_tool_calls_from_messages (505-513)
  • validate (51-52)
  • validate (453-454)
  • validate (564-581)
  • dump_validators (515-516)
  • get_system_prompt (543-551)
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (4)
  • ScoreTracingHandler (67-108)
  • TaskResult (42-64)
  • get_callbacks (75-76)
  • send_score (79-108)
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (2)
  • SpatialReasoningAgentTask (37-71)
  • get_system_prompt (70-71)
src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (2)
  • run_next (149-229)
  • compute_and_save_summary (231-265)
src/rai_bench/rai_bench/tool_calling_agent/validators.py (4)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (8)
  • validate (51-52)
  • validate (453-454)
  • validate (564-581)
  • reset (411-419)
  • SubTaskValidationError (28-29)
  • add_subtask_errors (403-409)
  • required_calls (400-401)
  • required_calls (535-540)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
  • validate (37-38)
tests/rai_bench/tool_calling_agent/test_validators.py (2)
  • validate (47-65)
  • ToolCall (28-31)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
  • validate (34-43)
  • validate (68-82)
  • validate (108-122)
  • validate (148-162)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (4)
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (2)
  • SubTaskResult (29-32)
  • ValidatorResult (35-39)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
  • info (44-45)
tests/rai_bench/tool_calling_agent/test_validators.py (1)
  • info (68-69)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
  • info (46-51)
  • info (85-91)
  • info (125-131)
  • info (165-171)
🔇 Additional comments (63)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (3)

832-836: Task hierarchy improvement

Good addition of the CustomInterfaceTask base class with a type property. This creates a consistent approach to task type identification across the benchmarking framework.


838-838: Updated inheritance hierarchy

The refactoring to use the new CustomInterfaceTask base class ensures consistent type identification and aligns with similar changes in other task modules.


871-871: Updated inheritance hierarchy

The refactoring to use the new CustomInterfaceTask base class ensures consistent type identification and aligns with similar changes in other task modules.

src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

499-502: Task type property implementation

Good addition of the type property to the NavigationTask class. This standardized identification mechanism helps with task filtering, result tracking, and visualization as part of the broader benchmark infrastructure overhaul.
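
For illustration, a minimal sketch of the type-property pattern is shown below; the Task base class and the returned label are placeholders rather than the PR's actual definitions.

from abc import ABC, abstractmethod


class Task(ABC):
    """Stand-in for the benchmark's task interface."""

    @property
    @abstractmethod
    def type(self) -> str:
        """Category label used for filtering and result grouping."""


class NavigationTask(Task):
    @property
    def type(self) -> str:
        return "navigation"


print(NavigationTask().type)  # -> navigation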

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1)

47-50: Task type property implementation

Good addition of the type property to the BasicTask class. This standardized identification mechanism helps with task filtering, result tracking, and visualization as part of the broader benchmark infrastructure overhaul.

src/rai_bench/rai_bench/results_processing/__init__.py (1)

1-13: LGTM: New package marker for results processing

This file properly marks the new results_processing directory as a Python package. The Apache 2.0 license header is correctly included.

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)

49-52: LGTM: Consistent task type identification

The addition of the type property aligns with the standardization of task type identification across the codebase, making it easier to categorize and filter tasks by their type.

src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1)

55-58: LGTM: Consistent task type identification

The addition of the type property maintains consistency with other task classes, facilitating task classification and selection in the benchmark framework.

src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)

48-50: LGTM: Method converted to property for consistent interface

Converting get_prompt() to the task_prompt property aligns with the interface standardization across task classes. This makes the API more consistent and follows Python conventions for read-only properties.
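
A before/after sketch of the refactor, with a simplified prompt string standing in for the real one:

class PlaceCubesTaskSketch:
    # Before the refactor, callers invoked a method:
    def get_prompt(self) -> str:
        return "Place the cubes adjacent to each other."

    # After the refactor, the same text is exposed as a read-only property:
    @property
    def task_prompt(self) -> str:
        return "Place the cubes adjacent to each other."


task = PlaceCubesTaskSketch()
print(task.task_prompt)  # accessed without parentheses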

src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)

56-62: Method converted to property to align with interface changes.

The get_prompt method has been refactored into a task_prompt property, maintaining the same functionality. This change aligns with the updated interface in ManipulationTask where prompt access is standardized across all tasks.

src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)

50-62: Method converted to property to align with interface changes.

The get_prompt method has been properly converted to a task_prompt property while preserving the existing functionality. This refactoring provides a consistent interface across all task classes.

src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)

78-81: Method converted to property to align with interface changes.

The get_prompt method has been correctly refactored into a task_prompt property while maintaining the same functionality. This aligns with the standardized prompt access pattern implemented across all task classes.

src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)

52-59: Method converted to property to align with interface changes.

The get_prompt method has been appropriately refactored into a task_prompt property. The implementation logic remains unchanged, continuing to generate a descriptive prompt about rotating objects to a target quaternion orientation.

src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)

40-41: Method refactored to property for consistent interface

The get_prompt method has been properly converted to a read-only property task_prompt, maintaining the same functionality while aligning with the refactored interface used throughout the codebase. This change improves consistency with other task classes.

src/rai_bench/rai_bench/examples/manipulation_o3de/scenarios.py (16)

21-21: Import updated to use refactored class name

The import statement has been updated to use ManipulationO3DEBenchmark instead of Benchmark, reflecting the class renaming refactoring that's part of the broader benchmark restructuring.


89-93: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


102-106: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


171-176: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


191-195: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


199-203: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


287-292: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


296-301: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


313-317: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


333-337: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


421-425: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


429-433: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


445-450: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


467-471: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


537-542: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.


558-562: Method call updated to use refactored class name

The call to create_scenarios has been correctly updated to use the refactored class name.

src/rai_bench/README.md (1)

185-210: Well-documented results storage and visualization section

The new documentation section clearly explains the results storage structure and provides helpful instructions for visualizing benchmark results using Streamlit. This addition aligns well with the new benchmark results handling functionality.

tests/rai_bench/tool_calling_agent/test_subtasks.py (1)

43-45: Added info property to align with interface changes

The implementation of the info property in the mock subtask correctly aligns with the interface changes in the SubTask class, where dump() has been replaced with an info property. This follows good practice by properly mocking the complete interface.

tests/rai_bench/manipulation_o3de/tasks/test_task.py (2)

25-31: Interface design updated to use properties instead of methods

The task interface has been refactored to use properties for task_prompt and system_prompt instead of methods. This change improves readability and aligns with the interface updates in the main benchmarking infrastructure.


36-36: Method renamed for clarity and consistency

Renaming calculate_result to calculate_score improves clarity and maintains consistency with the naming conventions used in other benchmark components.

src/rai_bench/rai_bench/manipulation_o3de/results_tracking.py (1)

18-33: Well-structured result model using Pydantic

The ScenarioResult class provides a clean, type-safe way to track benchmark results. The use of Pydantic with descriptive field metadata ensures data integrity and documentation.

This model aligns with the broader effort to standardize result tracking across benchmarks, enabling consistent data processing and visualization in the new pipeline.
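
As a rough sketch of what such a model can look like (field names beyond score, total_time, number_of_tool_calls, and model_name, which the review references elsewhere, are assumptions):

from pydantic import BaseModel, Field


class ScenarioResultSketch(BaseModel):
    task_prompt: str = Field(..., description="Prompt given to the agent")
    model_name: str = Field(..., description="Name of the LLM under test")
    score: float = Field(..., ge=0.0, le=1.0, description="Normalized scenario score")
    total_time: float = Field(..., description="Wall-clock time in seconds")
    number_of_tool_calls: int = Field(..., description="Tool calls made by the agent")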

src/rai_bench/rai_bench/examples/test_models.py (3)

14-29: Improved results organization with timestamped directories

Adding timestamp-based directory structure for benchmark results enables better tracking and comparison of benchmark runs over time.
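
A minimal sketch of building such a layout with pathlib; the base path, timestamp format, and the example benchmark/model names are placeholders:

from datetime import datetime
from pathlib import Path

run_dir = Path("experiments") / f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
out_dir = run_dir / "tool_calling_agent" / "model_name"  # per-benchmark / per-model subfolder
out_dir.mkdir(parents=True, exist_ok=True)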


35-45: Enhanced test structure with parameterized runs

The nested loop structure allows for running benchmarks with different extra tool call counts, creating a more organized directory structure for results comparison.


46-65: Updated benchmark execution with proper error handling

Benchmark execution is now properly encapsulated with try-except blocks and detailed error reporting.
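
A hedged sketch of the wrap-and-continue pattern; run_benchmark here is a local stub, and the real signature in test_models.py may differ:

import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def run_benchmark(model_name: str, out_dir: Path, extra_tool_calls: int) -> None:
    # Stand-in for the real entry point; raises to demonstrate the error path.
    raise RuntimeError("simulated failure")


def safe_run(model_name: str, out_dir: Path, extra_tool_calls: int) -> None:
    try:
        run_benchmark(model_name=model_name, out_dir=out_dir, extra_tool_calls=extra_tool_calls)
    except Exception:
        # Record the traceback and continue with the remaining model/benchmark combinations.
        logger.exception("Benchmark run failed for model %s", model_name)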

src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (8)

23-23: Improved task management with dynamic task generation

Importing the get_all_tasks function instead of a predefined list allows for dynamic task configuration based on parameters like extra tool calls.


29-41: Enhanced command-line interface with required arguments

Making model name and vendor required arguments and adding support for extra tool calls improves the CLI usability.


42-48: Improved output directory configuration

Adding a configurable output directory with a sensible default timestamp-based path improves flexibility and organization.


53-55: Updated benchmark function signature with new parameters

The function signature now includes extra_tool_calls and uses a more flexible output directory approach.


73-75: Dynamic task generation with configurable extra tool calls

Using get_all_tasks with the extra_tool_calls parameter enables flexible benchmark configuration.


77-82: Updated benchmark initialization with structured results directory

The benchmark is now initialized with explicit model name and results directory parameters, aligning with the new standardized results tracking approach.


92-92: Simplified benchmark run method call

Removing the redundant model_name parameter from the run_next call improves the code cleanliness.


97-102: Updated main function with new parameter passing

The main function now correctly passes all required parameters to the run_benchmark function.

src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (10)

38-38: Import renamed to use specific benchmark class.

The import is updated to use ManipulationO3DEBenchmark instead of the generic Benchmark, which aligns with the PR's goal of unifying benchmark implementations.


54-60: Good addition of standardized output directory parameter.

Adding the --out_dir parameter with a default timestamped directory follows the pattern established in other benchmarks and supports the PR's goal of consistent result storage across all benchmarks.
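
For reference, a generic way to express such a default with argparse (the path and help text are illustrative, not the file's exact code):

import argparse
from datetime import datetime


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--out_dir",
        type=str,
        default=f"experiments/run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}",
        help="Directory where results.csv, results_summary.csv and logs are written",
    )
    return parser.parse_args()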


64-64: Function signature updated for output directory handling.

The run_benchmark function now accepts the out_dir parameter, allowing for consistent output directory handling across the application.


89-91: Updated directory creation for consistent result storage.

Using the passed out_dir parameter to create the output directory and log file, rather than generating a new path internally, provides a more consistent approach to result storage.


155-166: Scenarios limited to trivial ones during refactoring.

The commenting out of more complex scenarios (easy, medium, hard, very hard) simplifies testing during this refactoring phase. However, ensure these are re-enabled for comprehensive testing once the refactoring is stable.

Is limiting to only trivial scenarios permanent or just for the initial testing of the benchmark infrastructure? Consider adding a TODO comment if these should be re-enabled later.


168-168: All scenarios simplified to only trivial ones.

This change aligns with the commented-out scenario imports, focusing only on trivial scenarios for initial testing of the benchmarking infrastructure.


172-178: Updated benchmark initialization with result directory path.

The benchmark is now initialized with ManipulationO3DEBenchmark and passes the out_dir as the results_dir parameter, aligning with the new BaseBenchmark inheritance structure that standardizes result handling.


179-179: Improved iteration by using direct scenario objects.

Changed from index-based iteration to direct iteration over scenario objects, which is a more Pythonic approach and improves code readability.


181-182: Improved system prompt handling with task-specific prompts.

Instead of using a fixed, hardcoded system prompt, the code now uses each scenario's task-specific system prompt, making the code more modular and flexible.


202-202: Main entry point updated to use new parameter.

The out_dir parameter is now properly passed to the run_benchmark function, completing the parameter flow from command line to execution.

src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (3)

45-45: Improved method naming with calculate_score instead of calculate_result.

The renaming from calculate_result to calculate_score better reflects the method's purpose - calculating a normalized score between 0.0 and 1.0 rather than a general result. This improves code clarity and follows the standardized scoring terminology established across benchmarks.

Also applies to: 82-82, 492-492


57-65: Good separation of system prompt and task prompt.

Splitting the single get_prompt method into two separate properties (system_prompt and task_prompt) follows best practices for LLM prompt engineering. This separation allows for:

  1. More modular prompt management
  2. Consistent system instructions across different tasks
  3. Task-specific prompts that focus only on the specific instructions

This design improvement aligns with modern LLM agent architectures and enables more flexible prompt handling.
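
A compact sketch of the split, with placeholder prompt text:

from abc import ABC, abstractmethod


class ManipulationTaskSketch(ABC):
    @property
    def system_prompt(self) -> str:
        # Shared role description and coordinate-system explanation, reused by every task.
        return "You are a robotic arm agent operating in a simulated scene."

    @property
    @abstractmethod
    def task_prompt(self) -> str:
        """Task-specific instructions supplied by each concrete task."""


class BuildTowerTaskSketch(ManipulationTaskSketch):
    @property
    def task_prompt(self) -> str:
        return "Stack all cubes into a single tower."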


380-389: Well-defined system prompt implementation.

The implementation of the system_prompt property in the ManipulationTask class provides clear, detailed instructions for the robotic arm, including:

  1. Description of the agent's role
  2. Explanation of the coordinate system
  3. Important initial actions to take

Moving this from a hardcoded string in the main script to the task class itself follows the single responsibility principle and improves maintainability.

src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (1)

29-65: Excellent structured result tracking with Pydantic models.

The introduction of Pydantic models for result tracking (SubTaskResult, ValidatorResult, TaskResult) is a significant improvement that:

  1. Enforces data validation through type checking
  2. Provides clear documentation with field descriptions
  3. Creates a well-structured hierarchy mirroring the benchmark's logical structure
  4. Enables consistent serialization/deserialization for results storage
  5. Supports the visualization and analysis goals mentioned in the PR

The hierarchical organization (Task > Validator > SubTask) accurately reflects the benchmark structure and will make result analysis more straightforward. The comprehensive field descriptions ensure that the purpose of each attribute is clear.

This is a high-quality implementation that modernizes the result tracking infrastructure.
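
A condensed sketch of how such a hierarchy can nest; field names not mentioned in this review (for example validation_info) are assumptions:

from typing import List

from pydantic import BaseModel, Field


class SubTaskResultSketch(BaseModel):
    passed: bool
    errors: List[str] = Field(default_factory=list)


class ValidatorResultSketch(BaseModel):
    type: str
    passed: bool
    subtasks: List[SubTaskResultSketch]


class TaskResultSketch(BaseModel):
    task_prompt: str
    model_name: str
    score: float = Field(..., ge=0.0, le=1.0)
    extra_tool_calls: int
    extra_tool_calls_used: int
    total_time: float
    validation_info: List[ValidatorResultSketch]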

src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)

169-170: Summary CSV is truncated every iteration

compute_and_save_summary() calls csv_initialize, which opens the summary file with mode "w".
Running it after each task means only the latest summary row is kept.

If your intent is “one row per incremental snapshot” then switch csv_initialize to append-if-exists (see earlier comment) or call it only once at startup.
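
One possible append-if-exists guard, sketched without the project's actual helper signatures:

import csv
from pathlib import Path
from typing import Sequence


def csv_initialize_if_missing(path: Path, fieldnames: Sequence[str]) -> None:
    # Write the header only when the file does not exist yet, so repeated
    # summary snapshots append rows instead of truncating earlier ones.
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            csv.DictWriter(f, fieldnames=list(fieldnames)).writeheader()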

src/rai_bench/rai_bench/results_processing/data_processing.py (1)

82-95:

❌ Incorrect review comment

Potential attribute mismatch: extra_tool_calls vs extra_tool_calls_used

Filtering uses r.extra_tool_calls, aggregation uses r.extra_tool_calls_used.

If TaskResult exposes only one of those attributes the code will raise AttributeError or silently filter nothing.

-    filtered_results = [
-        r for r in all_detailed_results if r.extra_tool_calls == extra_calls
-    ]
+    filtered_results = [
+        r for r in all_detailed_results if r.extra_tool_calls_used == extra_calls
+    ]

Please verify the canonical field name in TaskResult and use it consistently throughout this module (also lines 268-270).


🏁 Script executed:

#!/usr/bin/env bash
set -e

# Locate TaskResult class definition
rg -n "class TaskResult" .

# Extract the file defining TaskResult
TASK_RESULT_FILE=$(rg -l "class TaskResult" .)

# Show occurrences of both attributes in that file
rg -n "extra_tool_calls" "$TASK_RESULT_FILE"
rg -n "extra_tool_calls_used" "$TASK_RESULT_FILE"

# Show all references of both attributes across the codebase
rg -n "extra_tool_calls" .
rg -n "extra_tool_calls_used" .

Length of output: 743


🏁 Script executed:

#!/usr/bin/env bash
set -e

# Files where TaskResult is defined
FILES=(
  "src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py"
  "src/rai_bench/rai_bench/tool_calling_agent/messages/actions.py"
)

for file in "${FILES[@]}"; do
  echo "=== File: $file ==="
  rg -n "extra_tool_calls" "$file" || echo "no matches for extra_tool_calls"
  rg -n "extra_tool_calls_used" "$file" || echo "no matches for extra_tool_calls_used"
done

echo "=== Repository-wide occurrences ==="
rg -n "extra_tool_calls" . || true
rg -n "extra_tool_calls_used" . || true

Length of output: 20101


No field mismatch here – both attributes are defined and used correctly

The TaskResult model in
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py
defines both

  • extra_tool_calls (the allotted number of calls) and
  • extra_tool_calls_used (the actual number used)

The data‐processing logic intentionally

  • filters by r.extra_tool_calls (group by budget)
  • computes the average of r.extra_tool_calls_used (measure actual usage)

No change needed; you can safely ignore this suggestion.

Likely an incorrect or invalid review comment.
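
For clarity, a toy illustration of the budget-vs-usage split this resolution describes, with plain dicts standing in for TaskResult objects:

from statistics import mean

results = [
    {"extra_tool_calls": 2, "extra_tool_calls_used": 1},
    {"extra_tool_calls": 2, "extra_tool_calls_used": 2},
    {"extra_tool_calls": 5, "extra_tool_calls_used": 0},
]

budget = 2
in_budget = [r for r in results if r["extra_tool_calls"] == budget]  # group by allotted budget
avg_used = mean(r["extra_tool_calls_used"] for r in in_budget)       # aggregate actual usage
print(avg_used)  # 1.5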

src/rai_bench/rai_bench/results_processing/visualise.py (1)

319-325: Y-axis range inconsistent with earlier success-rate charts

Earlier charts treat avg_success_rate as percentage (0–100) while this one clamps the
axis to 0–1. Please confirm which unit is intended and keep it consistent across all
plots to avoid misinterpretation.

-        y_range=(0.0, 1.0),
+        # If success_rate is a percentage, use 0–100; otherwise adjust upstream
+        y_range=(0.0, 100.0),

Comment on lines +37 to +41
"--extra-tool-calls",
type=str,
help="Number of extra tools calls agent can make and still pass the task",
default=0,
)

⚠️ Potential issue

Type mismatch in extra-tool-calls argument

The extra-tool-calls argument is defined as a string but used as an integer later in the code.

Fix the type to avoid conversion issues:

-        type=str,
+        type=int,

Comment on lines +39 to +66
def __init__(
self,
model_name: str,
results_dir: Path,
logger: logging.Logger | None = None,
) -> None:
"""Initialize the base benchmark.

Parameters
----------
model_name : str
Name of the LLM model.
logger : Optional[loggers_type]
Logger instance.
results_filename : str
Path to the results file.
summary_filename : Optional[str]
Path to the summary file.
"""
self.model_name = model_name
self.results_filename = results_dir / "results.csv"
self.summary_filename = results_dir / "results_summary.csv"

if logger:
self.logger = logger
else:
self.logger = logging.getLogger(__name__)


🛠️ Refactor suggestion

⚠️ Potential issue

Create results_dir early & fix outdated docstring parameters

results_dir may not exist yet – any attempt to open results.csv/results_summary.csv for writing will raise FileNotFoundError.
Also, the docstring still refers to the long-gone results_filename / summary_filename parameters, which is misleading.

@@
-        self.model_name = model_name
-        self.results_filename = results_dir / "results.csv"
-        self.summary_filename = results_dir / "results_summary.csv"
+        self.model_name = model_name
+
+        # Ensure the directory tree exists before we start opening files
+        results_dir.mkdir(parents=True, exist_ok=True)
+
+        self.results_filename = results_dir / "results.csv"
+        self.summary_filename = results_dir / "results_summary.csv"

Please also update the parameter list in the docstring to match the actual signature.


Comment on lines +231 to 261
def compute_and_save_summary(self) -> None:
"""Compute summary statistics and save them to the summary file."""
self.logger.info("Computing and saving average results...")

success_count = sum(1 for r in self.scenario_results if r.score == 1.0)
success_rate = (
success_count / len(self.scenario_results) * 100
if self.scenario_results
else 0
)
avg_time = (
statistics.mean(r.total_time for r in self.scenario_results)
if self.scenario_results
else 0
)

# TODO (jm) extend this bechmark to implement extra tool calls
# since this benchmark doesn't have the concept of "extra tool calls",
# we use the total number of tool calls instead
total_tool_calls = sum(r.number_of_tool_calls for r in self.scenario_results)

summary = BenchmarkSummary(
model_name=self.model_name,
success_rate=round(success_rate, 2),
avg_time=round(avg_time, 3),
total_extra_tool_calls_used=total_tool_calls,
total_tasks=len(self.scenario_results),
)
self.csv_initialize(self.summary_filename, BenchmarkSummary)
self.csv_writerow(self.summary_filename, summary)


🛠️ Refactor suggestion

⚠️ Potential issue

Repeated csv_initialize() may overwrite the summary file on every scenario run

compute_and_save_summary() invokes self.csv_initialize(self.summary_filename, BenchmarkSummary) each time it is called.
If the helper recreates the file (or truncates an existing one), each call drops the previously written summary rows, so the CSV ultimately contains only the last scenario's aggregate rather than the whole benchmark history.

-        self.csv_initialize(self.summary_filename, BenchmarkSummary)
-        self.csv_writerow(self.summary_filename, summary)
+        # Initialise once (e.g. in __init__) – just append here
+        self.csv_writerow(self.summary_filename, summary)

Consider moving the initialise call to __init__, or protect it with an if not self.summary_filename.exists() guard inside a helper.


Comment on lines +47 to +64
def validate(self, tool_call: Dict[str, Any]) -> bool:
if self.specific_tool and tool_call.name != self.specific_tool:
raise SubTaskValidationError(
f"Expected tool {self.specific_tool}, got {tool_call.name}"
)

if self._outcomes is not None:
try:
should_pass = next(self._outcomes)
except StopIteration:
# if run out, default to True
should_pass = True
else:
should_pass = True

if not should_pass:
raise SubTaskValidationError(f"error in {self.name}")


⚠️ Potential issue

tool_call type annotation is misleading – causes mypy/IDE noise

DummySubTask.validate is declared as:

def validate(self, tool_call: Dict[str, Any]) -> bool:

but every test passes a ToolCall instance whose .name attribute is accessed.
Changing the annotation to Any | "ToolCall" (or defining a proper protocol) will remove false-positive type errors.

-    def validate(self, tool_call: Dict[str, Any]) -> bool:
+    from typing import Protocol
+
+    class _HasName(Protocol):
+        name: str
+
+    def validate(self, tool_call: _HasName) -> bool:  # noqa: D401

Comment on lines +110 to +114
BECNMARKS_CONVERTERS: Dict[str, Any] = {
"tool_calling_agent": convert_row_to_task_result,
"manipulation_o3de": convert_row_to_scenario_result,
}


⚠️ Potential issue

Typo – constant name

BECNMARKS_CONVERTERS misspells "BENCHMARKS" (the "C" and "N" are transposed and the "H" is dropped). Typo-prone names hurt discoverability.

-BECNMARKS_CONVERTERS: Dict[str, Any] = {
+BENCHMARK_CONVERTERS: Dict[str, Any] = {

Remember to update all call-sites (load_detailed_data, etc.).


Comment on lines +236 to +245
def load_detailed_data(file_path: str, benchmark: str) -> List[TaskResult]:
df = pd.read_csv(file_path) # type: ignore
task_results: List[TaskResult] = []

converter = BECNMARKS_CONVERTERS[benchmark]
for _, row in df.iterrows(): # type: ignore
task_result = converter(row)
task_results.append(task_result)

return task_results

🛠️ Refactor suggestion

Return type is incorrect when loading non-TCA benchmarks

load_detailed_data() is annotated to return List[TaskResult] but for "manipulation_o3de" it actually returns ScenarioResult objects. This mis-specification will surface in static analysis and IDEs.

-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any, Dict, List, Optional, Tuple, Union
 ...
-def load_detailed_data(...) -> List[TaskResult]:
+def load_detailed_data(...) -> List[Union[TaskResult, ScenarioResult]]:

Comment on lines +733 to +735
def render_manipulation_o3de(bench_results: BenchmarkResults): ...



⚠️ Potential issue

render_manipulation_o3de is a no-op – UI will appear blank

The stub body is just Ellipsis, so selecting the manipulation_o3de benchmark
renders nothing and gives users no feedback.

-def render_manipulation_o3de(bench_results: BenchmarkResults): ...
+def render_manipulation_o3de(bench_results: BenchmarkResults):
+    st.header("Manipulation O3DE Benchmark")
+    st.info("Detailed visualisations for this benchmark are not implemented yet.")

Until proper charts are added, showing an explanatory message prevents confusion.

