feat: benchmarks - results gathering and visualization #542
Conversation
ba31482 to 3d74cf8 (Compare)
Walkthrough

This update introduces a comprehensive overhaul of the benchmarking infrastructure, focusing on result tracking, analysis, and visualization. It standardizes result schemas using Pydantic models and establishes a new shared base benchmark class plus a results-processing package for loading, analyzing, and visualizing results.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant StreamlitUI
participant DataLoader
participant DataProcessor
participant Plotly
User->>StreamlitUI: Select run/benchmark/model
StreamlitUI->>DataLoader: Load results from disk
DataLoader-->>StreamlitUI: Structured RunResults/BenchmarkResults
StreamlitUI->>DataProcessor: Aggregate/analyze results
DataProcessor-->>StreamlitUI: DataFrames/statistics
StreamlitUI->>Plotly: Generate charts
Plotly-->>StreamlitUI: Rendered plots
StreamlitUI-->>User: Display tables, charts, analysis
sequenceDiagram
participant Benchmark (BaseBenchmark)
participant Task/Scenario
participant Validator
participant CSV
participant Summary
loop For each task/scenario
Benchmark->>Task/Scenario: Run task/scenario
Task/Scenario->>Validator: Validate output
Validator-->>Task/Scenario: Validation results/errors
Task/Scenario-->>Benchmark: Score, details
Benchmark->>CSV: Save detailed result
end
Benchmark->>Summary: Compute and save summary stats
Summary-->>CSV: Write summary CSV
Actionable comments posted: 9
🔭 Outside diff range comments (1)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
554-572: ⚠️ Potential issue
Guard against empty `validators` list to avoid ZeroDivisionError

`len(self.validators)` can legitimately be 0 (e.g. a mis-configured task). In that situation a division by zero will raise, crashing the whole benchmark run.

-        done_properly = 0
-        for validator in self.validators:
+        if not self.validators:
+            raise ValueError(
+                f"Task '{self.__class__.__name__}' has no validators configured."
+            )
+
+        done_properly = 0
+        for validator in self.validators:
             ...
-        return done_properly / len(self.validators)
+        return done_properly / len(self.validators)
🧹 Nitpick comments (16)
src/rai_bench/README.md (1)
185-210: Documentation update provides clear instructions for results visualization

The new "Results and Visualization" section clearly explains where benchmark results are stored and how to visualize them using Streamlit. However, there are a couple of minor grammar issues that should be fixed.

Consider these minor grammar improvements:

-If you run single benchmark test like:
+If you run a single benchmark test like:
-Results will be saved to dedicated directory named `<benchmark_name>`
+Results will be saved to a dedicated directory named `<benchmark_name>`

🧰 Tools
🪛 LanguageTool
[uncategorized] ~189-~189: You might be missing the article “a” here.
Context: ...](./rai_bench/experiments/) If you run single benchmark test like: When you run a tes...(AI_EN_LECTOR_MISSING_DETERMINER_A)
[uncategorized] ~196-~196: You might be missing the article “a” here.
Context: ...>/main.py ``` Results will be saved to dedicated directory named<benchmark_name>
Whe...(AI_EN_LECTOR_MISSING_DETERMINER_A)
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (2)
205-207: `errors` already contains lists – avoid an extra level of nesting

`send_score` accepts `List[List[str]]`, while you currently pass a list of (possibly empty) lists collected from every sub-task. If a validator has no errors you still insert `[]`, inflating the payload and producing confusing semicolons in the comment string ("; ").

-        errors = [
-            s.errors
-            for validator_info in validation_info
-            for s in validator_info.subtasks
-        ]
+        errors = [
+            s.errors
+            for validator_info in validation_info
+            for s in validator_info.subtasks
+            if s.errors  # keep only non-empty error lists
+        ]
214-220: Name shadowing with builtin `type` may bite downstream

`TaskResult` exposes a field named `type`. Using the same identifier as the builtin can lead to accidental shadowing when you later access `result.type` inside comprehensions or debug statements.

Consider renaming the attribute to `task_type` (and applying the same alias in the Pydantic model) to keep the public surface unambiguous. No diff provided because this change ripples through multiple files.
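If the rename goes ahead, a minimal sketch (assuming Pydantic v2; `TaskResultSketch` is a hypothetical stand-in, not the project's model) shows how an alias keeps the serialized `type` key while the Python attribute becomes unambiguous:

```python
from pydantic import BaseModel, Field


class TaskResultSketch(BaseModel):
    # Attribute name avoids shadowing the builtin; the alias keeps the
    # serialized key (and therefore the CSV header) as "type".
    task_type: str = Field(..., alias="type", description="Category of the task")

    model_config = {"populate_by_name": True}  # allow TaskResultSketch(task_type=...)


row = TaskResultSketch(task_type="manipulation")
print(row.model_dump(by_alias=True))  # {'type': 'manipulation'}
```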
src/rai_bench/rai_bench/tool_calling_agent/scores_tracing.py (2)
42-65: Stronger validation for numerical fields

You document `score` as "value between 0 and 1" and `extra_tool_calls_used` as non-negative, but no constraints enforce this. Pydantic has built-ins that can prevent illegal data early and surface bugs quicker:

-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, NonNegativeInt, confloat

-    extra_tool_calls_used: int = Field(
+    extra_tool_calls_used: NonNegativeInt = Field(
         ..., description="Total number of extra tool calls used in this Task"
     )
-    score: float = Field(
-        ...,
-        description="Value between 0 and 1, describing how many validation setps passed",
-    )
+    score: confloat(ge=0.0, le=1.0) = Field(
+        ..., description="Fraction of validation steps passed (0-1)."
+    )

The same applies to the summary model.
This small addition guards against negative counts or impossible scores leaking into persisted CSVs or analytics.
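For illustration, a self-contained sketch (field names mirror the ones above; assuming Pydantic v2) shows the constraints rejecting bad data before it ever reaches a CSV:

```python
from pydantic import BaseModel, Field, NonNegativeInt, ValidationError, confloat


class TaskResultSketch(BaseModel):
    extra_tool_calls_used: NonNegativeInt = Field(
        ..., description="Total number of extra tool calls used in this Task"
    )
    score: confloat(ge=0.0, le=1.0) = Field(
        ..., description="Fraction of validation steps passed (0-1)."
    )


try:
    TaskResultSketch(extra_tool_calls_used=-1, score=1.5)
except ValidationError as err:
    # Both violations (negative count, score > 1) are reported together.
    print(err)
```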
66-76: Field name `type` conflicts with the builtin

Like in `TaskResult`, using `type` as an attribute may lead to subtle bugs when serialising/deserialising or during introspection. Renaming it to `task_type` (with an alias if you need the CSV header unchanged) will avoid accidental shadowing.

task_type: str = Field(..., alias="type", description="Category of the task")

src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (2)
29-33: Superfluous `super().__init__()`?

`SubTask` currently has no explicit `__init__`; calling `super().__init__()` therefore just dispatches to `object.__init__()` and brings no value while slightly obscuring intent.

Unless you plan to add shared initialisation in the base class, consider removing these calls to keep the constructors lean.
45-51: `info` property is read-only – document it

Migrating from `dump()` to a read-only `info` property is a nice cleanup. Add a docstring to the property so consumers know it is side-effect-free and cheap to call.

-    @property
-    def info(self) -> Dict[str, Any]:
+    @property
+    def info(self) -> Dict[str, Any]:
+        """Static description of the expected tool-call arguments."""

tests/rai_bench/tool_calling_agent/test_validators.py (2)
35-46: `DummySubTask.validate` type annotation is misleading

The annotation says `tool_call: Dict[str, Any]` but the tests pass a `ToolCall` instance that exposes `.name`. Using the correct type improves static analysis and readability.

-    def validate(self, tool_call: Dict[str, Any]) -> bool:
+    def validate(self, tool_call: "ToolCall") -> bool:

(Quotes avoid a forward-reference import at the top of the test file.)
41-42: Unnecessary `super().__init__()` inside a plain class

`DummySubTask` does not inherit from `SubTask`; calling `super().__init__()` therefore does nothing. Removing the call eliminates a minor source of confusion without affecting behaviour.

-        super().__init__()
src/rai_bench/rai_bench/tool_calling_agent/validators.py (1)
42-45: Avoid shadowing the built-in `type` identifier

Although legal, naming the public property `type` hides Python's built-in `type`, which can lead to confusing stack-traces and makes IDE auto-completion noisier. Consider a more specific name, e.g. `validator_type`.

-    @property
-    def type(self) -> str:
-        return "ordered"
+    @property
+    def validator_type(self) -> str:
+        return "ordered"

src/rai_bench/rai_bench/results_processing/data_loading.py (2)
214-218: Simplify boolean return – remove double negative

`check_if_summaries_present` can be expressed more idiomatically and avoids the Ruff SIM103 warning:

-        if not self.runs or not any(run.benchmark_summaries for run in self.runs):
-            return False
-        return True
+        return bool(self.runs) and any(run.benchmark_summaries for run in self.runs)

🧰 Tools
🪛 Ruff (0.8.2)
215-217: Return the negated condition directly
Inline condition
(SIM103)
131-145: Streamlit dependency leaks into a pure data-loading layer

Functions `load_detailed_data` and `load_summary_data` call `st.error`, tightly coupling this utility module to Streamlit and making it unusable in non-UI contexts (e.g., CLI batch processing, tests).

Consider raising a custom exception or returning an `Either`-like result and leaving presentation-layer concerns (toast / log / `st.error`) to the caller.
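A minimal sketch of that separation (assuming the loaders read CSV files with pandas; names and signatures here are illustrative, not the module's actual API):

```python
import pandas as pd


class ResultsLoadError(Exception):
    """Raised when benchmark result files are missing or malformed."""


def load_detailed_data(path: str) -> pd.DataFrame:
    # Pure data layer: no Streamlit import, no UI side effects.
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.ParserError) as exc:
        raise ResultsLoadError(f"Could not load detailed results from {path}") from exc


# Presentation layer (Streamlit app, CLI, tests) decides how to report the failure:
# try:
#     df = load_detailed_data(selected_file)
# except ResultsLoadError as err:
#     st.error(str(err))
```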
src/rai_bench/rai_bench/results_processing/data_processing.py (1)
336-347: Unused loop key – address Ruff B007 and clarify intent

Variable `key` isn't used inside the loop and may confuse readers. If the key is only needed for iteration you can replace it with `_`; otherwise include it in the DataFrame row (e.g. as `subtask_id`) for traceability.

-for key, stats in subtask_stats.items():
+for _, stats in subtask_stats.items():

or

 for key, stats in subtask_stats.items():
     subtask_rows.append(
         {
+            "subtask_id": key,
🧰 Tools
🪛 Ruff (0.8.2)
336-336: Loop control variable
key
not used within loop body(B007)
src/rai_bench/rai_bench/results_processing/visualise.py (3)
41-44: Avoid hard-coded experiment directory

Using a relative, hard-coded path makes the app brittle when the package is installed elsewhere or executed from a different working directory. Expose it via CLI flag, environment variable, or Streamlit sidebar.

-EXPERIMENT_DIR = "./src/rai_bench/rai_bench/experiments"
+EXPERIMENT_DIR = os.getenv(
+    "RAI_BENCH_EXPERIMENT_DIR",
+    os.path.join(os.getcwd(), "experiments"),  # sensible default
+)
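Note the suggested snippet also needs `import os`; put together (environment-variable name as proposed in the diff above), the module-level constant becomes:

```python
import os

# Resolve the experiment directory from the environment, falling back to
# ./experiments under the current working directory.
EXPERIMENT_DIR = os.getenv(
    "RAI_BENCH_EXPERIMENT_DIR",
    os.path.join(os.getcwd(), "experiments"),
)
```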
42-44: Remove unused constants

`DETAILED_FILE_NAME` and `SUMMARY_FILE_NAME` are defined but never referenced in this module, which is a small smell and may confuse future readers.
603-628: Duplicate sub-task rendering code – use the helper instead

The block repeats the logic already encapsulated in `display_subtask_analysis()`. Keeping two copies risks divergence and increases maintenance cost.

-    # Subtasks table
-    ...
-    st.dataframe(error_df, use_container_width=True)  # type:ignore
-
-    # Additional analysis: group errors by tool name
-    if not error_df.empty:
-        ...
-        st.dataframe(error_by_tool_df, use_container_width=True)  # type:ignore
+    # Re-use the common helper to avoid duplication
+    display_subtask_analysis(subtask_df, tool_name_df, error_df)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
src/rai_bench/poetry.lock is excluded by `!**/*.lock`
📒 Files selected for processing (22)
- src/rai_bench/README.md (1 hunks)
- src/rai_bench/pyproject.toml (1 hunks)
- src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (4 hunks)
- src/rai_bench/rai_bench/examples/test_models.py (1 hunks)
- src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (4 hunks)
- src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (2 hunks)
- src/rai_bench/rai_bench/results_processing/__init__.py (1 hunks)
- src/rai_bench/rai_bench/results_processing/data_loading.py (1 hunks)
- src/rai_bench/rai_bench/results_processing/data_processing.py (1 hunks)
- src/rai_bench/rai_bench/results_processing/visualise.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (3 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/scores_tracing.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (2 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (8 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/validators.py (3 hunks)
- tests/rai_bench/tool_calling_agent/test_subtasks.py (1 hunks)
- tests/rai_bench/tool_calling_agent/test_validators.py (14 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (9)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (1)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
Task
(447-571)
tests/rai_bench/tool_calling_agent/test_subtasks.py (3)
tests/rai_bench/tool_calling_agent/test_validators.py (1)
info
(68-69)src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
info
(56-58)src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
info
(46-51)info
(85-91)info
(125-131)info
(165-171)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (1)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (2)
parse_args
(27-50)run_benchmark
(53-90)
src/rai_bench/rai_bench/examples/test_models.py (2)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (1)
run_benchmark
(53-90)src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (1)
run_benchmark
(68-210)
tests/rai_bench/tool_calling_agent/test_validators.py (3)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (7)
SubTaskValidationError
(28-29)Validator
(353-444)validate
(51-52)validate
(443-444)validate
(554-571)info
(56-58)dump_results
(421-440)src/rai_bench/rai_bench/tool_calling_agent/validators.py (4)
validate
(46-75)validate
(95-144)OrderedCallsValidator
(29-75)NotOrderedCallsValidator
(78-144)tests/rai_bench/tool_calling_agent/test_subtasks.py (2)
validate
(37-38)info
(44-45)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (3)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
info
(44-45)tests/rai_bench/tool_calling_agent/test_validators.py (1)
info
(68-69)src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
info
(56-58)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (4)
src/rai_bench/rai_bench/tool_calling_agent/scores_tracing.py (2)
SubTaskResult
(29-32)ValidatorResult
(35-39)tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
info
(44-45)tests/rai_bench/tool_calling_agent/test_validators.py (1)
info
(68-69)src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
info
(46-51)info
(85-91)info
(125-131)info
(165-171)
src/rai_bench/rai_bench/tool_calling_agent/validators.py (4)
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
validate
(37-38)tests/rai_bench/tool_calling_agent/test_validators.py (2)
validate
(47-65)ToolCall
(28-31)src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (7)
validate
(51-52)validate
(443-444)validate
(554-571)SubTaskValidationError
(28-29)add_subtask_errors
(403-409)required_calls
(400-401)required_calls
(525-530)src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
validate
(34-43)validate
(68-82)validate
(108-122)validate
(148-162)
src/rai_bench/rai_bench/results_processing/visualise.py (2)
src/rai_bench/rai_bench/results_processing/data_loading.py (4)
ModelResults
(207-258)RunResults
(273-282)get_available_runs
(177-195)load_benchmark_results
(321-380)src/rai_bench/rai_bench/results_processing/data_processing.py (7)
analyze_subtasks
(274-392)create_extra_calls_dataframe
(57-106)create_model_summary_dataframe
(28-54)create_task_details_dataframe
(172-213)create_task_metrics_dataframe
(118-169)get_available_extra_tool_calls
(250-271)get_unique_values_from_results
(216-247)
🪛 LanguageTool
src/rai_bench/README.md
[uncategorized] ~189-~189: You might be missing the article “a” here.
Context: ...](./rai_bench/experiments/) If you run single benchmark test like: When you run a tes...
(AI_EN_LECTOR_MISSING_DETERMINER_A)
[uncategorized] ~196-~196: You might be missing the article “a” here.
Context: ...>/main.py ``` Results will be saved to dedicated directory named <benchmark_name>
Whe...
(AI_EN_LECTOR_MISSING_DETERMINER_A)
🪛 Ruff (0.8.2)
src/rai_bench/rai_bench/results_processing/data_loading.py
215-217: Return the negated condition directly
Inline condition
(SIM103)
src/rai_bench/rai_bench/results_processing/data_processing.py
336-336: Loop control variable key
not used within loop body
(B007)
🔇 Additional comments (32)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)
498-502
: Consistent Task Typing withtype
Property
Adding thetype
property alignsNavigationTask
with other task classes, enabling uniform task identification across validation, reporting, and visualization pipelines. This supports filtering and grouping by task type in downstream modules.src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1)
47-50
: Standardizing Task Metadata withtype
Property
The newtype
property ensuresBasicTask
is identified as"basic"
in result structures, maintaining consistency across the tool calling agent framework.src/rai_bench/rai_bench/results_processing/__init__.py (1)
1-14
: Package Marker File Added
The__init__.py
file correctly marksresults_processing
as a Python package, enabling imports of the new data loading, processing, and visualization modules. No further action needed here.src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)
48-52
: Addingtype
Property toManipulationTask
Introducing thetype
property standardizes task categorization as"manipulation"
, aligning with other task classes and facilitating consistent handling in the benchmarking and visualization modules.src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1)
55-57
: Implementation oftype
property aligns with task identification standardsThe addition of the read-only
type
property returns a consistent string identifier for this task class, which aligns with the abstract property requirement from the baseTask
class and standardizes task type identification across the benchmark framework.src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (3)
837-841
: Good abstraction: new base class for custom interface tasksCreating this abstract base class improves the inheritance hierarchy by providing a common implementation of the
type
property for all custom interface tasks. This reduces code duplication and standardizes type identification.
843-843
: Class inheritance updated correctlyThe
CustomInterfacesTopicTask
now inherits from the newCustomInterfaceTask
abstract base class, ensuring it receives the standardizedtype
property implementation.
876-876
: Class inheritance updated correctlyThe
CustomInterfacesServiceTask
now inherits from the newCustomInterfaceTask
abstract base class, ensuring it receives the standardizedtype
property implementation.tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
43-45
: Test fixture updated to match implementation changesAdding the
info
property to theConcreteSubTask
test class aligns with the framework's refactoring fromdump()
methods toinfo
properties. This ensures the test fixtures correctly mirror the actual implementation patterns.src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (5)
58-64
: Good addition of configurable output directory parameter.The addition of a timestamped default output directory improves results organization and makes the benchmark more flexible.
68-68
: Function signature updated to support output directory parameter.The
run_benchmark
function signature is correctly updated to accept the newout_dir
parameter, which aligns with similar changes in other benchmark modules.
102-103
: Properly creating output directory structure.The code correctly creates the output directory and ensures parent directories exist, which is a good practice for file operations.
185-185
: Results file path constructed from output directory.The results file path is properly derived from the output directory parameter, ensuring consistent file organization.
215-215
: Correctly passing output directory parameter to benchmark function.The
out_dir
parameter is properly passed from the command line arguments to the benchmark function, completing the implementation.src/rai_bench/rai_bench/examples/test_models.py (5)
14-14
: Added datetime import for timestamp generation.The datetime module is correctly imported to support generation of timestamped output directories.
22-23
: Modified benchmark configuration lists.The benchmark list is now focused on "manipulation_o3de" and a new extra_tool_calls list has been added with a value of 5. This change supports the enhanced parameterized benchmark runs.
26-29
: Implemented timestamped base output directory.Creating a base directory with timestamp ensures unique and organized storage of benchmark results for each run.
36-45
: Enhanced directory structure creation for benchmark runs.The nested loops now properly handle multiple benchmark configurations by creating a structured directory hierarchy based on benchmark type, model name, and extra tool calls. This organization will make result analysis more manageable.
46-65
: Added robust error handling for benchmark execution.The try-except block around benchmark runs is a good addition that prevents a single failed benchmark from stopping the entire test suite. The error message includes detailed information about which benchmark failed and why.
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (6)
23-23
: Switched to dynamic task generation.Changed from importing static
all_tasks
to the dynamicget_all_tasks
function, which supports parameterized task creation.
33-35
: Made model-name and vendor parameters required.Correctly enforced required parameters for benchmark execution, which prevents errors from missing critical configuration.
37-48
: Added configurable extra-tool-calls and output directory parameters.These new parameters enhance benchmark flexibility:
extra-tool-calls
allows configuring tolerance for additional tool callsout_dir
allows specifying where results are storedThe default output directory uses a timestamp format consistent with other benchmark scripts.
53-53
: Updated function signature to support new parameters.Properly updated the
run_benchmark
function to accept the newout_dir
andextra_tool_calls
parameters, enabling the new functionality.
74-74
: Implemented dynamic task generation with extra tool calls.Now correctly uses the
get_all_tasks
function with the specifiedextra_tool_calls
parameter, enabling dynamic task generation based on the configuration.
95-100
: Updated main execution to pass new parameters.The main execution block properly passes all parameters to the
run_benchmark
function, completing the implementation of the enhanced benchmark configuration.src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (7)
16-16
: Added Sequence to imports.Added
Sequence
import to support return type annotation in theget_spatial_tasks
function.
295-330
: Refactored basic tasks into dynamic generation function.Converted static basic task list into a function that generates tasks with configurable
extra_tool_calls
parameter. This enables more flexible benchmarking with varying tolerance levels.
333-351
: Refactored navigation tasks into dynamic generation function.Converted static navigation task list into a function that generates tasks with configurable
extra_tool_calls
parameter. Consistently applies the same pattern as other task types.
354-366
: Refactored manipulation tasks into dynamic generation function.Converted static manipulation task list into a function that generates tasks with configurable
extra_tool_calls
parameter, maintaining consistency with the other task type refactorings.
369-383
: Refactored custom interface tasks into dynamic generation function.Converted static custom interface task list into a function that generates tasks with configurable
extra_tool_calls
parameter, continuing the pattern established for other task types.
386-403
: Implemented spatial tasks dynamic generation function.Created a function to dynamically generate both true and false spatial tasks with configurable
extra_tool_calls
parameter. Note the return type isSequence[Task]
rather thanList[Task]
to accommodate the combined list return.
406-415
: Implemented unified task generation function.Created a comprehensive
get_all_tasks
function that:
- Collects tasks from all specialized task generators
- Passes the
extra_tool_calls
parameter to each- Shuffles the tasks for randomized execution
- Returns the combined task list
This enables flexible benchmark configuration through a single parameter.
a02955e to c7d7edc (Compare)
visualize tasks per type
fixed number of tool calls and extra calss being returned
visualised results by extra tool call aviailable dynamic adjustments of bar width
loading results into data models for more clarity
added unit test for that case
new base benchmark class update visualise to manipulation bench
c7d7edc to 4f10bae (Compare)
@CodeRabbit full review
Actionable comments posted: 7
🔭 Outside diff range comments (3)
src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (1)
295-416: 🛠️ Refactor suggestion
Good refactoring to dynamic task generation with configurable tool calls.

Replacing static task lists with dynamic task-generating functions provides several benefits:

- Runtime configurability of the `extra_tool_calls` parameter
- Consistent parameter application across all task types
- More flexible benchmark configuration
- Better code organization with clear function boundaries
This change aligns well with the PR's goal of standardizing benchmark handling and supports the new structured results tracking.
However, having both the static task lists (lines 233-292) and the new dynamic functions creates duplication that could lead to confusion.
Consider removing the static task lists (`basic_tasks`, `navigation_tasks`, etc.) as they're now replaced by the dynamic functions. If they're kept for backward compatibility, add a deprecation comment to make it clear they'll be removed in the future.

src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (1)
29-33: ⚠️ Potential issue
Property vs. method – interface mismatch with `SubTask.info()`

`SubTask.info` is still defined as a method (see interfaces.py) but subclasses were converted to a property. Call-sites written against the older contract will now raise `TypeError: 'dict' object is not callable`.

Two options:

- Keep the property but update the base class + all usages:

-class SubTask(ABC):
-    def info(self) -> Dict[str, Any]:
-        ...
+class SubTask(ABC):
+    @property
+    @abstractmethod
+    def info(self) -> Dict[str, Any]:
+        ...

- Revert these changes and keep `def info(self) -> Dict[str, Any]:`.

Pick one style and apply it consistently across the codebase and tests.
Also applies to: 45-51
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)
114-117
:⚠️ Potential issueUsing
HumanMultimodalMessage
for non-multimodal tasks may break some model back-endsWhen the task is not a
SpatialReasoningAgentTask
, the code still wraps the prompt inHumanMultimodalMessage
(without images).
Some model routes expect plainHumanMessage
and will reject multimodal payloads.Consider falling back to
HumanMessage
for non-image tasks:- for state in agent.stream( - {"messages": [HumanMultimodalMessage(content=task.get_prompt())]}, + for state in agent.stream( + {"messages": [HumanMessage(content=task.get_prompt())]},(or add
images=[]
if the backend tolerates it).
♻️ Duplicate comments (6)
src/rai_bench/pyproject.toml (1)
14-14
:⚠️ Potential issueMissing Streamlit Dependency
The visualization script (
visualise_results.py
) introduced in this PR relies on Streamlit, butpyproject.toml
only addsplotly
. You must addstreamlit
(and any other new visualization dependencies) under[tool.poetry.dependencies]
to avoid runtime import errors during result visualization.src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (2)
84-91
: Same issue as above forCheckTopicFieldsToolCallSubTask
Fix will be covered once
SubTask
contract is clarified.
124-131
: Same issue as above forCheckServiceFieldsToolCallSubTask
&CheckActionFieldsToolCallSubTask
See earlier comment.
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)
63-65
: Ensureresults_dir
exists before CSV initializationIf the directory doesn’t exist,
csv_initialize
will fail here. Creating it inBaseBenchmark
(see earlier suggestion) solves the problem globally.src/rai_bench/rai_bench/tool_calling_agent/validators.py (1)
106-112
:⚠️ Potential issuePossible negative
extra_calls_used
in early-exit pathIf the agent finishes all subtasks exactly on the last allowed tool call,
i - self.required_calls
yields-1
.- self.extra_calls_used = i - self.required_calls + self.extra_calls_used = max(0, i - self.required_calls)This mirrors the fix already applied in
OrderedCallsValidator
.src/rai_bench/rai_bench/results_processing/visualise.py (1)
74-76
:⚠️ Potential issueBar width still shrinks for small bar counts (previous feedback not applied)
The scaling formula here makes bars narrower when the number of bars is below the
max_full_width_bars
threshold (e.g., 3 bars → width = 0.24).
In the earlier review we suggested the inverse logic so that full width is kept for small n and only shrinks when n exceeds the threshold.- scale = min(n_bars, max_full_width_bars) / max_full_width_bars - width = base_width * scale + # Keep full width when bar count is small; scale down only when crowded + scale = 1.0 if n_bars <= max_full_width_bars else max_full_width_bars / n_bars + width = base_width * scaleThis provides a consistent, more readable chart for short bar series.
🧹 Nitpick comments (21)
src/rai_bench/rai_bench/examples/test_models.py (1)
66-66
: Redundant continue statementThe
continue
statement at line 66 is unnecessary as it's at the end of the loop.- continue
src/rai_bench/rai_bench/base_benchmark.py (3)
67-83
:csv_initialize
overwrites the file each time – is that intentional?Every call opens the file with mode
"w"
which truncates any earlier summary.
If you intend to keep historical summaries (e.g. one row per incremental run), switch to append mode when the header already exists.- with open(filename, mode="w", newline="", encoding="utf-8") as file: + mode = "w" if not filename.exists() else "a" + with open(filename, mode=mode, newline="", encoding="utf-8") as file:If truncating is deliberate, please leave a comment explaining the rationale so future readers don’t treat this as a bug.
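If appending is the intent, a small sketch (assumed signature, not the project's actual `csv_initialize`) that writes the header only when the file is first created:

```python
import csv
from pathlib import Path
from typing import Sequence


def csv_initialize(filename: Path, fieldnames: Sequence[str]) -> None:
    # Append to an existing summary file; emit the header only on first creation.
    write_header = not filename.exists()
    with open(filename, mode="a", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
```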
103-107
: Prefer the model class for headers
base_model_instance.__annotations__
relies on attribute look-up falling back to the class.
Being explicit is clearer and side-steps edge cases (e.g. a field called__annotations__
on the instance).- writer = csv.DictWriter( - file, fieldnames=base_model_instance.__annotations__.keys() - ) + writer = csv.DictWriter( + file, fieldnames=type(base_model_instance).__annotations__.keys() + )
125-125
: Open TODO or remove it
# TODO (jm) this can be probably same for all benchmark in the future
does not have an accompanying issue reference.
Either open a GitHub issue and link it here or replace the comment with a concrete plan; otherwise it will be forgotten.src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (2)
58-65
: Duplicate attribute –self.tasks_results
is unused
self.task_results
is declared on line 60 and actively used later.
self.tasks_results
(plural) on line 63 is never referenced and risks confusion.- self.tasks_results: List[TaskResult] = []
Delete the stray attribute.
172-179
: Guard against division / mean on empty listAlthough currently called after at least one task, future refactors (early summary, test harness, etc.) could invoke
compute_and_save_summary
before any results exist, causingZeroDivisionError
orStatisticsError
.- success_rate = success_count / len(self.task_results) * 100 - avg_time = statistics.mean(r.total_time for r in self.task_results) + total = len(self.task_results) or 1 # prevent div/0 + success_rate = success_count / total * 100 + avg_time = ( + statistics.mean(r.total_time for r in self.task_results) + if self.task_results + else 0.0 + )src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (2)
235-241
: Exact float comparison can under-count successessuccess_count = sum(1 for r in self.scenario_results if r.score == 1.0)
calculate_score()
is likely to return a float. Relying on strict equality may miss values such as0.999999
that should semantically map to success.-success_count = sum(1 for r in self.scenario_results if r.score == 1.0) +EPS = 1e-6 +success_count = sum(1 for r in self.scenario_results if r.score >= 1.0 - EPS)At minimum, use
math.isclose()
with an absolute/relative tolerance or compare against a threshold.
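A hedged sketch of the `math.isclose` variant (the `ScenarioResult` stand-in and the tolerance value are illustrative, not the benchmark's actual model):

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ScenarioResult:  # stand-in for the benchmark's result model
    score: float


def count_successes(results: Iterable[ScenarioResult]) -> int:
    # Treat scores within a small absolute tolerance of 1.0 as a success.
    return sum(1 for r in results if math.isclose(r.score, 1.0, abs_tol=1e-6))
```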
166-204
:last_msg
is built but never used – remove or log it consistently
last_msg
is assigned inside the streaming loop, yet its only usage is inside a debug line under the same branch.
If the value is useful for downstream analysis, consider logging it in a structured way; otherwise drop the variable to avoid dead code and noisy conditionals.tests/rai_bench/tool_calling_agent/test_validators.py (1)
72-95
:assert_dumped
helper: valuable but missing negative-case coverageThe helper validates many fields but silently returns the
result
.
Adding assertions for unexpected keys (e.g. that no unknown fields appear) or validating dataclass types (if applicable) would catch accidental schema regressions early.src/rai_bench/rai_bench/results_processing/data_processing.py (2)
136-153
:getattr(result, group_by)
may raise – guard for invalidgroup_by
create_task_metrics_dataframe
trusts the caller to pass a valid attribute name.
A typo will raiseAttributeError
and break the whole visualisation pipeline.Consider:
if not hasattr(TaskResult, group_by): raise ValueError(f"Invalid group_by field '{group_by}'")or fallback to
None
and skip such rows.
300-334
: JSON key built but never used – unnecessary overhead
key = json.dumps(args, sort_keys=True)
is computed, yet onlykey not in subtask_stats
is evaluated; the string is never logged or returned.
Storing the raw dict (withfrozenset(args.items())
as key) would avoid serialisation overhead and keep look-ups fast.src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5)
54-58
: Doc-string still references removeddump()
methodThe interface now exposes an
info
property instead of adump()
method, but the class-level doc-string (lines 42-48) still instructs implementers to overridedump()
. This is confusing for new contributors and tooling that relies on doc-strings.- dump() - Abstract method that subclasses must implement to serialize the subtask configuration. + info (property) + Returns a serialisable dictionary describing the sub-task’s expected arguments.
394-398
: Avoid using the built-in nametype
for public APIBoth
Validator
andTask
expose atype
property. Shadowing the built-intype
can be surprising in debugging sessions and hampers IDE code-completion. Consider renaming to something more explicit, e.g.validator_type
/task_type
.- @property - @abstractmethod - def type(self) -> str: + @property + @abstractmethod + def validator_type(self) -> str:(Apply the same rename in subclasses and everywhere the property is consumed.)
411-419
: Minor grammar & clarity inreset()
doc-string“resets all values refering previous validation” → “Reset all state from the previous validation run.”
(Only the comment needs updating; code logic is fine.)
421-450
:dump_results()
returns mutable objects that can be tampered with by callers
dump_results()
hands out the liveValidatorResult
object which in turn contains the sameSubTaskResult
instances held internally. External mutation after dumping will silently alter the validator’s history.If immutability is important consider returning a deep copy:
- result = ValidatorResult( + result = ValidatorResult( ... ) - return result + # prevent external mutation + return result.model_copy(deep=True)
515-517
: Method name suggests serialisation but only aggregates
dump_validators()
does not serialise data, it merely aggregatesValidatorResult
objects. A clearer name such ascollect_validator_results()
would make call-sites self-explanatory.src/rai_bench/rai_bench/results_processing/data_loading.py (2)
271-289
: Non-deterministic ordering of runs
os.listdir()
returns entries in arbitrary order which causes run ordering (and therefore visualisation order) to vary across platforms and executions. A simplesorted()
ensures deterministic output:- for d in os.listdir(experiment_dir) + for d in sorted(os.listdir(experiment_dir))
295-324
: Silent discard of malformed UUIDs
uuid.UUID(row["run_id"])
will raiseValueError
and abort the whole load if a single row contains an invalid UUID. Consider wrapping this intry/except
so that one bad row does not kill the entire dataset load, possibly logging a warning via Streamlit instead.src/rai_bench/rai_bench/tool_calling_agent/validators.py (2)
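A sketch of that row-level guard (hypothetical helper; the real loader builds full result objects rather than bare IDs):

```python
import csv
import uuid
from typing import List


def load_run_ids(csv_path: str) -> List[uuid.UUID]:
    run_ids: List[uuid.UUID] = []
    with open(csv_path, newline="", encoding="utf-8") as file:
        for row in csv.DictReader(file):
            try:
                run_ids.append(uuid.UUID(row["run_id"]))
            except (KeyError, ValueError):
                # One malformed run_id should not abort the whole load.
                print(f"Skipping row with invalid run_id: {row.get('run_id')!r}")
    return run_ids
```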
31-33
: Typo in class doc-string
subtaks
→subtasks
.
58-63
: Iterator exhaustion can lose the last errorWhen the final subtask fails on the last tool call, the
StopIteration
branch is never reached and the surroundingfor
loop exits normally, falling through to the generic “Validation failed” block. Variableu
still refers to the last subtask index, so this is fine; nonetheless consider explicitly logging the collectedself.subtasks_errors[u]
here to make debugging easier.src/rai_bench/rai_bench/results_processing/visualise.py (1)
149-154
: Unused accumulator introduces dead code
model_counts
is populated but never read afterwards. Removing it (or using it) will simplify the function and avoid confusion.- # Dictionary to store aggregated metrics for each model - model_metrics: Dict[str, Dict[str, Any]] = {} - - # Track total counts for averaging - model_counts: Dict[str, int] = {} + # Dictionary to store aggregated metrics for each model + model_metrics: Dict[str, Dict[str, Any]] = {}And delete the later
model_counts[model_name] += …
line.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Lite
⛔ Files ignored due to path filters (1)
src/rai_bench/poetry.lock is excluded by `!**/*.lock`
📒 Files selected for processing (34)
- src/rai_bench/README.md (1 hunks)
- src/rai_bench/pyproject.toml (1 hunks)
- src/rai_bench/rai_bench/base_benchmark.py (1 hunks)
- src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (5 hunks)
- src/rai_bench/rai_bench/examples/manipulation_o3de/scenarios.py (14 hunks)
- src/rai_bench/rai_bench/examples/test_models.py (2 hunks)
- src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (4 hunks)
- src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (2 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (5 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (5 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/results_tracking.py (1 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1 hunks)
- src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1 hunks)
- src/rai_bench/rai_bench/results_processing/__init__.py (1 hunks)
- src/rai_bench/rai_bench/results_processing/data_loading.py (1 hunks)
- src/rai_bench/rai_bench/results_processing/data_processing.py (1 hunks)
- src/rai_bench/rai_bench/results_processing/visualise.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (3 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (2 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (8 hunks)
- src/rai_bench/rai_bench/tool_calling_agent/validators.py (3 hunks)
- tests/rai_bench/manipulation_o3de/tasks/test_task.py (1 hunks)
- tests/rai_bench/tool_calling_agent/test_subtasks.py (1 hunks)
- tests/rai_bench/tool_calling_agent/test_validators.py (14 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (14)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
task_prompt
(79-81)src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
task_prompt
(64-68)src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
task_prompt
(41-45)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
task_prompt
(57-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
task_prompt
(49-50)src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
task_prompt
(53-59)tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
task_prompt
(26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
task_prompt
(51-62)src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
task_prompt
(64-68)src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
task_prompt
(41-45)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
task_prompt
(57-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
task_prompt
(49-50)src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
task_prompt
(53-59)tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
task_prompt
(26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
task_prompt
(51-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
task_prompt
(79-81)src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
task_prompt
(64-68)src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
task_prompt
(41-45)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
task_prompt
(49-50)src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
task_prompt
(53-59)tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
task_prompt
(26-27)
src/rai_bench/rai_bench/examples/test_models.py (2)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (1)
run_benchmark
(64-197)src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (1)
run_benchmark
(53-92)
src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
task_prompt
(51-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
task_prompt
(79-81)src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
task_prompt
(64-68)src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
task_prompt
(41-45)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
task_prompt
(57-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
task_prompt
(49-50)tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
task_prompt
(26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
task_prompt
(51-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
task_prompt
(79-81)src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
task_prompt
(64-68)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
task_prompt
(57-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
task_prompt
(49-50)src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
task_prompt
(53-59)tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
task_prompt
(26-27)
src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (7)
src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
task_prompt
(51-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
task_prompt
(79-81)src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (1)
task_prompt
(64-68)src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
task_prompt
(41-45)src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
task_prompt
(57-62)src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
task_prompt
(53-59)tests/rai_bench/manipulation_o3de/tasks/test_task.py (1)
task_prompt
(26-27)
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (4)
src/rai_bench/rai_bench/examples/tool_calling_agent/tasks.py (1)
get_all_tasks
(406-415)src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (2)
parse_args
(44-61)run_benchmark
(64-197)src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
set_logger
(492-495)src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (2)
ToolCallingAgentBenchmark
(43-193)run_next
(66-170)
tests/rai_bench/tool_calling_agent/test_subtasks.py (3)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
info
(56-58)tests/rai_bench/tool_calling_agent/test_validators.py (1)
info
(68-69)src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
info
(46-51)info
(85-91)info
(125-131)info
(165-171)
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (4)
src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (1)
ManipulationO3DEBenchmark
(76-265)src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (2)
parse_args
(27-50)run_benchmark
(53-92)src/rai_core/rai/agents/conversational_agent.py (2)
agent
(34-46)create_conversational_agent
(49-81)src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (2)
system_prompt
(59-60)system_prompt
(381-389)
src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (3)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (1)
info
(56-58)tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
info
(44-45)tests/rai_bench/tool_calling_agent/test_validators.py (1)
info
(68-69)
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (5)
src/rai_bench/rai_bench/base_benchmark.py (6)
BaseBenchmark
(36-125)BenchmarkSummary
(24-33)csv_initialize
(68-82)run_next
(110-118)csv_writerow
(85-107)compute_and_save_summary
(121-123)src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (9)
Task
(457-581)info
(56-58)get_prompt
(554-562)get_tool_calls_from_messages
(505-513)validate
(51-52)validate
(453-454)validate
(564-581)dump_validators
(515-516)get_system_prompt
(543-551)src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (4)
ScoreTracingHandler
(67-108)TaskResult
(42-64)get_callbacks
(75-76)send_score
(79-108)src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (2)
SpatialReasoningAgentTask
(37-71)get_system_prompt
(70-71)src/rai_bench/rai_bench/manipulation_o3de/benchmark.py (2)
run_next
(149-229)compute_and_save_summary
(231-265)
src/rai_bench/rai_bench/tool_calling_agent/validators.py (4)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (8)
validate
(51-52)validate
(453-454)validate
(564-581)reset
(411-419)SubTaskValidationError
(28-29)add_subtask_errors
(403-409)required_calls
(400-401)required_calls
(535-540)tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
validate
(37-38)tests/rai_bench/tool_calling_agent/test_validators.py (2)
validate
(47-65)ToolCall
(28-31)src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
validate
(34-43)validate
(68-82)validate
(108-122)validate
(148-162)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (4)
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (2)
SubTaskResult
(29-32)ValidatorResult
(35-39)tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
info
(44-45)tests/rai_bench/tool_calling_agent/test_validators.py (1)
info
(68-69)src/rai_bench/rai_bench/tool_calling_agent/tasks/subtasks.py (4)
info
(46-51)info
(85-91)info
(125-131)info
(165-171)
🔇 Additional comments (63)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (3)
832-836
: Task hierarchy improvementGood addition of the
CustomInterfaceTask
base class with a type property. This creates a consistent approach to task type identification across the benchmarking framework.
838-838
: Updated inheritance hierarchyThe refactoring to use the new
CustomInterfaceTask
base class ensures consistent type identification and aligns with similar changes in other task modules.
871-871
: Updated inheritance hierarchyThe refactoring to use the new
CustomInterfaceTask
base class ensures consistent type identification and aligns with similar changes in other task modules.src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)
499-502
: Task type property implementationGood addition of the
type
property to theNavigationTask
class. This standardized identification mechanism helps with task filtering, result tracking, and visualization as part of the broader benchmark infrastructure overhaul.src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1)
47-50
: Task type property implementationGood addition of the
type
property to theBasicTask
class. This standardized identification mechanism helps with task filtering, result tracking, and visualization as part of the broader benchmark infrastructure overhaul.src/rai_bench/rai_bench/results_processing/__init__.py (1)
1-13
: LGTM: New package marker for results processingThis file properly marks the new
results_processing
directory as a Python package. The Apache 2.0 license header is correctly included.src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)
49-52
: LGTM: Consistent task type identificationThe addition of the
type
property aligns with the standardization of task type identification across the codebase, making it easier to categorize and filter tasks by their type.src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (1)
55-58
: LGTM: Consistent task type identificationThe addition of the
type
property maintains consistency with other task classes, facilitating task classification and selection in the benchmark framework.src/rai_bench/rai_bench/manipulation_o3de/tasks/place_cubes_task.py (1)
48-50
: LGTM: Method converted to property for consistent interfaceConverting
get_prompt()
to thetask_prompt
property aligns with the interface standardization across task classes. This makes the API more consistent and follows Python conventions for read-only properties.src/rai_bench/rai_bench/manipulation_o3de/tasks/place_at_coord_task.py (1)
56-62
: Method converted to property to align with interface changes.The
get_prompt
method has been refactored into atask_prompt
property, maintaining the same functionality. This change aligns with the updated interface inManipulationTask
where prompt access is standardized across all tasks.src/rai_bench/rai_bench/manipulation_o3de/tasks/group_objects_task.py (1)
50-62
: Method converted to property to align with interface changes.The
get_prompt
method has been properly converted to atask_prompt
property while preserving the existing functionality. This refactoring provides a consistent interface across all task classes.src/rai_bench/rai_bench/manipulation_o3de/tasks/build_tower_task.py (1)
78-81
: Method converted to property to align with interface changes.The
get_prompt
method has been correctly refactored into atask_prompt
property while maintaining the same functionality. This aligns with the standardized prompt access pattern implemented across all task classes.src/rai_bench/rai_bench/manipulation_o3de/tasks/rotate_object_task.py (1)
52-59
: Method converted to property to align with interface changes.The
get_prompt
method has been appropriately refactored into atask_prompt
property. The implementation logic remains unchanged, continuing to generate a descriptive prompt about rotating objects to a target quaternion orientation.src/rai_bench/rai_bench/manipulation_o3de/tasks/move_object_to_left_task.py (1)
40-41
: Method refactored to property for consistent interfaceThe
get_prompt
method has been properly converted to a read-only propertytask_prompt
, maintaining the same functionality while aligning with the refactored interface used throughout the codebase. This change improves consistency with other task classes.src/rai_bench/rai_bench/examples/manipulation_o3de/scenarios.py (16)
21-21
: Import updated to use refactored class nameThe import statement has been updated to use
ManipulationO3DEBenchmark
instead ofBenchmark
, reflecting the class renaming refactoring that's part of the broader benchmark restructuring.
89-93
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
102-106
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
171-176
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
191-195
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
199-203
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
287-292
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
296-301
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
313-317
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
333-337
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
421-425
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
429-433
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
445-450
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
467-471
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
537-542
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.
558-562
: Method call updated to use refactored class nameThe call to
create_scenarios
has been correctly updated to use the refactored class name.src/rai_bench/README.md (1)
185-210
: Well-documented results storage and visualization sectionThe new documentation section clearly explains the results storage structure and provides helpful instructions for visualizing benchmark results using Streamlit. This addition aligns well with the new benchmark results handling functionality.
tests/rai_bench/tool_calling_agent/test_subtasks.py (1)
43-45
: Addedinfo
property to align with interface changesThe implementation of the
info
property in the mock subtask correctly aligns with the interface changes in theSubTask
class, wheredump()
has been replaced with aninfo
property. This follows good practice by properly mocking the complete interface.tests/rai_bench/manipulation_o3de/tasks/test_task.py (2)
25-31
: Interface design updated to use properties instead of methodsThe task interface has been refactored to use properties for
task_prompt
andsystem_prompt
instead of methods. This change improves readability and aligns with the interface updates in the main benchmarking infrastructure.
36-36
: Method renamed for clarity and consistencyRenaming
calculate_result
tocalculate_score
improves clarity and maintains consistency with the naming conventions used in other benchmark components.src/rai_bench/rai_bench/manipulation_o3de/results_tracking.py (1)
18-33
: Well-structured result model using PydanticThe
ScenarioResult
class provides a clean, type-safe way to track benchmark results. The use of Pydantic with descriptive field metadata ensures data integrity and documentation.This model aligns with the broader effort to standardize result tracking across benchmarks, enabling consistent data processing and visualization in the new pipeline.
src/rai_bench/rai_bench/examples/test_models.py (3)
14-29
: Improved results organization with timestamped directoriesAdding timestamp-based directory structure for benchmark results enables better tracking and comparison of benchmark runs over time.
35-45
: Enhanced test structure with parameterized runsThe nested loop structure allows for running benchmarks with different extra tool call counts, creating a more organized directory structure for results comparison.
46-65
: Updated benchmark execution with proper error handlingBenchmark execution is now properly encapsulated with try-except blocks and detailed error reporting.
src/rai_bench/rai_bench/examples/tool_calling_agent/main.py (8)
23-23
: Improved task management with dynamic task generationImporting the
get_all_tasks
function instead of a predefined list allows for dynamic task configuration based on parameters like extra tool calls.
29-41
: Enhanced command-line interface with required argumentsMaking model name and vendor required arguments and adding support for extra tool calls improves the CLI usability.
42-48
: Improved output directory configurationAdding a configurable output directory with a sensible default timestamp-based path improves flexibility and organization.
53-55
: Updated benchmark function signature with new parametersThe function signature now includes
extra_tool_calls
and uses a more flexible output directory approach.
73-75
: Dynamic task generation with configurable extra tool callsUsing
get_all_tasks
with theextra_tool_calls
parameter enables flexible benchmark configuration.
77-82
: Updated benchmark initialization with structured results directoryThe benchmark is now initialized with explicit model name and results directory parameters, aligning with the new standardized results tracking approach.
92-92
: Simplified benchmark run method callRemoving the redundant model_name parameter from the run_next call improves the code cleanliness.
97-102
: Updated main function with new parameter passingThe main function now correctly passes all required parameters to the run_benchmark function.
src/rai_bench/rai_bench/examples/manipulation_o3de/main.py (10)
38-38
: Import renamed to use specific benchmark class.The import is updated to use
ManipulationO3DEBenchmark
instead of the genericBenchmark
, which aligns with the PR's goal of unifying benchmark implementations.
54-60
: Good addition of standardized output directory parameter.Adding the
--out_dir
parameter with a default timestamped directory follows the pattern established in other benchmarks and supports the PR's goal of consistent result storage across all benchmarks.
64-64
: Function signature updated for output directory handling.The
run_benchmark
function now accepts theout_dir
parameter, allowing for consistent output directory handling across the application.
89-91
: Updated directory creation for consistent result storage.Using the passed
out_dir
parameter to create the output directory and log file, rather than generating a new path internally, provides a more consistent approach to result storage.
155-166
: Scenarios limited to trivial ones during refactoring.The commenting out of more complex scenarios (easy, medium, hard, very hard) simplifies testing during this refactoring phase. However, ensure these are re-enabled for comprehensive testing once the refactoring is stable.
Is limiting to only trivial scenarios permanent or just for the initial testing of the benchmark infrastructure? Consider adding a TODO comment if these should be re-enabled later.
168-168
: All scenarios simplified to only trivial ones.This change aligns with the commented-out scenario imports, focusing only on trivial scenarios for initial testing of the benchmarking infrastructure.
172-178
: Updated benchmark initialization with result directory path.The benchmark is now initialized with
ManipulationO3DEBenchmark
and passes theout_dir
as theresults_dir
parameter, aligning with the newBaseBenchmark
inheritance structure that standardizes result handling.
179-179
: Improved iteration by using direct scenario objects.Changed from index-based iteration to direct iteration over scenario objects, which is a more Pythonic approach and improves code readability.
181-182
: Improved system prompt handling with task-specific prompts.Instead of using a fixed, hardcoded system prompt, the code now uses each scenario's task-specific system prompt, making the code more modular and flexible.
202-202
: Main entry point updated to use new parameter.The
out_dir
parameter is now properly passed to therun_benchmark
function, completing the parameter flow from command line to execution.src/rai_bench/rai_bench/manipulation_o3de/interfaces.py (3)
45-45
: Improved method naming withcalculate_score
instead ofcalculate_result
.The renaming from
calculate_result
tocalculate_score
better reflects the method's purpose - calculating a normalized score between 0.0 and 1.0 rather than a general result. This improves code clarity and follows the standardized scoring terminology established across benchmarks.Also applies to: 82-82, 492-492
57-65
: Good separation of system prompt and task prompt.Splitting the single
get_prompt
method into two separate properties (system_prompt
andtask_prompt
) follows best practices for LLM prompt engineering. This separation allows for:
- More modular prompt management
- Consistent system instructions across different tasks
- Task-specific prompts that focus only on the specific instructions
This design improvement aligns with modern LLM agent architectures and enables more flexible prompt handling.
380-389
: Well-defined system prompt implementation.The implementation of the
system_prompt
property in theManipulationTask
class provides clear, detailed instructions for the robotic arm, including:
- Description of the agent's role
- Explanation of the coordinate system
- Important initial actions to take
Moving this from a hardcoded string in the main script to the task class itself follows the single responsibility principle and improves maintainability.
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (1)
29-65
: Excellent structured result tracking with Pydantic models.The introduction of Pydantic models for result tracking (
SubTaskResult
,ValidatorResult
,TaskResult
) is a significant improvement that:
- Enforces data validation through type checking
- Provides clear documentation with field descriptions
- Creates a well-structured hierarchy mirroring the benchmark's logical structure
- Enables consistent serialization/deserialization for results storage
- Supports the visualization and analysis goals mentioned in the PR
The hierarchical organization (Task > Validator > SubTask) accurately reflects the benchmark structure and will make result analysis more straightforward. The comprehensive field descriptions ensure that the purpose of each attribute is clear.
This is a high-quality implementation that modernizes the result tracking infrastructure.
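To make the hierarchy concrete, here is a minimal, hypothetical sketch of what such nested Pydantic models can look like. The class names follow the ones described above, but the individual fields are illustrative assumptions rather than the exact schema shipped in this PR.

from typing import List
from pydantic import BaseModel, Field

class SubTaskResult(BaseModel):
    # Illustrative fields only; the real model in results_tracking.py may differ.
    args: str = Field("", description="Arguments of the validated tool call")
    errors: List[str] = Field(default_factory=list, description="Validation errors")
    passed: bool = Field(..., description="Whether this subtask passed")

class ValidatorResult(BaseModel):
    subtasks: List[SubTaskResult] = Field(default_factory=list)
    passed: bool = Field(..., description="True only if every subtask passed")

class TaskResult(BaseModel):
    task_prompt: str = Field(..., description="Prompt given to the agent")
    score: float = Field(..., ge=0.0, le=1.0, description="Fraction of validators passed")
    validation_info: List[ValidatorResult] = Field(default_factory=list)

# One row of the results CSV could then be serialized with (pydantic v2):
row = TaskResult(task_prompt="List all ROS 2 topics", score=1.0,
                 validation_info=[ValidatorResult(subtasks=[], passed=True)])
print(row.model_dump_json())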
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)
169-170
: Summary CSV is truncated every iteration
compute_and_save_summary()
callscsv_initialize
, which opens the summary file with mode"w"
.
Running it after each task means only the latest summary row is kept.If your intent is “one row per incremental snapshot” then switch
csv_initialize
to append-if-exists (see earlier comment) or call it only once at startup.src/rai_bench/rai_bench/results_processing/data_processing.py (1)
82-95
:❌ Incorrect review comment
Potential attribute mismatch:
extra_tool_calls
vsextra_tool_calls_used
Filtering uses
r.extra_tool_calls
, aggregation usesr.extra_tool_calls_used
.If
TaskResult
exposes only one of those attributes the code will raiseAttributeError
or silently filter nothing.- filtered_results = [ - r for r in all_detailed_results if r.extra_tool_calls == extra_calls - ] + filtered_results = [ + r for r in all_detailed_results if r.extra_tool_calls_used == extra_calls + ]Please verify the canonical field name in
TaskResult
and use it consistently throughout this module (also lines 268-270).
🏁 Script executed:
#!/usr/bin/env bash set -e # Locate TaskResult class definition rg -n "class TaskResult" . # Extract the file defining TaskResult TASK_RESULT_FILE=$(rg -l "class TaskResult" .) # Show occurrences of both attributes in that file rg -n "extra_tool_calls" "$TASK_RESULT_FILE" rg -n "extra_tool_calls_used" "$TASK_RESULT_FILE" # Show all references of both attributes across the codebase rg -n "extra_tool_calls" . rg -n "extra_tool_calls_used" .Length of output: 743
🏁 Script executed:
#!/usr/bin/env bash set -e # Files where TaskResult is defined FILES=( "src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py" "src/rai_bench/rai_bench/tool_calling_agent/messages/actions.py" ) for file in "${FILES[@]}"; do echo "=== File: $file ===" rg -n "extra_tool_calls" "$file" || echo "no matches for extra_tool_calls" rg -n "extra_tool_calls_used" "$file" || echo "no matches for extra_tool_calls_used" done echo "=== Repository-wide occurrences ===" rg -n "extra_tool_calls" . || true rg -n "extra_tool_calls_used" . || trueLength of output: 20101
No field mismatch here – both attributes are defined and used correctly
The
TaskResult
model in
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py
defines both
extra_tool_calls
(the allotted number of calls) andextra_tool_calls_used
(the actual number used)The data‐processing logic intentionally
- filters by
r.extra_tool_calls
(group by budget)- computes the average of
r.extra_tool_calls_used
(measure actual usage)No change needed; you can safely ignore this suggestion.
Likely an incorrect or invalid review comment.
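A tiny, made-up illustration of that distinction: group rows by the allotted budget (extra_tool_calls), then average the actual usage (extra_tool_calls_used) inside each group.

rows = [
    {"extra_tool_calls": 0, "extra_tool_calls_used": 0},
    {"extra_tool_calls": 2, "extra_tool_calls_used": 1},
    {"extra_tool_calls": 2, "extra_tool_calls_used": 2},
]
budget = 2
filtered = [r for r in rows if r["extra_tool_calls"] == budget]  # filter by budget
avg_used = sum(r["extra_tool_calls_used"] for r in filtered) / len(filtered)
print(avg_used)  # 1.5 – average calls actually spent within that budget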
src/rai_bench/rai_bench/results_processing/visualise.py (1)
319-325
: Y-axis range inconsistent with earlier success-rate chartsEarlier charts treat
avg_success_rate
as percentage (0–100) while this one clamps the
axis to 0–1. Please confirm which unit is intended and keep it consistent across all
plots to avoid misinterpretation.
- y_range=(0.0, 1.0),
+ # If success_rate is a percentage, use 0–100; otherwise adjust upstream
+ y_range=(0.0, 100.0),
"--extra-tool-calls", | ||
type=str, | ||
help="Number of extra tools calls agent can make and still pass the task", | ||
default=0, | ||
) |
Type mismatch in extra-tool-calls argument
The extra-tool-calls argument is defined as a string but used as an integer later in the code. Fix the type to avoid conversion issues:
- type=str,
+ type=int,
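For reference, a small stand-alone sketch (not the benchmark's actual CLI) of what type=int buys you: argparse performs the conversion and validation before the value is ever used.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--extra-tool-calls", type=int, default=0)
args = parser.parse_args(["--extra-tool-calls", "3"])
# args.extra_tool_calls is already an int, so arithmetic and comparisons work directly
print(args.extra_tool_calls + 1)  # 4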
def __init__(
    self,
    model_name: str,
    results_dir: Path,
    logger: logging.Logger | None = None,
) -> None:
    """Initialize the base benchmark.

    Parameters
    ----------
    model_name : str
        Name of the LLM model.
    logger : Optional[loggers_type]
        Logger instance.
    results_filename : str
        Path to the results file.
    summary_filename : Optional[str]
        Path to the summary file.
    """
    self.model_name = model_name
    self.results_filename = results_dir / "results.csv"
    self.summary_filename = results_dir / "results_summary.csv"

    if logger:
        self.logger = logger
    else:
        self.logger = logging.getLogger(__name__)
🛠️ Refactor suggestion
Create results_dir early & fix outdated docstring parameters
results_dir may not exist yet – any attempt to open results.csv / results_summary.csv for writing will raise FileNotFoundError.
Also, the docstring still refers to the long-gone results_filename / summary_filename parameters, which is misleading.
@@
- self.model_name = model_name
- self.results_filename = results_dir / "results.csv"
- self.summary_filename = results_dir / "results_summary.csv"
+ self.model_name = model_name
+
+ # Ensure the directory tree exists before we start opening files
+ results_dir.mkdir(parents=True, exist_ok=True)
+
+ self.results_filename = results_dir / "results.csv"
+ self.summary_filename = results_dir / "results_summary.csv"
Please also update the parameter list in the docstring to match the actual signature.
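A minimal reproduction of the failure mode, with hypothetical paths, showing why the directory tree has to exist before the CSV files are opened:

from pathlib import Path

results_dir = Path("out") / "run_2023-04-30_12-36-49" / "tool_calling_agent" / "gpt-4o"
# open() creates files but never intermediate directories, so without this line
# the first write raises FileNotFoundError.
results_dir.mkdir(parents=True, exist_ok=True)
with open(results_dir / "results.csv", "w") as f:
    f.write("task,score\n")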
def compute_and_save_summary(self) -> None:
    """Compute summary statistics and save them to the summary file."""
    self.logger.info("Computing and saving average results...")

    success_count = sum(1 for r in self.scenario_results if r.score == 1.0)
    success_rate = (
        success_count / len(self.scenario_results) * 100
        if self.scenario_results
        else 0
    )
    avg_time = (
        statistics.mean(r.total_time for r in self.scenario_results)
        if self.scenario_results
        else 0
    )

    # TODO (jm) extend this bechmark to implement extra tool calls
    # since this benchmark doesn't have the concept of "extra tool calls",
    # we use the total number of tool calls instead
    total_tool_calls = sum(r.number_of_tool_calls for r in self.scenario_results)

    summary = BenchmarkSummary(
        model_name=self.model_name,
        success_rate=round(success_rate, 2),
        avg_time=round(avg_time, 3),
        total_extra_tool_calls_used=total_tool_calls,
        total_tasks=len(self.scenario_results),
    )
    self.csv_initialize(self.summary_filename, BenchmarkSummary)
    self.csv_writerow(self.summary_filename, summary)
|
🛠️ Refactor suggestion
Repeated csv_initialize() may overwrite the summary file on every scenario run
compute_and_save_summary() invokes self.csv_initialize(self.summary_filename, BenchmarkSummary) each time it is called. If the helper recreates the file (or truncates an existing one), this will continuously drop the previously-written summary rows, resulting in a CSV that finally contains only the last scenario's aggregate, not the whole benchmark history.
- self.csv_initialize(self.summary_filename, BenchmarkSummary)
- self.csv_writerow(self.summary_filename, summary)
+ # Initialise once (e.g. in __init__) – just append here
+ self.csv_writerow(self.summary_filename, summary)
Consider moving the initialise call to __init__, or protect it with an if not self.summary_filename.exists() guard inside a helper.
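One possible shape of that guard, written as a self-contained sketch with the csv module rather than the project's own csv_initialize/csv_writerow helpers: the header is written only when the file is first created, and every later call just appends.

import csv
from pathlib import Path
from typing import Sequence

def append_summary_row(path: Path, header: Sequence[str], row: Sequence[object]) -> None:
    """Append one summary row, writing the header only when the file is created."""
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(header)
        writer.writerow(row)

# Illustrative values:
append_summary_row(Path("results_summary.csv"),
                   header=["model_name", "success_rate", "avg_time"],
                   row=["gpt-4o", 75.0, 4.2])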
def validate(self, tool_call: Dict[str, Any]) -> bool:
    if self.specific_tool and tool_call.name != self.specific_tool:
        raise SubTaskValidationError(
            f"Expected tool {self.specific_tool}, got {tool_call.name}"
        )

    if self._outcomes is not None:
        try:
            should_pass = next(self._outcomes)
        except StopIteration:
            # if run out, default to True
            should_pass = True
    else:
        should_pass = True

    if not should_pass:
        raise SubTaskValidationError(f"error in {self.name}")
|
tool_call type annotation is misleading – causes mypy/IDE noise
DummySubTask.validate is declared as:
def validate(self, tool_call: Dict[str, Any]) -> bool:
but every test passes a ToolCall instance whose .name attribute is accessed. Changing the annotation to Any | "ToolCall" (or defining a proper protocol) will remove false-positive type errors.
- def validate(self, tool_call: Dict[str, Any]) -> bool:
+ from typing import Protocol
+
+ class _HasName(Protocol):
+ name: str
+
+ def validate(self, tool_call: _HasName) -> bool: # noqa: D401
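A self-contained sketch of the Protocol route; ToolCall here is a stand-in dataclass, not the benchmark's real message type.

from dataclasses import dataclass, field
from typing import Protocol

class HasName(Protocol):
    name: str

@dataclass
class ToolCall:  # stand-in for the real tool-call object used in the tests
    name: str
    args: dict = field(default_factory=dict)

def validate(tool_call: HasName) -> bool:
    # mypy accepts any object exposing a `name: str` attribute, so both the real
    # ToolCall and lightweight test doubles type-check without casts.
    return tool_call.name == "get_ros2_topics"

print(validate(ToolCall(name="get_ros2_topics")))  # True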
BECNMARKS_CONVERTERS: Dict[str, Any] = {
    "tool_calling_agent": convert_row_to_task_result,
    "manipulation_o3de": convert_row_to_scenario_result,
}
|
Typo – constant name
BECNMARKS_CONVERTERS
is a misspelling of "BENCHMARKS" (the C and N are swapped and the H is dropped). Typo-prone names hurt discoverability.
-BECNMARKS_CONVERTERS: Dict[str, Any] = {
+BENCHMARK_CONVERTERS: Dict[str, Any] = {
Remember to update all call-sites (load_detailed_data, etc.).
def load_detailed_data(file_path: str, benchmark: str) -> List[TaskResult]:
    df = pd.read_csv(file_path)  # type: ignore
    task_results: List[TaskResult] = []

    converter = BECNMARKS_CONVERTERS[benchmark]
    for _, row in df.iterrows():  # type: ignore
        task_result = converter(row)
        task_results.append(task_result)

    return task_results
🛠️ Refactor suggestion
Return type is incorrect when loading non-TCA benchmarks
load_detailed_data() is annotated to return List[TaskResult], but for "manipulation_o3de" it actually returns ScenarioResult objects. This mis-specification will surface in static analysis and IDEs.
-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any, Dict, List, Optional, Tuple, Union
...
-def load_detailed_data(...) -> List[TaskResult]:
+def load_detailed_data(...) -> List[Union[TaskResult, ScenarioResult]]:
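For illustration, a self-contained version of the widened signature; the two model classes and the converters below are simplified stand-ins for the ones defined in the results-processing package, not the project's actual code.

from typing import Callable, Dict, List, Union

import pandas as pd
from pydantic import BaseModel

class TaskResult(BaseModel):      # simplified stand-in
    task_prompt: str = ""
    score: float = 0.0

class ScenarioResult(BaseModel):  # simplified stand-in
    task_prompt: str = ""
    score: float = 0.0

ResultRow = Union[TaskResult, ScenarioResult]

BENCHMARK_CONVERTERS: Dict[str, Callable[[pd.Series], ResultRow]] = {
    "tool_calling_agent": lambda row: TaskResult(**row.to_dict()),
    "manipulation_o3de": lambda row: ScenarioResult(**row.to_dict()),
}

def load_detailed_data(file_path: str, benchmark: str) -> List[ResultRow]:
    df = pd.read_csv(file_path)
    converter = BENCHMARK_CONVERTERS[benchmark]
    return [converter(row) for _, row in df.iterrows()]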
def render_manipulation_o3de(bench_results: BenchmarkResults): ...
|
render_manipulation_o3de is a no-op – UI will appear blank
The stub returns Ellipsis, so selecting the manipulation_o3de benchmark renders nothing but also gives users no feedback.
-def render_manipulation_o3de(bench_results: BenchmarkResults): ...
+def render_manipulation_o3de(bench_results: BenchmarkResults):
+ st.header("Manipulation O3DE Benchmark")
+ st.info("Detailed visualisations for this benchmark are not implemented yet.")
Until proper charts are added, showing an explanatory message prevents confusion.
Purpose
Update what is saved as results from benchmarks, and how.
Create a script for visualizing these results in a meaningful way.
Proposed Changes
src/rai_bench/rai_bench/examples/test_models.py
will now produce a separate folder for every run, structured like the tree below (a short sketch of how such a run directory can be created follows the tree):
└── run_2023-04-30_12-36-49/
├── <bench_name_1>/
│ ├── <model_name_1>/
│ │ ├── benchmark.log
│ │ ├── results_summary.csv
│ │ └── results.csv
│ └── <model_name_2>/
│ ├── benchmark.log
│ ├── results_summary.csv
│ └── results.csv
└── <bench_name_2>/
├── <model_name_1>/
│ ├── benchmark.log
│ ├── results_summary.csv
│ └── results.csv
└── <model_name_2>/
├── benchmark.log
├── results_summary.csv
└── results.csv
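A minimal sketch (an assumed helper, not the actual test_models.py code) of how such a per-run tree can be created; the base path and model names are placeholders:

from datetime import datetime
from pathlib import Path
from typing import List

def make_run_dir(base: Path, benchmarks: List[str], models: List[str]) -> Path:
    run_dir = base / f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    for bench in benchmarks:
        for model in models:
            # benchmark.log, results.csv and results_summary.csv are then written here
            (run_dir / bench / model).mkdir(parents=True, exist_ok=True)
    return run_dir

run = make_run_dir(Path("results"),
                   ["tool_calling_agent", "manipulation_o3de"],
                   ["model_a", "model_b"])
print(run)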
rai_bench/results_processing – a new package for loading, aggregating, and visualizing the gathered results (Streamlit app)
Issues
#530
#531
Testing
get some results:
then run:
Summary by CodeRabbit
New Features
Enhancements
Bug Fixes
Refactor
Chores
Tests