Add multi-step tool-calling SDG tutorial for workplace assistant#327
Add multi-step tool-calling SDG tutorial for workplace assistant#327shashank3959 wants to merge 3 commits intoNVIDIA-NeMo:mainfrom
Conversation
Add notebook, tool definitions, and utility modules for generating synthetic multi-step tool-calling training data using Data Designer. Includes dual-level LLM judge filtering and NeMo Gym export. Signed-off-by: Shashank Verma <shashankv@nvidia.com>
|
Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text: I have read the DCO document and I hereby sign the DCO. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot. |
Greptile OverviewGreptile SummaryAdds comprehensive multi-step tool-calling synthetic data generation tutorial using Data Designer. Implements a complete pipeline for generating realistic workplace assistant queries with simulated agent trajectories, dual-level LLM judge filtering for quality control, and NeMo Gym export format compatibility. Key Components:
Style Issue:
|
| Filename | Overview |
|---|---|
| docs/colab_notebooks/5-multistep-toolcalling/multistep-toolcalling.ipynb | Comprehensive tutorial notebook for multi-step tool-calling SDG with clear examples and dual-level quality filtering |
| docs/colab_notebooks/5-multistep-toolcalling/utils/init.py | Package initialization - missing NVIDIA license headers required per AGENTS.md |
| docs/colab_notebooks/5-multistep-toolcalling/utils/convert_to_nemo_gym_format.py | NeMo Gym format converter with proper type hints - missing NVIDIA license headers required per AGENTS.md |
| docs/colab_notebooks/5-multistep-toolcalling/utils/quality_filtering.py | Quality filtering utilities with dual-level validation - missing NVIDIA license headers required per AGENTS.md |
| docs/colab_notebooks/5-multistep-toolcalling/tools/environment.json | Environment configuration with 27 multi-step patterns covering all tool combinations |
Sequence Diagram
sequenceDiagram
participant User
participant DataDesigner
participant LLM
participant QualityFilter
participant NeMoGym
User->>DataDesigner: Load tool schemas & seed data
DataDesigner->>LLM: Generate user query from pattern
LLM-->>DataDesigner: Return user query
DataDesigner->>LLM: Judge user query (feasibility, schema compliance)
LLM-->>DataDesigner: Return query scores
DataDesigner->>LLM: Generate trajectory (tool calls)
LLM-->>DataDesigner: Return agent trajectory
DataDesigner->>LLM: Judge trajectory (tool validity, completeness)
LLM-->>DataDesigner: Return trajectory scores
DataDesigner->>QualityFilter: Filter by dual-level scores
QualityFilter->>QualityFilter: Stage 1: Validate query
QualityFilter->>QualityFilter: Stage 2: Validate trajectory
QualityFilter-->>User: Return filtered dataset
User->>NeMoGym: Convert to NeMo Gym JSONL format
NeMoGym-->>User: Training data ready for RL
Last reviewed commit: eb2b52b
…olcalling.ipynb Signed-off-by: Shashank Verma <shashankv@nvidia.com>
- quality_filtering.py: remove FilterThresholds dataclass, quickstart print, and verbose ASCII output; de-duplicate show_rejection_reasons (285 → 85 lines) - convert_to_nemo_gym_format.py: remove factory pattern and quickstart print (108 → 76 lines) - __init__.py: export only 4 functions (24 → 10 lines) - Notebook: fix imports to use data_designer.config/interface instead of removed essentials module; use functools.partial for converter Signed-off-by: Shashank Verma <shashankv@nvidia.com>
70ab957 to
eb2b52b
Compare
| @@ -0,0 +1,9 @@ | |||
| from .quality_filtering import filter_high_quality, show_rejection_reasons | |||
There was a problem hiding this comment.
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
| from .quality_filtering import filter_high_quality, show_rejection_reasons | |
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| from .quality_filtering import filter_high_quality, show_rejection_reasons |
Context Used: Context from dashboard - AGENTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/__init__.py
Line: 1:1
Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from .quality_filtering import filter_high_quality, show_rejection_reasons
```
**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,75 @@ | |||
| """Utilities for converting generated records to NeMo Gym JSONL format.""" | |||
There was a problem hiding this comment.
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
| """Utilities for converting generated records to NeMo Gym JSONL format.""" | |
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| """Utilities for converting generated records to NeMo Gym JSONL format.""" |
Context Used: Context from dashboard - AGENTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/convert_to_nemo_gym_format.py
Line: 1:1
Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Utilities for converting generated records to NeMo Gym JSONL format."""
```
**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,86 @@ | |||
| """Utilities for dual-level quality filtering of generated datasets.""" | |||
There was a problem hiding this comment.
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
| """Utilities for dual-level quality filtering of generated datasets.""" | |
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| """Utilities for dual-level quality filtering of generated datasets.""" |
Context Used: Context from dashboard - AGENTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/quality_filtering.py
Line: 1:1
Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Utilities for dual-level quality filtering of generated datasets."""
```
**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))
How can I resolve this? If you propose a fix, please make it concise.
Add notebook, tool definitions, and utility modules for generating synthetic multi-step tool-calling training data using Data Designer. Includes dual-level LLM judge filtering and NeMo Gym export.