Video tutorial improve by suiyoubi · Pull Request #1367 · NVIDIA-NeMo/Curator

suiyoubi · 2026-01-13T14:18:08Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Praateek <praateekm@gmail.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot · 2026-01-13T14:18:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

suiyoubi · 2026-01-13T14:18:30Z

/ok to test c8f7aa9

greptile-apps · 2026-01-13T14:22:53Z

Greptile Overview

Greptile Summary

This PR significantly improves video tutorial documentation by expanding the deduplication guide, removing deprecated resource fields (nvdecs/nvencs), and fixing broken cross-reference labels throughout the documentation.

Major changes:

- Expanded docs/curate-video/process-data/dedup.md with comprehensive SemanticDeduplicationWorkflow examples, parameter documentation, and practical tips for eps selection
- Removed deprecated nvdecs/nvencs fields from Resources class documentation across multiple files
- Fixed broken cross-reference labels (e.g., reference-infrastructure-container-environments-main → reference-infrastructure-container-environments)
- Improved formatting and organization in video loading documentation

Known issues (already flagged in previous review threads):

- Line 74 in dedup.md: Incorrect API usage. workflow.run(executor) passes executor as kmeans_executor, but the workflow requires kmeans_executor to be a RayActorPoolExecutor. Should use workflow.run() with defaults or workflow.run(pairwise_executor=executor).
- Line 306 in dedup.md: The "Removing Duplicates" section ends incomplete with "The removal process depends on how you want to persist and shard your data:" but provides no follow-up content.

Confidence Score: 4/5

- Safe to merge after addressing the two known documentation issues in dedup.md.
- The score reflects that this is primarily a documentation improvement PR with good changes (removing deprecated fields, fixing cross-references, expanding tutorials), but it contains two issues in the main dedup.md file that were identified in previous review threads and should be fixed before merge.
- docs/curate-video/process-data/dedup.md requires attention for the API usage correction and for completing the final section.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/curate-video/process-data/dedup.md | Major expansion of dedup documentation with workflow examples, but contains incorrect API usage and incomplete section at end |
| docs/curate-video/load-data/index.md | Improved formatting and structure of video loading documentation with better organization |
| .github/copilot-instructions.md | Removed deprecated nvdecs/nvencs resource fields from Resources class documentation |
| api-design.md | Removed deprecated nvdecs/nvencs resource fields from Resources class specification |

Sequence Diagram

sequenceDiagram
    participant User
    participant Docs as Documentation
    participant Workflow as SemanticDeduplicationWorkflow
    participant KMeans as KMeansStage
    participant Pairwise as PairwiseStage
    participant Identify as IdentifyDuplicatesStage
    
    Note over User,Docs: This PR improves video tutorial documentation
    
    User->>Docs: Read dedup documentation
    Note over Docs: Added comprehensive workflow examples
    Note over Docs: Expanded parameter documentation
    Note over Docs: Added tips for eps selection
    
    Note over User,Identify: Example workflow in updated docs
    User->>Workflow: Create SemanticDeduplicationWorkflow
    User->>Workflow: workflow.run(executor)
    Note right of Workflow: ⚠️ ISSUE: Incorrect API usage<br/>Should use named parameters:<br/>kmeans_executor, pairwise_executor
    
    Workflow->>KMeans: Run K-means clustering
    KMeans-->>Workflow: Clustered embeddings
    Workflow->>Pairwise: Compute pairwise similarity
    Pairwise-->>Workflow: Similarity scores
    Workflow->>Identify: Identify duplicates
    Identify-->>Workflow: Duplicate IDs
    Workflow-->>User: Results
    
    Note over User,Docs: Documentation also updates:
    Note over Docs: Removed nvdecs/nvencs from Resources
    Note over Docs: Fixed cross-reference labels
    Note over Docs: Improved formatting and organization

greptile-apps

2 files reviewed, 2 comments

greptile-apps · 2026-01-13T14:22:57Z

docs/curate-video/process-data/dedup.md

+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)


Incorrect API usage: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None), but this example passes a single executor parameter.

When you call workflow.run(executor), the executor is bound to the first positional parameter, kmeans_executor. However, the SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of the workflow.py implementation). Passing XennaExecutor() will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."

Fix: Update the example to correctly specify both executors or rely on the defaults:

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run the workflow (uses default executors: RayActorPoolExecutor for K-means, XennaExecutor for pairwise)
+results = workflow.run()
+# Or explicitly specify executors:
+# from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+# results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=XennaExecutor())

Alternatively, if you want to keep showing an explicit executor call, the documentation should clarify the correct parameter names and requirements.
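The positional-binding problem described in this comment can be reproduced without NeMo Curator installed. The following is a minimal sketch, assuming only the run() signature and the error message quoted above. The classes are hypothetical stand-ins, not the library's implementations, and the exception type (ValueError) is an assumption, since the review quotes only the message.

```python
# Hypothetical stand-ins that mimic only what the review quotes.
# These are NOT the NeMo Curator classes.
class BaseExecutor:
    pass


class RayActorPoolExecutor(BaseExecutor):
    pass


class XennaExecutor(BaseExecutor):
    pass


class ToyWorkflow:
    """Toy workflow with the same run() parameter order as quoted in the review."""

    def run(self, kmeans_executor=None, pairwise_executor=None):
        # Fall back to the defaults described in the review comment.
        if kmeans_executor is None:
            kmeans_executor = RayActorPoolExecutor()
        if pairwise_executor is None:
            pairwise_executor = XennaExecutor()
        if not isinstance(kmeans_executor, RayActorPoolExecutor):
            # Exception type is an assumption; only the message comes from the review.
            raise ValueError("kmeans_executor must be an instance of RayActorPoolExecutor.")
        return {
            "kmeans": type(kmeans_executor).__name__,
            "pairwise": type(pairwise_executor).__name__,
        }


workflow = ToyWorkflow()

# Defaults: no arguments needed.
print(workflow.run())

# Keyword argument: the executor lands on the intended parameter.
print(workflow.run(pairwise_executor=XennaExecutor()))

# Positional argument: the executor is bound to kmeans_executor and rejected.
try:
    workflow.run(XennaExecutor())
except ValueError as err:
    print(f"rejected: {err}")
```

Using keyword arguments in the documented example would make the binding explicit and avoid depending on parameter order.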

greptile-apps · 2026-01-13T14:22:58Z

docs/curate-video/process-data/dedup.md

+
+The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.
+
+**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:


Incomplete section: The "Removing Duplicates" section ends with a colon but provides no follow-up content. The sentence reads: "The removal process depends on how you want to persist and shard your data:" but nothing follows.

This appears to be incomplete documentation that should explain how users can actually filter out the duplicate IDs when exporting their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).
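As an illustration of the kind of follow-up content this comment asks for, here is a minimal sketch of excluding duplicate IDs before export. It uses plain Python so it runs anywhere; the record layout, the `drop_duplicates` helper, and the sample IDs are hypothetical. In practice the IDs would first be read from the Parquet files in the duplicates directory (for example with pandas' `read_parquet`), which is not shown here.

```python
from typing import Iterable


def drop_duplicates(records: Iterable[dict], duplicate_ids: Iterable[str],
                    id_key: str = "id") -> list[dict]:
    """Return the records whose ID is not in the duplicate set.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    duplicates = set(duplicate_ids)  # set lookup keeps the filter O(1) per record
    return [record for record in records if record[id_key] not in duplicates]


# Hypothetical clip metadata and duplicate IDs (the latter would come from
# the single "id" column of the Parquet output described above).
clips = [
    {"id": "clip-0", "path": "a.mp4"},
    {"id": "clip-1", "path": "b.mp4"},
    {"id": "clip-2", "path": "c.mp4"},
]
duplicate_ids = ["clip-1"]

kept = drop_duplicates(clips, duplicate_ids)
print([clip["id"] for clip in kept])  # ['clip-0', 'clip-2']
```

The same anti-join can be applied at whatever point the final dataset is written, so the choice of sharding or export format does not change the filtering logic.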

greptile-apps · 2026-01-13T14:26:16Z

Greptile Overview

Greptile Summary

Summary

This PR improves video curation documentation by expanding the semantic deduplication guide and updating various cross-references.

Key Changes:

Semantic Deduplication (dedup.md) - Major expansion:
- Added new SemanticDeduplicationWorkflow end-to-end example with comprehensive parameter documentation
- Expanded "Individual Stages" example with detailed pipeline construction
- Added two helpful tips: one for determining the eps parameter through exploration, and another for custom ranking with metadata columns
- Completed previously truncated parameter tables for all stages (KMeansStage, PairwiseStage, IdentifyDuplicatesStage, SemanticDeduplicationWorkflow)
- Changed title from "Duplicate Removal" to "Duplicate Identification" (more accurate, as these stages only identify duplicates)
- Added new "Removing Duplicates" section clarifying user responsibility for actual removal
Reference Link Updates:
- Fixed reference in index.md from custom stage tutorial to execution backends documentation (more relevant to the context)
Documentation Restructuring:
- Reorganized "How it Works" section in load-data/index.md for improved readability
- Reordered sentences in frame-extraction.md for better logical flow

Issue Found:

The new SemanticDeduplicationWorkflow example contains a critical bug where workflow.run(executor) passes the executor to the wrong parameter position, which will cause a runtime error. The workflow requires RayActorPoolExecutor for k-means but the example passes XennaExecutor as the first positional argument.

Confidence Score: 3/5

This PR is mostly safe but contains one critical documentation bug that will cause runtime errors for users following the example code
The PR makes valuable improvements to documentation structure and adds comprehensive new content for semantic deduplication. However, the main "Single Step Workflow" example in dedup.md passes the executor incorrectly to workflow.run(), which will cause a runtime error when users follow the example. The remaining changes (reference updates, restructuring) are solid improvements with no issues. The score of 3 reflects that most changes are good, but the bug in the most prominent example needs to be fixed before merge.
docs/curate-video/process-data/dedup.md requires attention - the SemanticDeduplicationWorkflow example needs correction before users encounter runtime errors

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| docs/curate-video/index.md | 5/5 | Updated reference link from custom stage tutorial to execution backends documentation; correct and improves navigation |
| docs/curate-video/load-data/index.md | 5/5 | Restructured "How it Works" section for better readability and removed redundant subsections; improves documentation clarity |
| docs/curate-video/process-data/frame-extraction.md | 5/5 | Reordered sentences in "Before You Start" section for better logical flow; minor improvement |
| docs/curate-video/process-data/dedup.md | 2/5 | Major expansion with new SemanticDeduplicationWorkflow examples and parameter tables, but contains critical bug in workflow.run() usage that will cause runtime error |

Sequence Diagram

sequenceDiagram
    participant User
    participant SemanticDeduplicationWorkflow
    participant KMeansStage
    participant PairwiseStage
    participant IdentifyDuplicatesStage
    participant RayActorPoolExecutor
    participant XennaExecutor

    User->>SemanticDeduplicationWorkflow: Initialize with input_path, n_clusters, eps, etc.
    User->>SemanticDeduplicationWorkflow: run(kmeans_executor, pairwise_executor)
    
    Note over SemanticDeduplicationWorkflow: Setup directories<br/>(kmeans_results, pairwise_results, duplicates)
    
    SemanticDeduplicationWorkflow->>KMeansStage: Create stage with n_clusters, embedding_dim, etc.
    SemanticDeduplicationWorkflow->>RayActorPoolExecutor: execute(KMeansStage)
    RayActorPoolExecutor->>KMeansStage: Process embeddings
    KMeansStage-->>SemanticDeduplicationWorkflow: Clustered embeddings with centroid distances
    
    SemanticDeduplicationWorkflow->>PairwiseStage: Create stage with ranking_strategy, batch_size
    SemanticDeduplicationWorkflow->>XennaExecutor: execute(PairwiseStage)
    XennaExecutor->>PairwiseStage: Compute within-cluster similarity
    PairwiseStage-->>SemanticDeduplicationWorkflow: Pairwise similarity scores
    
    alt eps is specified
        SemanticDeduplicationWorkflow->>IdentifyDuplicatesStage: Create stage with eps threshold
        SemanticDeduplicationWorkflow->>XennaExecutor: execute(IdentifyDuplicatesStage)
        XennaExecutor->>IdentifyDuplicatesStage: Filter pairs where cosine_sim >= 1.0 - eps
        IdentifyDuplicatesStage-->>SemanticDeduplicationWorkflow: Parquet files with duplicate IDs
    end
    
    SemanticDeduplicationWorkflow-->>User: Results dict with timing, duplicate counts
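The threshold step in the diagram (flag pairs where cosine_sim >= 1.0 - eps) can be explored with a small sketch like the one below. The similarity values and helper names are hypothetical; the only thing taken from the review is the comparison itself. Trying several eps values this way shows how quickly the number of flagged clips grows as eps increases.

```python
def is_duplicate(cosine_sim: float, eps: float) -> bool:
    """Apply the rule from the diagram: duplicate when cosine_sim >= 1.0 - eps."""
    return cosine_sim >= 1.0 - eps


def count_flagged(similarities: list[float], eps: float) -> int:
    """Count how many similarity scores would be flagged at a given eps."""
    return sum(is_duplicate(sim, eps) for sim in similarities)


# Hypothetical scores: each is a clip's highest cosine similarity
# to another clip in the same cluster.
max_sims = [0.999, 0.97, 0.91, 0.80, 0.55]

for eps in (0.005, 0.05, 0.15):
    flagged = count_flagged(max_sims, eps)
    print(f"eps={eps}: threshold={1.0 - eps:.3f}, flagged={flagged}")
```

A smaller eps is stricter (only near-identical clips are flagged), while a larger eps removes more aggressively.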

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-13T14:26:19Z

docs/curate-video/process-data/dedup.md

+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)


The example shows passing a single executor to workflow.run(), but the actual method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None).

When you pass executor as the first positional argument, it will be assigned to kmeans_executor, not pairwise_executor. However, the workflow explicitly requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of nemo_curator/stages/deduplication/semantic/workflow.py). Passing XennaExecutor() as kmeans_executor will cause a runtime error.

The correct usage should be:

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run with XennaExecutor (GPU-accelerated)
+from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+kmeans_executor = RayActorPoolExecutor()
+pairwise_executor = XennaExecutor()
+results = workflow.run(kmeans_executor, pairwise_executor)

Or, to use defaults (which automatically set kmeans to RayActorPoolExecutor and pairwise to XennaExecutor):

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run with default executors (RayActorPoolExecutor for kmeans, XennaExecutor for pairwise)
+results = workflow.run()

Signed-off-by: Ao Tang <aot@nvidia.com>

suiyoubi · 2026-01-13T20:04:51Z

/ok to test 05c5b8b

praateekmahajan and others added 3 commits November 19, 2025 12:57

..

ef7c3c1

Signed-off-by: Praateek <praateekm@gmail.com>

Merge branch 'main' into praateek/video-docs-update

7c687e8

fix closing directives

dbf6442

Signed-off-by: Ao Tang <aot@nvidia.com>

Merge branch 'main' into pr-1248

c8f7aa9

greptile-apps bot reviewed Jan 13, 2026


fixing directives

05c5b8b

Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot bot temporarily deployed to test January 13, 2026 20:05 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 13, 2026 20:05 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 4, 2026 20:35 Inactive

suiyoubi merged commit b0f8ef2 into main Feb 4, 2026
49 checks passed


suiyoubi merged 11 commits into main from pr-1248


3 participants

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run the workflow (uses default executors: RayActorPoolExecutor for K-means, XennaExecutor for pairwise)
+results = workflow.run()
+# Or explicitly specify executors:
+# from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+# results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=XennaExecutor())


