
Video tutorial improve #1367

Merged
suiyoubi merged 11 commits into main from pr-1248
Feb 4, 2026

Conversation

@suiyoubi
Contributor

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

praateekmahajan and others added 3 commits November 19, 2025 12:57
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@suiyoubi
Contributor Author

/ok to test c8f7aa9

@greptile-apps
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Overview

Greptile Summary

This PR significantly improves video tutorial documentation by expanding the deduplication guide, removing deprecated resource fields (nvdecs/nvencs), and fixing broken cross-reference labels throughout the documentation.

Major changes:

  • Expanded docs/curate-video/process-data/dedup.md with comprehensive SemanticDeduplicationWorkflow examples, parameter documentation, and practical tips for eps selection
  • Removed deprecated nvdecs/nvencs fields from Resources class documentation across multiple files
  • Fixed broken cross-reference labels (e.g., reference-infrastructure-container-environments-main → reference-infrastructure-container-environments)
  • Improved formatting and organization in video loading documentation

Known issues (already flagged in previous review threads):

  • Line 74 in dedup.md: Incorrect API usage - workflow.run(executor) passes executor as kmeans_executor, but the workflow requires kmeans_executor to be a RayActorPoolExecutor. Should use workflow.run() with defaults or workflow.run(pairwise_executor=executor).
  • Line 306 in dedup.md: "Removing Duplicates" section ends incomplete with "The removal process depends on how you want to persist and shard your data:" but provides no follow-up content.

Confidence Score: 4/5

  • Safe to merge after addressing the two known documentation issues in dedup.md
  • Score reflects that this is primarily a documentation improvement PR with good changes (removing deprecated fields, fixing cross-references, expanding tutorials), but contains two issues in the main dedup.md file that have been identified in previous review threads and should be fixed before merge
  • docs/curate-video/process-data/dedup.md requires attention for API usage correction and completing the final section

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/curate-video/process-data/dedup.md | Major expansion of dedup documentation with workflow examples, but contains incorrect API usage and an incomplete section at the end |
| docs/curate-video/load-data/index.md | Improved formatting and structure of video loading documentation with better organization |
| .github/copilot-instructions.md | Removed deprecated nvdecs/nvencs resource fields from Resources class documentation |
| api-design.md | Removed deprecated nvdecs/nvencs resource fields from Resources class specification |

Sequence Diagram

sequenceDiagram
    participant User
    participant Docs as Documentation
    participant Workflow as SemanticDeduplicationWorkflow
    participant KMeans as KMeansStage
    participant Pairwise as PairwiseStage
    participant Identify as IdentifyDuplicatesStage
    
    Note over User,Docs: This PR improves video tutorial documentation
    
    User->>Docs: Read dedup documentation
    Note over Docs: Added comprehensive workflow examples
    Note over Docs: Expanded parameter documentation
    Note over Docs: Added tips for eps selection
    
    Note over User,Identify: Example workflow in updated docs
    User->>Workflow: Create SemanticDeduplicationWorkflow
    User->>Workflow: workflow.run(executor)
    Note right of Workflow: ⚠️ ISSUE: Incorrect API usage<br/>Should use named parameters:<br/>kmeans_executor, pairwise_executor
    
    Workflow->>KMeans: Run K-means clustering
    KMeans-->>Workflow: Clustered embeddings
    Workflow->>Pairwise: Compute pairwise similarity
    Pairwise-->>Workflow: Similarity scores
    Workflow->>Identify: Identify duplicates
    Identify-->>Workflow: Duplicate IDs
    Workflow-->>User: Results
    
    Note over User,Docs: Documentation also updates:
    Note over Docs: Removed nvdecs/nvencs from Resources
    Note over Docs: Fixed cross-reference labels
    Note over Docs: Improved formatting and organization

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Comment on lines +73 to +75
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

Incorrect API usage: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None), but this example passes a single executor parameter.

When you call workflow.run(executor), the executor is bound to the first parameter, kmeans_executor. However, the SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of the workflow.py implementation). Passing XennaExecutor() will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."

Fix: Update the example to correctly specify both executors or rely on the defaults:

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run the workflow (uses default executors: RayActorPoolExecutor for K-means, XennaExecutor for pairwise)
+ results = workflow.run()
+ # Or explicitly specify executors:
+ # from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ # results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=XennaExecutor())

Alternatively, if you want to keep showing an explicit executor call, the documentation should clarify the correct parameter names and requirements.
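
To make the positional-binding problem concrete, here is a minimal sketch using stand-in classes. Everything in it is hypothetical apart from the parameter names and the error message quoted in this thread: `StubWorkflow` mirrors only the `run()` signature, the executor classes are empty placeholders rather than the NeMo Curator ones, and the exception type (`TypeError`) is an assumption, since the thread does not confirm which exception the real check raises.

```python
from typing import Optional


# Stand-in classes for illustration only; NOT the NeMo Curator implementations.
class BaseExecutor:
    pass


class RayActorPoolExecutor(BaseExecutor):
    pass


class XennaExecutor(BaseExecutor):
    pass


class StubWorkflow:
    """Hypothetical stub that mirrors only the run() signature quoted above."""

    def run(
        self,
        kmeans_executor: Optional[BaseExecutor] = None,
        pairwise_executor: Optional[BaseExecutor] = None,
    ) -> dict:
        # Fall back to the defaults described in the review comment.
        if kmeans_executor is None:
            kmeans_executor = RayActorPoolExecutor()
        if pairwise_executor is None:
            pairwise_executor = XennaExecutor()
        # Assumed check; the exception type is a guess, the message is quoted.
        if not isinstance(kmeans_executor, RayActorPoolExecutor):
            raise TypeError(
                "kmeans_executor must be an instance of RayActorPoolExecutor."
            )
        return {
            "kmeans": type(kmeans_executor).__name__,
            "pairwise": type(pairwise_executor).__name__,
        }


workflow = StubWorkflow()
print(workflow.run())  # both defaults
print(workflow.run(pairwise_executor=XennaExecutor()))  # keyword binds as intended
try:
    workflow.run(XennaExecutor())  # positional: binds to kmeans_executor
except TypeError as err:
    print(f"rejected: {err}")
```

Passing executors by keyword keeps the call correct even if the parameter order changes, which is why the keyword form may be the safer choice for documentation examples.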


The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.

**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:

Incomplete section: The "Removing Duplicates" section ends with a colon but provides no follow-up content. The sentence reads: "The removal process depends on how you want to persist and shard your data:" but nothing follows.

This appears to be incomplete documentation that should explain how users can actually filter out the duplicate IDs when exporting their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).
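
As one possible shape for such an example, the sketch below filters clip records against the duplicate IDs. It relies only on what the quoted text states: the duplicates directory holds Parquet files with a single `id` column. The helper names (`load_duplicate_ids`, `drop_duplicates`), the record layout, and the sample IDs are hypothetical, and the loader assumes pandas with a Parquet engine such as pyarrow is installed.

```python
from typing import Iterable, List, Set


def load_duplicate_ids(duplicates_dir: str) -> Set[str]:
    """Collect the `id` column from the Parquet files under duplicates_dir.

    Hypothetical helper: pandas is imported lazily so the pure-Python filter
    below works without it.
    """
    import pandas as pd

    frame = pd.read_parquet(duplicates_dir, columns=["id"])
    return set(frame["id"].tolist())


def drop_duplicates(
    clips: Iterable[dict], duplicate_ids: Set[str], id_key: str = "id"
) -> List[dict]:
    """Return the clip records whose ID is not in duplicate_ids."""
    return [clip for clip in clips if clip[id_key] not in duplicate_ids]


# Illustrative records; a real export would read these from clip metadata.
clips = [
    {"id": "clip-001", "path": "a.mp4"},
    {"id": "clip-002", "path": "b.mp4"},
    {"id": "clip-003", "path": "c.mp4"},
]
kept = drop_duplicates(clips, duplicate_ids={"clip-002"})
print([clip["id"] for clip in kept])
```

In practice the ID set would come from something like `load_duplicate_ids("output_path/duplicates/")`, and the same membership test could be applied while writing shards instead of on an in-memory list.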

@greptile-apps
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Overview

Greptile Summary

Summary

This PR improves video curation documentation by expanding the semantic deduplication guide and updating various cross-references.

Key Changes:

  1. Semantic Deduplication (dedup.md) - Major expansion:

    • Added new SemanticDeduplicationWorkflow end-to-end example with comprehensive parameter documentation
    • Expanded "Individual Stages" example with detailed pipeline construction
    • Added two helpful tips: one for determining the eps parameter through exploration, and another for custom ranking with metadata columns
    • Completed previously truncated parameter tables for all stages (KMeansStage, PairwiseStage, IdentifyDuplicatesStage, SemanticDeduplicationWorkflow)
    • Changed title from "Duplicate Removal" to "Duplicate Identification" (more accurate, as these stages only identify duplicates)
    • Added new "Removing Duplicates" section clarifying user responsibility for actual removal
  2. Reference Link Updates:

    • Fixed reference in index.md from custom stage tutorial to execution backends documentation (more relevant to the context)
  3. Documentation Restructuring:

    • Reorganized "How it Works" section in load-data/index.md for improved readability
    • Reordered sentences in frame-extraction.md for better logical flow

Issue Found:

The new SemanticDeduplicationWorkflow example contains a critical bug where workflow.run(executor) passes the executor to the wrong parameter position, which will cause a runtime error. The workflow requires RayActorPoolExecutor for k-means but the example passes XennaExecutor as the first positional argument.

Confidence Score: 3/5

  • This PR is mostly safe but contains one critical documentation bug that will cause runtime errors for users following the example code
  • The PR makes valuable improvements to documentation structure and adds comprehensive new content for semantic deduplication. However, the main "Single Step Workflow" example in dedup.md has a critical bug where the executor is passed incorrectly to workflow.run(), which will cause a TypeError at runtime when users try to follow the example. The rest of the changes (reference updates, restructuring) are solid improvements with no issues. Score of 3 reflects that most changes are good, but the critical bug in the most prominent example needs to be fixed before merge.
  • docs/curate-video/process-data/dedup.md requires attention - the SemanticDeduplicationWorkflow example needs correction before users encounter runtime errors

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| docs/curate-video/index.md | 5/5 | Updated reference link from custom stage tutorial to execution backends documentation - correct and improves navigation |
| docs/curate-video/load-data/index.md | 5/5 | Restructured "How it Works" section for better readability, removed redundant subsections - improves documentation clarity |
| docs/curate-video/process-data/frame-extraction.md | 5/5 | Reordered sentences in "Before You Start" section for better logical flow - minor improvement |
| docs/curate-video/process-data/dedup.md | 2/5 | Major expansion with new SemanticDeduplicationWorkflow examples and parameter tables, but contains critical bug in workflow.run() usage that will cause runtime error |

Sequence Diagram

sequenceDiagram
    participant User
    participant SemanticDeduplicationWorkflow
    participant KMeansStage
    participant PairwiseStage
    participant IdentifyDuplicatesStage
    participant RayActorPoolExecutor
    participant XennaExecutor

    User->>SemanticDeduplicationWorkflow: Initialize with input_path, n_clusters, eps, etc.
    User->>SemanticDeduplicationWorkflow: run(kmeans_executor, pairwise_executor)
    
    Note over SemanticDeduplicationWorkflow: Setup directories<br/>(kmeans_results, pairwise_results, duplicates)
    
    SemanticDeduplicationWorkflow->>KMeansStage: Create stage with n_clusters, embedding_dim, etc.
    SemanticDeduplicationWorkflow->>RayActorPoolExecutor: execute(KMeansStage)
    RayActorPoolExecutor->>KMeansStage: Process embeddings
    KMeansStage-->>SemanticDeduplicationWorkflow: Clustered embeddings with centroid distances
    
    SemanticDeduplicationWorkflow->>PairwiseStage: Create stage with ranking_strategy, batch_size
    SemanticDeduplicationWorkflow->>XennaExecutor: execute(PairwiseStage)
    XennaExecutor->>PairwiseStage: Compute within-cluster similarity
    PairwiseStage-->>SemanticDeduplicationWorkflow: Pairwise similarity scores
    
    alt eps is specified
        SemanticDeduplicationWorkflow->>IdentifyDuplicatesStage: Create stage with eps threshold
        SemanticDeduplicationWorkflow->>XennaExecutor: execute(IdentifyDuplicatesStage)
        XennaExecutor->>IdentifyDuplicatesStage: Filter pairs where cosine_sim >= 1.0 - eps
        IdentifyDuplicatesStage-->>SemanticDeduplicationWorkflow: Parquet files with duplicate IDs
    end
    
    SemanticDeduplicationWorkflow-->>User: Results dict with timing, duplicate counts
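
The threshold rule in the diagram (`cosine_sim >= 1.0 - eps`) can be illustrated with a small, self-contained sketch. This is not the library's implementation: the function names and the toy 2-D embeddings are hypothetical, and it compares every pair directly, whereas the diagram describes similarity being computed within clusters. Sweeping a few `eps` values, as in the last loop, is one simple way to explore how the threshold changes the number of flagged pairs.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flag_duplicate_pairs(
    embeddings: Dict[str, Sequence[float]], eps: float
) -> List[Tuple[str, str]]:
    """Return ID pairs whose cosine similarity is at least 1.0 - eps."""
    threshold = 1.0 - eps
    return [
        (id_a, id_b)
        for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2)
        if cosine_similarity(vec_a, vec_b) >= threshold
    ]


# Toy embeddings: "a" and "b" point in nearly the same direction, "c" does not.
embeddings = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}

# A larger eps lowers the threshold, so more pairs count as duplicates.
for eps in (0.001, 0.01, 0.1):
    print(eps, flag_duplicate_pairs(embeddings, eps))
```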

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines +73 to +75
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

The example shows passing a single executor to workflow.run(), but the actual method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None).

When you pass executor as the first positional argument, it will be assigned to kmeans_executor, not pairwise_executor. However, the workflow explicitly requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of nemo_curator/stages/deduplication/semantic/workflow.py). Passing XennaExecutor() as kmeans_executor will cause a runtime error.

The correct usage should be:

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with XennaExecutor (GPU-accelerated)
+ from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ kmeans_executor = RayActorPoolExecutor()
+ pairwise_executor = XennaExecutor()
+ results = workflow.run(kmeans_executor, pairwise_executor)

Or, to use defaults (which automatically set kmeans to RayActorPoolExecutor and pairwise to XennaExecutor):

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with default executors (RayActorPoolExecutor for kmeans, XennaExecutor for pairwise)
+ results = workflow.run()

Signed-off-by: Ao Tang <aot@nvidia.com>
@suiyoubi
Contributor Author

/ok to test 05c5b8b

@suiyoubi suiyoubi merged commit b0f8ef2 into main Feb 4, 2026
49 checks passed