Signed-off-by: Ao Tang <aot@nvidia.com>
/ok to test c8f7aa9
Greptile Overview
Greptile Summary
This PR significantly improves video tutorial documentation by expanding the deduplication guide, removing deprecated resource fields (
Major changes:
Known issues (already flagged in previous review threads):
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Docs as Documentation
participant Workflow as SemanticDeduplicationWorkflow
participant KMeans as KMeansStage
participant Pairwise as PairwiseStage
participant Identify as IdentifyDuplicatesStage
Note over User,Docs: This PR improves video tutorial documentation
User->>Docs: Read dedup documentation
Note over Docs: Added comprehensive workflow examples
Note over Docs: Expanded parameter documentation
Note over Docs: Added tips for eps selection
Note over User,Identify: Example workflow in updated docs
User->>Workflow: Create SemanticDeduplicationWorkflow
User->>Workflow: workflow.run(executor)
Note right of Workflow: ⚠️ ISSUE: Incorrect API usage<br/>Should use named parameters:<br/>kmeans_executor, pairwise_executor
Workflow->>KMeans: Run K-means clustering
KMeans-->>Workflow: Clustered embeddings
Workflow->>Pairwise: Compute pairwise similarity
Pairwise-->>Workflow: Similarity scores
Workflow->>Identify: Identify duplicates
Identify-->>Workflow: Duplicate IDs
Workflow-->>User: Results
Note over User,Docs: Documentation also updates:
Note over Docs: Removed nvdecs/nvencs from Resources
Note over Docs: Fixed cross-reference labels
Note over Docs: Improved formatting and organization
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
Incorrect API usage: the signature of workflow.run() is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None), but this example passes a single positional executor.
Calling workflow.run(executor) binds executor to kmeans_executor. SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of workflow.py), so passing XennaExecutor() fails at runtime with: "kmeans_executor must be an instance of RayActorPoolExecutor."
Fix: Update the example to correctly specify both executors or rely on the defaults:
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run the workflow (uses default executors: RayActorPoolExecutor for K-means, XennaExecutor for pairwise)
+ results = workflow.run()
+ # Or explicitly specify executors:
+ # from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ # results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=XennaExecutor())
Alternatively, if you want to keep showing an explicit executor call, the documentation should clarify the correct parameter names and requirements.
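To make the positional-binding problem concrete, here is a minimal sketch that uses stand-in classes rather than the real NeMo Curator code. `ToyWorkflow` and the empty executor classes are hypothetical; only the `run()` parameter names, the defaults, and the error message come from the comment above, and the exception type (`TypeError`) is an assumption.

```python
from typing import Dict, Optional


class BaseExecutor:
    """Stand-in base class, for illustration only."""


class RayActorPoolExecutor(BaseExecutor):
    """Stand-in for the real executor of the same name."""


class XennaExecutor(BaseExecutor):
    """Stand-in for the real executor of the same name."""


class ToyWorkflow:
    """Hypothetical workflow that mirrors only the run() signature quoted above."""

    def run(
        self,
        kmeans_executor: Optional[BaseExecutor] = None,
        pairwise_executor: Optional[BaseExecutor] = None,
    ) -> Dict[str, str]:
        # Defaults as described in the review comment.
        if kmeans_executor is None:
            kmeans_executor = RayActorPoolExecutor()
        if pairwise_executor is None:
            pairwise_executor = XennaExecutor()
        # The exception type is an assumption; the message is quoted from the review.
        if not isinstance(kmeans_executor, RayActorPoolExecutor):
            raise TypeError("kmeans_executor must be an instance of RayActorPoolExecutor.")
        return {
            "kmeans": type(kmeans_executor).__name__,
            "pairwise": type(pairwise_executor).__name__,
        }


workflow = ToyWorkflow()
workflow.run()  # both defaults apply
workflow.run(pairwise_executor=XennaExecutor())  # keyword argument reaches the intended slot
# workflow.run(XennaExecutor()) would bind to kmeans_executor and raise
```

Passing executors by keyword avoids depending on parameter order, which is why the suggested fix names both arguments.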
The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.

**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:
Incomplete section: The "Removing Duplicates" section ends with a colon but provides no follow-up content. The sentence reads: "The removal process depends on how you want to persist and shard your data:" but nothing follows.
This appears to be incomplete documentation that should explain how users can actually filter out the duplicate IDs when exporting their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).
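As one possible shape for such an example, the sketch below shows only the filtering step, in plain Python. The function name `drop_duplicate_records` and the record layout are hypothetical; the only detail taken from the documentation is that the duplicate files hold a single `id` column.

```python
from typing import Dict, Iterable, Iterator


def drop_duplicate_records(
    records: Iterable[Dict[str, object]],
    duplicate_ids: Iterable[str],
    id_key: str = "id",
) -> Iterator[Dict[str, object]]:
    """Yield only the records whose ID is not in the duplicate list."""
    # A set gives constant-time membership checks for large ID lists.
    to_remove = set(duplicate_ids)
    for record in records:
        if record[id_key] not in to_remove:
            yield record


# Hypothetical clip metadata and duplicate IDs, for illustration only.
clips = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
kept = list(drop_duplicate_records(clips, ["b"]))
print([clip["id"] for clip in kept])  # prints ['a', 'c']
```

In a real export, the duplicate IDs would presumably be loaded from the Parquet output first, for instance with `pandas.read_parquet("output_path/duplicates/")["id"]`, before filtering whatever records or shards are being written.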
Greptile Overview
Greptile Summary
This PR improves video curation documentation by expanding the semantic deduplication guide and updating various cross-references.
Key Changes:
Issue Found: The new
Confidence Score: 3/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User
participant SemanticDeduplicationWorkflow
participant KMeansStage
participant PairwiseStage
participant IdentifyDuplicatesStage
participant RayActorPoolExecutor
participant XennaExecutor
User->>SemanticDeduplicationWorkflow: Initialize with input_path, n_clusters, eps, etc.
User->>SemanticDeduplicationWorkflow: run(kmeans_executor, pairwise_executor)
Note over SemanticDeduplicationWorkflow: Setup directories<br/>(kmeans_results, pairwise_results, duplicates)
SemanticDeduplicationWorkflow->>KMeansStage: Create stage with n_clusters, embedding_dim, etc.
SemanticDeduplicationWorkflow->>RayActorPoolExecutor: execute(KMeansStage)
RayActorPoolExecutor->>KMeansStage: Process embeddings
KMeansStage-->>SemanticDeduplicationWorkflow: Clustered embeddings with centroid distances
SemanticDeduplicationWorkflow->>PairwiseStage: Create stage with ranking_strategy, batch_size
SemanticDeduplicationWorkflow->>XennaExecutor: execute(PairwiseStage)
XennaExecutor->>PairwiseStage: Compute within-cluster similarity
PairwiseStage-->>SemanticDeduplicationWorkflow: Pairwise similarity scores
alt eps is specified
SemanticDeduplicationWorkflow->>IdentifyDuplicatesStage: Create stage with eps threshold
SemanticDeduplicationWorkflow->>XennaExecutor: execute(IdentifyDuplicatesStage)
XennaExecutor->>IdentifyDuplicatesStage: Filter pairs where cosine_sim >= 1.0 - eps
IdentifyDuplicatesStage-->>SemanticDeduplicationWorkflow: Parquet files with duplicate IDs
end
SemanticDeduplicationWorkflow-->>User: Results dict with timing, duplicate counts
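The `eps` filter noted in the diagram (keep pairs where `cosine_sim >= 1.0 - eps`) can be illustrated with a small standalone sketch. It is not the `IdentifyDuplicatesStage` implementation; the helper names and the plain-list embeddings are assumptions made for the example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_duplicate_pair(u: Sequence[float], v: Sequence[float], eps: float) -> bool:
    """Apply the threshold from the diagram: cosine_sim >= 1.0 - eps."""
    return cosine_similarity(u, v) >= 1.0 - eps


# Identical embeddings have similarity 1.0, so they pass for any eps >= 0.
same = is_duplicate_pair([3.0, 4.0], [3.0, 4.0], eps=0.01)
# Orthogonal embeddings have similarity 0.0, so they fail for any eps < 1.
different = is_duplicate_pair([1.0, 0.0], [0.0, 1.0], eps=0.01)
```

A smaller `eps` raises the similarity bar, so fewer pairs are flagged as duplicates.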
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
The example shows passing a single executor to workflow.run(), but the actual method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None).
When you pass executor as the first positional argument, it will be assigned to kmeans_executor, not pairwise_executor. However, the workflow explicitly requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of nemo_curator/stages/deduplication/semantic/workflow.py). Passing XennaExecutor() as kmeans_executor will cause a runtime error.
The correct usage should be:
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with XennaExecutor (GPU-accelerated)
+ from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ kmeans_executor = RayActorPoolExecutor()
+ pairwise_executor = XennaExecutor()
+ results = workflow.run(kmeans_executor, pairwise_executor)
Or, to use defaults (which automatically set kmeans to RayActorPoolExecutor and pairwise to XennaExecutor):
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with default executors (RayActorPoolExecutor for kmeans, XennaExecutor for pairwise)
+ results = workflow.run()
Signed-off-by: Ao Tang <aot@nvidia.com>
/ok to test 05c5b8b
Description
Usage
# Add snippet demonstrating usage
Checklist