Greptile Overview
Greptile Summary
This PR updates the executor documentation to remove maturity labels and reorganize content, promoting
However, the documentation has an inconsistency: while the changes suggest
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Docs as Documentation
participant ReleaseNotes as Release Notes
participant ExecutionBackends as Execution Backends Guide
User->>Docs: Read about executors
Docs->>ExecutionBackends: Navigate to execution backends
Note over ExecutionBackends: XennaExecutor section<br/>Clear import example provided
ExecutionBackends-->>User: Shows XennaExecutor usage
Note over ExecutionBackends: RayActorPoolExecutor section<br/>No import example shown
ExecutionBackends-->>User: Shows conceptual usage only
User->>ExecutionBackends: Try to use RayActorPoolExecutor
Note over User: Confusion: Where to import from?<br/>nemo_curator.backends.ray_actor_pool?<br/>nemo_curator.backends.experimental.ray_actor_pool?
User->>ReleaseNotes: Check architecture diagram
Note over ReleaseNotes: Diagram shows all executors<br/>without maturity labels
ReleaseNotes-->>User: No clarity on experimental status
Note over User: Import fails at runtime<br/>Needs to check source code
### `RayActorPoolExecutor`

`RayDataExecutor` uses Ray Data, a scalable data processing library built on Ray Core. Ray Data provides a familiar DataFrame-like API for distributed data transformations. This executor is experimental and best suited for large-scale batch processing tasks that benefit from Ray Data's optimized data loading and transformation pipelines.

`RayActorPoolExecutor` uses Ray's ActorPool for efficient distributed processing with fine-grained resource management. This executor creates pools of Ray actors per stage, enabling better load balancing and fault tolerance through Ray's native mechanisms. Deduplication workflows automatically use this executor for GPU-accelerated stages.

**Key Features**:
- **Ray Data API**: Leverages Ray Data's optimized data processing primitives
- **Scalable transformations**: Efficient map-batch operations across distributed workers
- **Experimental status**: API and performance characteristics may change
- **ActorPool-based execution**: Creates dedicated actor pools per stage for optimal resource utilization
- **Load balancing**: Uses `map_unordered` for efficient work distribution across actors
- **RAFT support**: Native integration with [RAFT](https://github.com/rapidsai/raft) (RAPIDS Analytics Framework Toolbox) for GPU-accelerated clustering and nearest-neighbor operations
- **Head node exclusion**: Optional `ignore_head_node` parameter to reserve the Ray cluster's [head node](https://docs.ray.io/en/latest/cluster/key-concepts.html#head-node) for coordination tasks only
missing import example for RayActorPoolExecutor
Unlike XennaExecutor and RayDataExecutor, there's no code example showing how to import RayActorPoolExecutor. Based on the codebase, the correct import is:
`from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor`

Consider adding an import example here for consistency and to help users.
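
Until the docs settle on a single import path, one possible stopgap is to try both locations named in this review and use whichever one imports. The sketch below is only an illustration: the helper `import_first_available` and the tuple `CANDIDATE_MODULES` are hypothetical names, not part of `nemo_curator`, and which of the two module paths actually exists depends on the installed version.

```python
import importlib
from typing import Any, Sequence


def import_first_available(module_paths: Sequence[str], attr: str) -> Any:
    """Return `attr` from the first module in `module_paths` that provides it."""
    errors = []
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{path}: {exc}")
    raise ImportError(f"could not import {attr!r}; tried: " + "; ".join(errors))


# Candidate locations taken from the review discussion above. Whether either
# exists in a given nemo_curator release is an assumption, not verified here.
CANDIDATE_MODULES = (
    "nemo_curator.backends.experimental.ray_actor_pool",
    "nemo_curator.backends.ray_actor_pool",
)
```

With the package installed, `RayActorPoolExecutor = import_first_available(CANDIDATE_MODULES, "RayActorPoolExecutor")` would bind the class from whichever path resolves first, and raise an `ImportError` listing both attempts otherwise. A documented single import line in the guide would still be the cleaner fix.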