
Add ALM Data Pipeline tutorial and stages #1419

Open
mohammadaaftabv wants to merge 25 commits into NVIDIA-NeMo:main from mohammadaaftabv:alm_data_build

Conversation


@mohammadaaftabv mohammadaaftabv commented Jan 23, 2026

Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

  • ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
  • ALMDataOverlapStage: Filters overlapping windows based on threshold, keeping windows closest to target duration
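As a rough illustration of the quality gates listed above, a self-contained sketch; the field names and defaults (sample rate ≥ 16 kHz, bandwidth ≥ 8 kHz, 2-5 speakers, 120 s target with 10% tolerance, numbers quoted elsewhere in this PR) are assumptions for illustration, not the stage's real API:

```python
# Illustrative sketch of the ALMDataBuilderStage quality filters.
# Names and defaults are assumptions taken from this PR's discussion,
# not the actual stage implementation.
from dataclasses import dataclass


@dataclass
class WindowFilter:
    min_sample_rate: int = 16_000
    min_bandwidth: int = 8_000
    min_speakers: int = 2
    max_speakers: int = 5
    target_duration: float = 120.0
    tolerance: float = 0.1  # accept durations within +/- 10% of target

    def accepts(self, window: dict) -> bool:
        lo = self.target_duration * (1 - self.tolerance)  # 108 s with defaults
        hi = self.target_duration * (1 + self.tolerance)  # 132 s with defaults
        return (
            window["sample_rate"] >= self.min_sample_rate
            and window["bandwidth"] >= self.min_bandwidth
            and self.min_speakers <= window["num_speakers"] <= self.max_speakers
            and lo <= window["duration"] <= hi
        )
```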

Add complete tutorial with:

  • Python CLI (pipeline.py) and Hydra runner (run.py)
  • Sample input data for testing
  • Comprehensive documentation

Tested with sample data:

  • Stage 1 produces 181 windows from 5 input entries
  • Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 25, 2026
@mohammadaaftabv mohammadaaftabv force-pushed the alm_data_build branch 4 times, most recently from 66abf28 to 0125f32 Compare January 29, 2026 09:36
@mohammadaaftabv mohammadaaftabv marked this pull request as ready for review January 29, 2026 09:38

greptile-apps bot commented Jan 29, 2026

Greptile Summary

This PR adds a complete ALM (Audio Language Model) data curation pipeline with two main processing stages, manifest I/O, comprehensive tests, documentation, and benchmarking support.

Key Changes

Core Stages:

  • ALMDataBuilderStage - Creates training windows from audio segments with quality filters (sample rate ≥16kHz, bandwidth ≥8kHz, 2-5 speakers, ~120s duration)
  • ALMDataOverlapStage - Filters overlapping windows based on configurable threshold, keeping windows closest to target duration
  • ALMManifestReader - CompositeStage that reads JSONL manifests line-by-line without Pandas for better memory efficiency
  • ALMManifestWriterStage - Writes processed entries to JSONL output
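The line-by-line manifest reading described above can be sketched as follows; the real ALMManifestReader uses fsspec for S3/GCS support, so the plain open() here is only a local-file stand-in:

```python
# Minimal sketch of line-by-line JSONL reading without Pandas, as the
# summary describes for ALMManifestReader. A stand-in, not the real stage:
# the actual reader uses fsspec so the same code can open s3:// and gs:// paths.
import json
from typing import Any, Iterator


def read_manifest(path: str) -> Iterator[dict[str, Any]]:
    """Yield one dict per non-empty JSONL line, never materializing a DataFrame."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Streaming one entry at a time keeps memory flat even when entries carry deeply nested diarization metadata.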

Tutorial & Documentation:

  • Complete Hydra-based pipeline with YAML configuration (tutorials/audio/alm/)
  • Comprehensive README with usage examples, input/output formats, configuration parameters
  • Sample data fixtures for testing and tutorials

Testing:

  • 38 total tests across 4 test files covering unit and integration scenarios
  • Shared fixtures in conftest.py for consistent test data

Benchmarking:

  • Integration with nightly benchmark framework
  • Comprehensive metrics collection (throughput, window counts, duration)

Architecture

The pipeline follows NeMo Curator patterns by inheriting from LegacySpeechStage and CompositeStage, integrating cleanly with the existing executor framework (Xenna, Ray). The manifest reader uses fsspec for cloud storage support (S3, GCS) and avoids Pandas to prevent memory blow-up with deeply nested audio metadata.

Issues Noted in Previous Reviews

Several issues have been flagged in previous review threads that should be addressed before merge (see "Previous Threads" section). The most critical are tuple order inconsistencies in the overlap stage, brittle test assertions, and threshold semantics that may not match documentation.

Confidence Score: 3/5

  • This PR adds substantial new functionality that is well-tested and documented, but has several code quality and maintainability issues flagged in previous review threads that should be addressed.
  • The core windowing and overlap filtering logic appears sound with 38 comprehensive tests. However, multiple issues identified in previous threads affect code quality: confusing tuple order convention requiring compensating swaps throughout the overlap stage, brittle test assertions with hardcoded golden values, hardcoded bandwidth threshold not using the configurable parameter, and potentially inverted threshold semantics. These don't appear to be critical correctness bugs, but they impact maintainability and could cause issues during future refactoring or when users configure non-default parameters.
  • Pay close attention to nemo_curator/stages/audio/alm/alm_data_overlap.py (tuple order inconsistency), nemo_curator/stages/audio/alm/alm_data_builder.py (hardcoded bandwidth value), and the integration test files with brittle assertions.

Important Files Changed

  • nemo_curator/stages/audio/alm/alm_data_builder.py: Implements window creation from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration). The core windowing logic is sound, but diagnostic code (line 112) hardcodes an 8000 Hz bandwidth threshold instead of using the min_bandwidth parameter.
  • nemo_curator/stages/audio/alm/alm_data_overlap.py: Filters overlapping windows based on a threshold. Uses a confusing tuple order convention (stores as (end, start) but processes as (start, end) with swaps throughout), and the threshold semantics may be inverted from documentation expectations.
  • nemo_curator/stages/audio/alm/alm_manifest_reader.py: CompositeStage for reading JSONL manifests line-by-line without Pandas. Clean implementation with proper memory handling; the log message on line 53 shows the cumulative entry count across all manifests rather than the per-manifest count.
  • nemo_curator/stages/audio/alm/alm_manifest_writer.py: Writes AudioBatch entries to a JSONL manifest with a proper single-writer constraint. Clean implementation with good fsspec integration for cloud storage support.
  • tests/stages/audio/alm/test_alm_data_builder.py: Unit and integration tests for the builder stage with good coverage. The integration test on line 163 has a brittle assertion with a hardcoded count (181) that will break on any logic change or fixture modification.
  • tests/stages/audio/alm/test_alm_data_overlap.py: Unit and integration tests for the overlap stage. The integration test on lines 142-144 has brittle assertions with hardcoded counts (181, 25, 3035.50) that are sensitive to any change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input JSONL Manifests<br/>audio segments with<br/>diarization metadata] --> B[ALMManifestReader<br/>CompositeStage]
    B --> C[FilePartitioningStage<br/>discover & partition files]
    C --> D[ALMManifestReaderStage<br/>read line-by-line, no Pandas]
    D --> E[ALMDataBuilderStage<br/>create training windows]
    E --> F{Quality Filters}
    F -->|sample rate < 16kHz| G[Reject entry]
    F -->|bandwidth < 8kHz| H[Reject segment]
    F -->|speakers not 2-5| I[Reject window]
    F -->|duration not 108-132s| J[Reject window]
    F -->|all filters pass| K[Accept window]
    K --> L[ALMDataOverlapStage<br/>filter overlapping windows]
    L --> M{Overlap Check}
    M -->|overlap ratio >= threshold| N[Keep window closer<br/>to target duration]
    M -->|overlap ratio < threshold| O[Keep both windows]
    N --> P[ALMManifestWriterStage<br/>write JSONL output]
    O --> P
    P --> Q[Output JSONL<br/>filtered windows +<br/>stats + durations]
    
    style A fill:#e1f5ff
    style Q fill:#d4edda
    style G fill:#f8d7da
    style H fill:#f8d7da
    style I fill:#f8d7da
    style J fill:#f8d7da
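The overlap decision in the flowchart can be sketched as below. Treating the threshold as a ratio of the shorter window's duration is an assumption; the exact semantics were questioned in review, so this only illustrates the keep-closest-to-target idea:

```python
# Sketch of the flowchart's overlap check: if two windows overlap by at
# least `threshold` (here assumed to be a ratio of the shorter window),
# keep only the window whose duration is closer to the target; otherwise
# keep both. Windows are (start, end) tuples in seconds.
def overlap_ratio(a: tuple[float, float], b: tuple[float, float]) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0


def resolve(a, b, threshold: float = 0.5, target: float = 120.0):
    if overlap_ratio(a, b) < threshold:
        return [a, b]  # overlap below threshold: keep both windows
    closeness = lambda w: abs((w[1] - w[0]) - target)
    return [min((a, b), key=closeness)]  # keep window closer to target duration
```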

Last reviewed commit: 09c1ad5

@greptile-apps greptile-apps bot left a comment

30 files reviewed, 17 comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@karpnv karpnv self-requested a review January 31, 2026 00:35
@karpnv karpnv left a comment

LGTM

@ayushdg ayushdg left a comment

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an alm pipeline.
  2. You are already logging many statistics in the stages here, is it possible to also use _log_metrics like done in some of the text stages to log some of these timing metrics so that they can be tracked better to catch regressions?

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@mohammadaaftabv (Author)

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an alm pipeline.

https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm is the representative dataset. I am assuming that by benchmarks you mean the result of running both processors on the representative data; in that case, ALM data build should produce 181 windows based on the config in the test file, and ALM data overlap applied to those 181 windows, allowing at most 50% overlap, gives 3035.5 seconds of total output.

All this is in test cases here.

@mohammadaaftabv (Author)

2. You are already logging many statistics in the stages here, is it possible to also use _log_metrics like done in some of the text stages to log some of these timing metrics so that they can be tracked better to catch regressions?

Added _log_metrics calls to both stages, following the pattern in text stages. Now tracking:

  • ALMDataBuilderStage: process_entry_time, segments_processed, windows_created
  • ALMDataOverlapStage: filter_time, input_windows, output_windows
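A generic stand-in for this pattern; the real _log_metrics helper in NeMo Curator's text stages has its own signature, so this sketch only illustrates accumulating per-entry timing and count metrics:

```python
# Illustrative sketch of accumulating timing/count metrics per processed
# entry, in the spirit of _log_metrics. Class and method names are
# assumptions, not NeMo Curator's actual API.
import time
from collections import defaultdict


class MetricsMixin:
    def __init__(self):
        self._metrics: dict[str, float] = defaultdict(float)

    def _log_metrics(self, **kv: float) -> None:
        # Accumulate each named metric across calls
        for name, value in kv.items():
            self._metrics[name] += value

    def process_entry(self, entry: dict) -> list:
        start = time.perf_counter()
        windows = [entry]  # placeholder for the real windowing work
        self._log_metrics(
            process_entry_time=time.perf_counter() - start,
            segments_processed=1,
            windows_created=len(windows),
        )
        return windows
```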

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 4 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 1 comment


# Calculate statistics
# Stage 1 output: total_dur_list_window contains the original window count
stage1_windows = sum(len(e.get("total_dur_list_window", e.get("windows", []))) for e in output_entries)
Contributor:

I guess these make sense, but also take a look at Task._metadata and Task._stage_perf_stats to see whether anything there is relevant.

self._drop_fields_set = {f.strip() for f in self.drop_fields.split(",") if f.strip()}
self._drop_fields_top_level_set = {f.strip() for f in self.drop_fields_top_level.split(",") if f.strip()}

def process_dataset_entry(self, data_entry: dict[str, Any]) -> list[AudioBatch]:
Contributor:

Is it intentional that we operate on a single manifest entry at a time? Can any of this be vectorized? Same for other stages

Author:

Yes, this is intentional — it follows the LegacySpeechStage pattern used by all other audio stages (GetAudioDurationStage, PreserveByValueStage, etc.), where process() iterates over task.data and calls process_dataset_entry() per entry.

Parallelism is handled at the executor level instead. In benchmark testing (10,000 entries, XennaExecutor on 8-core i9-9900KF), the autoscaler allocated 4 workers to the Builder stage, achieving ~1,460 entries/sec aggregate throughput (365 entries/sec/worker) with 86% CPU utilization. The Overlap stage ran 3 workers at ~5,650 entries/sec. Full pipeline completed in 90s.

If we want batch-level optimization in the future, it would need to happen at the LegacySpeechStage base class level, which would affect all audio stages.

}
)

return [AudioBatch(data=[result])]
Contributor:

Each time we return a Task, you must pass along its parent task's metadata and stage_perf_stats.

In such a fan-out implementation this becomes hard to reason about.

Author:

Done

Contributor:

Yeah the _stage_perfs are supposedly propagated via the base LegacySpeechStage. I would be curious to look at the benchmark results for this PR though to get an even better understanding of how existing audio curation code can be refactored.


greptile-apps bot commented Feb 26, 2026

Additional Comments (3)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 168
Missing validation for stage parameters. Add checks to ensure:

  • min_speakers <= max_speakers
  • tolerance >= 0 (reasonable range)
  • target_window_duration > 0
  • min_sample_rate > 0

Without validation, invalid configs can cause confusing runtime behavior.
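One way to implement the suggested checks is a dataclass __post_init__ that fails fast on invalid configs; the field names mirror the review comment and may differ from the actual stage:

```python
# Sketch of the parameter validation suggested above, using a dataclass
# __post_init__. Field names and defaults mirror the review comment and
# this PR's discussion; the real ALMDataBuilderStage may differ.
from dataclasses import dataclass


@dataclass
class BuilderConfig:
    min_speakers: int = 2
    max_speakers: int = 5
    tolerance: float = 0.1
    target_window_duration: float = 120.0
    min_sample_rate: int = 16_000

    def __post_init__(self) -> None:
        # Fail fast so misconfiguration surfaces at construction time,
        # not as confusing runtime behavior deep in the pipeline.
        if self.min_speakers > self.max_speakers:
            raise ValueError("min_speakers must be <= max_speakers")
        if self.tolerance < 0:
            raise ValueError("tolerance must be >= 0")
        if self.target_window_duration <= 0:
            raise ValueError("target_window_duration must be > 0")
        if self.min_sample_rate <= 0:
            raise ValueError("min_sample_rate must be > 0")
```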


nemo_curator/stages/audio/alm/alm_data_builder.py, line 203
Potential KeyError if segments are missing required fields start, end. Consider adding validation or try/except to handle malformed input data gracefully with a clear error message.


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
Non-standard tuple order (end, start) requires compensating index swaps throughout the file (lines 67-68, 87-88, 94-95). This makes the code harder to understand and maintain. Consider using standard (start, end) convention or add a clear comment explaining why this order is intentional.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


greptile-apps bot commented Feb 27, 2026

Additional Comments (5)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 112
The hardcoded 8000 should use self.min_bandwidth so diagnostics are correct when users configure a different bandwidth threshold.


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
Returns an (end, start) tuple, requiring index swaps at lines 67-68, 87-88, and 94-95, and unpacking as `for end, start` at lines 121 and 126. The standard (start, end) order would improve readability.



tests/stages/audio/alm/test_alm_data_builder.py, line 163
The hardcoded count of 181 makes the test brittle: it will break on any fixture change or logic tweak. Consider asserting total_windows > 0 instead.



tests/stages/audio/alm/test_alm_data_overlap.py, line 144
Hardcoded counts (181, 25, 3035.50) make the tests brittle. Consider asserting invariants such as total_filtered_windows <= total_builder_windows and filtered_dur > 0 instead.

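The invariant-style assertions suggested here might look like the following sketch; function and variable names are illustrative, and windows are assumed to be (start, end) tuples in seconds:

```python
# Sketch of invariant-based test assertions in place of hardcoded golden
# counts. Names are illustrative; windows are (start, end) in seconds.
def check_overlap_invariants(builder_windows, filtered_windows):
    # The overlap stage only removes windows, never adds them
    assert len(filtered_windows) <= len(builder_windows)
    # Something should survive filtering on valid input
    assert len(filtered_windows) > 0
    total_dur = sum(end - start for start, end in filtered_windows)
    assert total_dur > 0
    return total_dur
```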


nemo_curator/stages/audio/alm/alm_manifest_reader.py, line 53
This logs the cumulative total len(entries) across all manifests, not the count from the current manifest. Track the per-manifest count separately for accurate logging.


Labels

needs-follow-up Issue needs follow-up

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants