
Add ALM Data Pipeline tutorial and stages #1419

Open
mohammadaaftabv wants to merge 25 commits into NVIDIA-NeMo:main from mohammadaaftabv:alm_data_build

Conversation


@mohammadaaftabv mohammadaaftabv commented Jan 23, 2026

Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

  • ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
  • ALMDataOverlapStage: Filters overlapping windows based on threshold, keeping windows closest to target duration
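As a rough illustration of the quality gates listed above, a self-contained sketch; the field names and defaults (sample rate ≥ 16 kHz, bandwidth ≥ 8 kHz, 2-5 speakers, 120 s target with 10% tolerance, numbers quoted elsewhere in this PR) are assumptions for illustration, not the stage's real API:

```python
# Illustrative sketch of the ALMDataBuilderStage quality filters.
# Names and defaults are assumptions taken from this PR's discussion,
# not the actual stage implementation.
from dataclasses import dataclass


@dataclass
class WindowFilter:
    min_sample_rate: int = 16_000
    min_bandwidth: int = 8_000
    min_speakers: int = 2
    max_speakers: int = 5
    target_duration: float = 120.0
    tolerance: float = 0.1  # accept durations within +/- 10% of target

    def accepts(self, window: dict) -> bool:
        lo = self.target_duration * (1 - self.tolerance)  # 108 s with defaults
        hi = self.target_duration * (1 + self.tolerance)  # 132 s with defaults
        return (
            window["sample_rate"] >= self.min_sample_rate
            and window["bandwidth"] >= self.min_bandwidth
            and self.min_speakers <= window["num_speakers"] <= self.max_speakers
            and lo <= window["duration"] <= hi
        )
```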

Add complete tutorial with:

  • Python CLI (pipeline.py) and Hydra runner (run.py)
  • Sample input data for testing
  • Comprehensive documentation

Tested with sample data:

  • Stage 1 produces 181 windows from 5 input entries
  • Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 25, 2026
@mohammadaaftabv mohammadaaftabv force-pushed the alm_data_build branch 4 times, most recently from 66abf28 to 0125f32 Compare January 29, 2026 09:36
@mohammadaaftabv mohammadaaftabv marked this pull request as ready for review January 29, 2026 09:38

greptile-apps bot commented Jan 29, 2026

Greptile Summary

This PR adds a complete ALM (Audio Language Model) data curation pipeline with two main processing stages, manifest I/O, comprehensive tests, documentation, and benchmarking support.

Key Changes

Core Stages:

  • ALMDataBuilderStage - Creates training windows from audio segments with quality filters (sample rate ≥16kHz, bandwidth ≥8kHz, 2-5 speakers, ~120s duration)
  • ALMDataOverlapStage - Filters overlapping windows based on configurable threshold, keeping windows closest to target duration
  • ALMManifestReader - CompositeStage that reads JSONL manifests line-by-line without Pandas for better memory efficiency
  • ALMManifestWriterStage - Writes processed entries to JSONL output
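The line-by-line manifest reading described above can be sketched as follows; the real ALMManifestReader uses fsspec for S3/GCS support, so the plain open() here is only a local-file stand-in:

```python
# Minimal sketch of line-by-line JSONL reading without Pandas, as the
# summary describes for ALMManifestReader. A stand-in, not the real stage:
# the actual reader uses fsspec so the same code can open s3:// and gs:// paths.
import json
from typing import Any, Iterator


def read_manifest(path: str) -> Iterator[dict[str, Any]]:
    """Yield one dict per non-empty JSONL line, never materializing a DataFrame."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Streaming one entry at a time keeps memory flat even when entries carry deeply nested diarization metadata.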

Tutorial & Documentation:

  • Complete Hydra-based pipeline with YAML configuration (tutorials/audio/alm/)
  • Comprehensive README with usage examples, input/output formats, configuration parameters
  • Sample data fixtures for testing and tutorials

Testing:

  • 38 total tests across 4 test files covering unit and integration scenarios
  • Shared fixtures in conftest.py for consistent test data

Benchmarking:

  • Integration with nightly benchmark framework
  • Comprehensive metrics collection (throughput, window counts, duration)

Architecture

The pipeline follows NeMo Curator patterns by inheriting from LegacySpeechStage and CompositeStage, integrating cleanly with the existing executor framework (Xenna, Ray). The manifest reader uses fsspec for cloud storage support (S3, GCS) and avoids Pandas to prevent memory blow-up with deeply nested audio metadata.

Issues Noted in Previous Reviews

Several issues have been flagged in previous review threads that should be addressed before merge (see "Previous Threads" section). The most critical are tuple order inconsistencies in the overlap stage, brittle test assertions, and threshold semantics that may not match documentation.

Confidence Score: 3/5

  • This PR adds substantial new functionality that is well-tested and documented, but has several code quality and maintainability issues flagged in previous review threads that should be addressed.
  • The core windowing and overlap filtering logic appears sound with 38 comprehensive tests. However, multiple issues identified in previous threads affect code quality: confusing tuple order convention requiring compensating swaps throughout the overlap stage, brittle test assertions with hardcoded golden values, hardcoded bandwidth threshold not using the configurable parameter, and potentially inverted threshold semantics. These don't appear to be critical correctness bugs, but they impact maintainability and could cause issues during future refactoring or when users configure non-default parameters.
  • Pay close attention to nemo_curator/stages/audio/alm/alm_data_overlap.py (tuple order inconsistency), nemo_curator/stages/audio/alm/alm_data_builder.py (hardcoded bandwidth value), and the integration test files with brittle assertions.

Important Files Changed

  • nemo_curator/stages/audio/alm/alm_data_builder.py: Implements window creation from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration). The core windowing logic is sound, but diagnostic code (line 112) hardcodes an 8000 Hz bandwidth threshold instead of using the min_bandwidth parameter.
  • nemo_curator/stages/audio/alm/alm_data_overlap.py: Filters overlapping windows based on a threshold. Uses a confusing tuple order convention (stores as (end, start) but processes as (start, end) with swaps throughout), and the threshold semantics may be inverted from documentation expectations.
  • nemo_curator/stages/audio/alm/alm_manifest_reader.py: CompositeStage for reading JSONL manifests line-by-line without Pandas. Clean implementation with proper memory handling; the log message on line 53 shows the cumulative entry count across all manifests rather than the per-manifest count.
  • nemo_curator/stages/audio/alm/alm_manifest_writer.py: Writes AudioBatch entries to a JSONL manifest with a proper single-writer constraint. Clean implementation with good fsspec integration for cloud storage support.
  • tests/stages/audio/alm/test_alm_data_builder.py: Unit and integration tests for the builder stage with good coverage. The integration test on line 163 has a brittle assertion with a hardcoded count (181) that will break on any logic change or fixture modification.
  • tests/stages/audio/alm/test_alm_data_overlap.py: Unit and integration tests for the overlap stage. The integration test on lines 142-144 has brittle assertions with hardcoded counts (181, 25, 3035.50) that are sensitive to any change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input JSONL Manifests<br/>audio segments with<br/>diarization metadata] --> B[ALMManifestReader<br/>CompositeStage]
    B --> C[FilePartitioningStage<br/>discover & partition files]
    C --> D[ALMManifestReaderStage<br/>read line-by-line, no Pandas]
    D --> E[ALMDataBuilderStage<br/>create training windows]
    E --> F{Quality Filters}
    F -->|sample rate < 16kHz| G[Reject entry]
    F -->|bandwidth < 8kHz| H[Reject segment]
    F -->|speakers not 2-5| I[Reject window]
    F -->|duration not 108-132s| J[Reject window]
    F -->|all filters pass| K[Accept window]
    K --> L[ALMDataOverlapStage<br/>filter overlapping windows]
    L --> M{Overlap Check}
    M -->|overlap ratio >= threshold| N[Keep window closer<br/>to target duration]
    M -->|overlap ratio < threshold| O[Keep both windows]
    N --> P[ALMManifestWriterStage<br/>write JSONL output]
    O --> P
    P --> Q[Output JSONL<br/>filtered windows +<br/>stats + durations]
    
    style A fill:#e1f5ff
    style Q fill:#d4edda
    style G fill:#f8d7da
    style H fill:#f8d7da
    style I fill:#f8d7da
    style J fill:#f8d7da
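The overlap decision in the flowchart can be sketched as below. Treating the threshold as a ratio of the shorter window's duration is an assumption; the exact semantics were questioned in review, so this only illustrates the keep-closest-to-target idea:

```python
# Sketch of the flowchart's overlap check: if two windows overlap by at
# least `threshold` (here assumed to be a ratio of the shorter window),
# keep only the window whose duration is closer to the target; otherwise
# keep both. Windows are (start, end) tuples in seconds.
def overlap_ratio(a: tuple[float, float], b: tuple[float, float]) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0


def resolve(a, b, threshold: float = 0.5, target: float = 120.0):
    if overlap_ratio(a, b) < threshold:
        return [a, b]  # overlap below threshold: keep both windows
    closeness = lambda w: abs((w[1] - w[0]) - target)
    return [min((a, b), key=closeness)]  # keep window closer to target duration
```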

Last reviewed commit: 09c1ad5

@greptile-apps greptile-apps bot left a comment

30 files reviewed, 17 comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@karpnv karpnv self-requested a review January 31, 2026 00:35
@karpnv karpnv left a comment

LGTM

@ayushdg ayushdg left a comment

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an alm pipeline.
  2. You are already logging many statistics in the stages here, is it possible to also use _log_metrics like done in some of the text stages to log some of these timing metrics so that they can be tracked better to catch regressions?

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@mohammadaaftabv (Author)

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an alm pipeline.

https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm is the representative dataset. I am assuming that by benchmarks you mean the result of running both processors on the representative data; in that case, ALM data build should produce 181 windows based on the config in the test file, and ALM data overlap applied to those 181 windows, allowing at most 50% overlap, gives 3035.5 seconds of total output.

All this is in test cases here.

@mohammadaaftabv (Author)

2. You are already logging many statistics in the stages here, is it possible to also use _log_metrics like done in some of the text stages to log some of these timing metrics so that they can be tracked better to catch regressions?

Added _log_metrics calls to both stages, following the pattern in text stages. Now tracking:

  • ALMDataBuilderStage: process_entry_time, segments_processed, windows_created
  • ALMDataOverlapStage: filter_time, input_windows, output_windows
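A generic stand-in for this pattern; the real _log_metrics helper in NeMo Curator's text stages has its own signature, so this sketch only illustrates accumulating per-entry timing and count metrics:

```python
# Illustrative sketch of accumulating timing/count metrics per processed
# entry, in the spirit of _log_metrics. Class and method names are
# assumptions, not NeMo Curator's actual API.
import time
from collections import defaultdict


class MetricsMixin:
    def __init__(self):
        self._metrics: dict[str, float] = defaultdict(float)

    def _log_metrics(self, **kv: float) -> None:
        # Accumulate each named metric across calls
        for name, value in kv.items():
            self._metrics[name] += value

    def process_entry(self, entry: dict) -> list:
        start = time.perf_counter()
        windows = [entry]  # placeholder for the real windowing work
        self._log_metrics(
            process_entry_time=time.perf_counter() - start,
            segments_processed=1,
            windows_created=len(windows),
        )
        return windows
```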

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 4 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 1 comment


# Calculate statistics
# Stage 1 output: total_dur_list_window contains the original window count
stage1_windows = sum(len(e.get("total_dur_list_window", e.get("windows", []))) for e in output_entries)
Contributor:

I guess these make sense, but also take a look at Task._metadata and Task._stage_perf_stats to see whether anything there is relevant.

self._drop_fields_set = {f.strip() for f in self.drop_fields.split(",") if f.strip()}
self._drop_fields_top_level_set = {f.strip() for f in self.drop_fields_top_level.split(",") if f.strip()}

def process_dataset_entry(self, data_entry: dict[str, Any]) -> list[AudioBatch]:
Contributor:

Is it intentional that we operate on a single manifest entry at a time? Can any of this be vectorized? Same for other stages

Author:

Yes, this is intentional — it follows the LegacySpeechStage pattern used by all other audio stages (GetAudioDurationStage, PreserveByValueStage, etc.), where process() iterates over task.data and calls process_dataset_entry() per entry.

Parallelism is handled at the executor level instead. In benchmark testing (10,000 entries, XennaExecutor on 8-core i9-9900KF), the autoscaler allocated 4 workers to the Builder stage, achieving ~1,460 entries/sec aggregate throughput (365 entries/sec/worker) with 86% CPU utilization. The Overlap stage ran 3 workers at ~5,650 entries/sec. Full pipeline completed in 90s.

If we want batch-level optimization in the future, it would need to happen at the LegacySpeechStage base class level, which would affect all audio stages.

}
)

return [AudioBatch(data=[result])]
Contributor:

Each time we return a Task, you must pass along its parent task's metadata and stage_perf_stats.

In such a fan-out implementation this becomes hard to reason about.

Author:

Done

Contributor:

Yeah the _stage_perfs are supposedly propagated via the base LegacySpeechStage. I would be curious to look at the benchmark results for this PR though to get an even better understanding of how existing audio curation code can be refactored.


greptile-apps bot commented Feb 26, 2026

Additional Comments (3)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 168
Missing validation for stage parameters. Add checks to ensure:

  • min_speakers <= max_speakers
  • tolerance >= 0 (reasonable range)
  • target_window_duration > 0
  • min_sample_rate > 0

Without validation, invalid configs can cause confusing runtime behavior.
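One way to implement the suggested checks is a dataclass __post_init__ that fails fast on invalid configs; the field names mirror the review comment and may differ from the actual stage:

```python
# Sketch of the parameter validation suggested above, using a dataclass
# __post_init__. Field names and defaults mirror the review comment and
# this PR's discussion; the real ALMDataBuilderStage may differ.
from dataclasses import dataclass


@dataclass
class BuilderConfig:
    min_speakers: int = 2
    max_speakers: int = 5
    tolerance: float = 0.1
    target_window_duration: float = 120.0
    min_sample_rate: int = 16_000

    def __post_init__(self) -> None:
        # Fail fast so misconfiguration surfaces at construction time,
        # not as confusing runtime behavior deep in the pipeline.
        if self.min_speakers > self.max_speakers:
            raise ValueError("min_speakers must be <= max_speakers")
        if self.tolerance < 0:
            raise ValueError("tolerance must be >= 0")
        if self.target_window_duration <= 0:
            raise ValueError("target_window_duration must be > 0")
        if self.min_sample_rate <= 0:
            raise ValueError("min_sample_rate must be > 0")
```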


nemo_curator/stages/audio/alm/alm_data_builder.py, line 203
Potential KeyError if segments are missing required fields start, end. Consider adding validation or try/except to handle malformed input data gracefully with a clear error message.


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
Non-standard tuple order (end, start) requires compensating index swaps throughout the file (lines 67-68, 87-88, 94-95). This makes the code harder to understand and maintain. Consider using standard (start, end) convention or add a clear comment explaining why this order is intentional.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


greptile-apps bot commented Feb 27, 2026

Additional Comments (5)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 112
The hardcoded 8000 should use self.min_bandwidth so diagnostics are correct when users configure a different bandwidth threshold.


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
Returns an (end, start) tuple, requiring index swaps at lines 67-68, 87-88, and 94-95, and unpacking as `for end, start` at lines 121 and 126. The standard (start, end) order would improve readability.



tests/stages/audio/alm/test_alm_data_builder.py, line 163
The hardcoded count of 181 makes the test brittle: it will break on any fixture change or logic tweak. Consider asserting total_windows > 0 instead.



tests/stages/audio/alm/test_alm_data_overlap.py, line 144
Hardcoded counts (181, 25, 3035.50) make the tests brittle. Consider asserting invariants such as total_filtered_windows <= total_builder_windows and filtered_dur > 0 instead.

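The invariant-style assertions suggested here might look like the following sketch; function and variable names are illustrative, and windows are assumed to be (start, end) tuples in seconds:

```python
# Sketch of invariant-based test assertions in place of hardcoded golden
# counts. Names are illustrative; windows are (start, end) in seconds.
def check_overlap_invariants(builder_windows, filtered_windows):
    # The overlap stage only removes windows, never adds them
    assert len(filtered_windows) <= len(builder_windows)
    # Something should survive filtering on valid input
    assert len(filtered_windows) > 0
    total_dur = sum(end - start for start, end in filtered_windows)
    assert total_dur > 0
    return total_dur
```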


nemo_curator/stages/audio/alm/alm_manifest_reader.py, line 53
This logs the cumulative total len(entries) across all manifests, not the count from the current manifest. Track the per-manifest count separately for accurate logging.


Labels

needs-follow-up Issue needs follow-up

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants