
[TTS] Infrastructure for parallelization of evaluation (scoring) #15417

Merged
rfejgin merged 28 commits into NVIDIA-NeMo:main from rfejgin:magpietts_evaluation_parallelization
Feb 24, 2026
Conversation


@rfejgin rfejgin commented Feb 20, 2026

This PR makes MagpieTTS's evaluation a little faster and much more parallelizable.

Changes:

  • Made the ASR step batched, rather than batch-size-1. We first collect all audios that need ASR and run ASR on them in a batched manner before the main per-sample evaluation loop. This substantially speeds up the ASR part of evaluation.
  • Infrastructure towards multi-GPU evaluation (scoring). That is something we will do (and have prototyped) with NeMo Skills later on. To enable that, evaluation was broken down into two steps: a first step where evaluation of each utterance is independent of other utterances, and a second step that focuses on parts that require global state.
  • During refactoring I also removed the constant STANDARD_METRIC_KEYS since the set of metrics on which to compute CIs can be inferred from the metrics themselves, which should be easier to maintain.
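
The batched-ASR change described in the first bullet can be sketched as follows. This is an illustrative outline, not the actual NeMo code: `transcribe_batch` stands in for the real batched ASR model call, and the record/field names are assumptions.

```python
# Hypothetical sketch: gather all audio paths up front, transcribe them in
# batches, and cache the transcripts before the per-sample evaluation loop.

def transcribe_batch(audio_paths):
    # Placeholder for a real batched ASR call (e.g. asr_model.transcribe(paths)).
    return [f"transcript of {p}" for p in audio_paths]

def batched_asr(records, batch_size=16):
    """Run ASR once, in batches, instead of per-sample with batch size 1."""
    paths = [r["audio_filepath"] for r in records]
    transcripts = {}
    for i in range(0, len(paths), batch_size):
        batch = paths[i : i + batch_size]
        for path, text in zip(batch, transcribe_batch(batch)):
            transcripts[path] = text
    return transcripts
```

The per-sample loop then looks up transcripts from the cache instead of invoking the ASR model one file at a time.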

So evaluate() has been broken into:

  1. evaluate_dir(): computes metrics for all audios in a given directory+manifest and outputs per-file metrics.
  2. compute_global_metrics(): takes the per-file metrics collected in (1) and computes global metrics from them. This mostly amounts to computing averages, but it also includes the FCD computation, since the statefulness of the FCD metric means it cannot easily be broken down into directory-wise chunks.
  3. evaluate(): wrapper that chains (1) and (2) for easy use in NeMo. NeMo skills would call (1) and (2) directly.
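
The three-function split above can be sketched like this. The function names match the PR, but the bodies, signatures, and metric fields are illustrative assumptions, not the actual NeMo implementation.

```python
# Hedged sketch of the evaluate() split: a parallelizable per-directory step
# and a global aggregation step, chained by a convenience wrapper.

def evaluate_dir(audio_dir):
    # Per-file metrics for one directory; independent across directories,
    # so a driver (e.g. NeMo Skills) can run one call per GPU/chunk.
    return [{"file": f"{audio_dir}/utt{i}.wav", "cer": 0.1 * i} for i in range(3)]

def compute_global_metrics(per_file_metrics):
    # Aggregate per-file metrics (mostly averaging; stateful metrics like
    # FCD would also live here, since they need the full set of files).
    cers = [m["cer"] for m in per_file_metrics]
    return {"cer": sum(cers) / len(cers)}

def evaluate(audio_dir):
    # Wrapper chaining the two steps, preserving the old single-call API.
    return compute_global_metrics(evaluate_dir(audio_dir))
```

A parallel driver would call evaluate_dir() once per chunk, concatenate the per-file lists, and call compute_global_metrics() once at the end.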

Running on a single GPU (local machine), these changes yield ~20% speedup of evaluation (more for larger sets, less for small ones, due to overhead of loading models). The benefit is much larger when evaluating on multiple GPUs in parallel, which we have prototyped in NeMo Skills (and will merge later on).

@vmendelev : adding you just as FYI. I think your existing Skills integration should work as-is after this PR, since the evaluate() API hasn't changed. Later, we can break down how Skills does Magpie scoring into the parallelizable part (evaluate_dir()) and the aggregation step (compute_global_metrics()) – I experimented with that and it seemed to work well.

For debugging NeMo Skills deployment

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
When running the project from a directory different from the repo root
(which happens in NeMo Skills), these paths need to be converted to absolute
paths, which is done in this commit.
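
A minimal sketch of the conversion this commit message describes, assuming a known repo root; the helper name is hypothetical.

```python
# Resolve repo-relative paths against a fixed root so they remain valid when
# the process runs from a different working directory (as in NeMo Skills).
from pathlib import Path

def to_absolute(path, root):
    p = Path(path)
    return p if p.is_absolute() else (Path(root) / p).resolve()
```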

… in Nemo Skills)

@github-actions github-actions bot added the TTS label Feb 20, 2026
- remove g2p path handling (unrelated to parallelization)
- update a comment

It has been moved outside of evaluate() since in the NeMo Skills use
case we need the full metrics for chunk-wise scoring and aggregation
at the end.

@rfejgin rfejgin changed the title [TTS] Infrastructure for parallel evalution [TTS] Parallel evolution infrastructure Feb 21, 2026
@rfejgin rfejgin changed the title [TTS] Parallel evolution infrastructure [Draft] [TTS] Parallel evolution infrastructure Feb 21, 2026
@rfejgin rfejgin changed the title [Draft] [TTS] Parallel evolution infrastructure [Draft] [TTS] Parallel evaluation infrastructure Feb 21, 2026
- Break evaluation into two steps:
  - evaluate_dir() for directory-level evaluation. Can be run in parallel across multiple directories (e.g. in NeMo Skills)
  - compute_global_metrics() for global metrics aggregation.
- Move model loading to separate function
- Cleanup

@rfejgin rfejgin changed the title [Draft] [TTS] Parallel evaluation infrastructure [Draft] [TTS] Evaluation: Batch the ASR; refactor for parallelization Feb 21, 2026
@rfejgin rfejgin marked this pull request as ready for review February 23, 2026 19:28
…jgin/NeMo into magpietts_evaluation_parallelization
logging.warning(f"Metric '{key}' not found in any measurements")
results[key] = "N/A"
continue

Collaborator Author:
This check no longer makes sense since the metric names are now inferred from the metrics themselves.



# Define the standard metric keys used in evaluation
STANDARD_METRIC_KEYS = [
@rfejgin rfejgin Feb 23, 2026
Removed for better maintainability - we can infer these names from the metrics themselves.

import torch
from threadpoolctl import threadpool_limits

# If UTMOSv2 cache is not set but HF_HOME is, use an area under HF_HOME for the cache location
Collaborator Author:
As part of making evaluation more efficient, we want to ensure UTMOS models don't get re-downloaded each time.
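
The cache-fallback logic behind that comment might look like the sketch below. The environment-variable name for the UTMOSv2 cache is an assumption for illustration; only the HF_HOME fallback idea comes from the source.

```python
# If no dedicated UTMOSv2 cache dir is set but HF_HOME is, place the cache
# under HF_HOME so models aren't re-downloaded on every evaluation run.
# "UTMOSV2_CACHE" is a hypothetical variable name, not the real one.
import os

def utmos_cache_dir(env):
    if env.get("UTMOSV2_CACHE"):
        return env["UTMOSV2_CACHE"]
    if env.get("HF_HOME"):
        return os.path.join(env["HF_HOME"], "utmosv2")
    return None
```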

Remove unnecessary NaN default values for metrics.

@rfejgin rfejgin marked this pull request as ready for review February 23, 2026 22:34
gt_audio_paths = [_resolve_path(audio_dir, r.get('audio_filepath')) for r in records]
context_audio_paths = [_resolve_path(audio_dir, r.get('context_audio_filepath')) for r in records]

device = "cuda"
Collaborator:
Hmm, guess we always hard-coded this, but should we remove this hardcode?

Collaborator Author:
Done in latest commit. It's part of EvaluationConfig now, still defaulting to "cuda".
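
Per this thread, the device moved into the evaluation config with "cuda" as the default. A minimal sketch, assuming a dataclass-style config; only the `device` field and its default come from the discussion, the rest is illustrative.

```python
# Sketch: device is configurable rather than hard-coded, defaulting to "cuda".
from dataclasses import dataclass

@dataclass
class EvaluationConfig:
    device: str = "cuda"
    batch_size: int = 16  # hypothetical companion field for the batched ASR
```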

blisc previously approved these changes Feb 24, 2026
@rfejgin rfejgin merged commit f0e64ea into NVIDIA-NeMo:main Feb 24, 2026
131 checks passed