[draft] TTS eval #1237

Draft

rfejgin wants to merge 26 commits into main from rfejgin/2512_tts_eval_merge

Conversation

@rfejgin (Collaborator) commented Feb 12, 2026

Creating this PR just to easily view diffs.

karpnv and others added 26 commits December 20, 2025 07:00
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Create a small dummy context wav for requests without context_audio_filepath to prevent dataloader failures (missing d*.wav) and 500s from the unified server.
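
A minimal sketch of that idea, assuming soundfile for the write; the file name, duration, and sample rate are illustrative, not taken from the PR:

```python
# Sketch only: write a short silent WAV to stand in for a missing context clip.
import numpy as np
import soundfile as sf

SAMPLE_RATE = 22050   # assumed codec sample rate
DURATION_SEC = 0.1    # a tiny clip is enough to keep the dataloader from failing

silence = np.zeros(int(SAMPLE_RATE * DURATION_SEC), dtype=np.float32)
sf.write("dummy_context.wav", silence, SAMPLE_RATE)
```

Requests that arrive without a context_audio_filepath can then be pointed at this file instead of failing in the dataloader.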
Avoid KV-cache shape mismatches when batch sizes vary between requests in the unified server.
Route HuggingFace resolve URLs used by NeMo audio codec checkpoints through huggingface_hub download so multi-rank server startup avoids repeated downloads and 429s.
Longform decoding with the transformer cache path can produce sequence-length mismatches; disable cache per request batch to prevent 500s in serve_unified.
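
The two cache-related commits above boil down to not reusing a decoder cache across request batches. A generic sketch of that pattern, with hypothetical names (`model.infer`, `use_kv_cache`) that are not NeMo's actual API:

```python
# Hypothetical illustration, not NeMo code: decode each request batch without a
# persistent KV cache, so varying batch sizes or longform sequence lengths never
# hit a cache that was allocated for a previous request's shapes.
def run_request_batch(model, batch):
    return model.infer(batch, use_kv_cache=False)
```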
Correct HuggingFace resolve URL matching so downloads go through hf_hub_download() and avoid multi-rank 429s.
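
A sketch of that routing, assuming the standard resolve-URL layout; the regex and helper are illustrative, and only `hf_hub_download()` itself is the real API:

```python
import re
from huggingface_hub import hf_hub_download

# https://huggingface.co/<repo_id>/resolve/<revision>/<filename>
_RESOLVE_RE = re.compile(
    r"^https://huggingface\.co/(?P<repo_id>[^/]+/[^/]+)/resolve/(?P<revision>[^/]+)/(?P<filename>.+)$"
)

def fetch(url: str) -> str:
    """Return a local path for `url`, going through the shared Hugging Face cache."""
    match = _RESOLVE_RE.match(url)
    if match is None:
        raise ValueError(f"not a Hugging Face resolve URL: {url}")
    return hf_hub_download(
        repo_id=match["repo_id"],
        filename=match["filename"],
        revision=match["revision"],
    )
```

Because hf_hub_download() reuses the local cache, repeated server startups do not have to re-download the checkpoint, which is what the commit relies on to avoid the 429s.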
Stop setting srun --wait by default; allow opt-in via cluster_config.srun_wait_seconds.
Add a large srun --wait for multi-instance runs to override nemo_run's default --wait=60, preventing premature termination when some ranks finish earlier.
Lower Magpie inference runner batch size to reduce memory/latency spikes under multi-instance load.
Use a 1-hour default srun --wait for multi-instance runs to avoid premature task termination when chunk runtimes differ.
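
The three srun-related commits above add up to one knob. A sketch of the intended behavior, where the config key comes from the commit messages but the helper itself is illustrative:

```python
def srun_wait_args(cluster_config: dict, multi_instance: bool) -> list[str]:
    """Build the `--wait` part of an srun command line (sketch only).

    Slurm's `srun --wait=<seconds>` sets how long srun waits after the first
    task exits before killing the rest, so a large value keeps fast-finishing
    ranks from tearing down ranks still working through longer chunks.
    """
    wait_seconds = cluster_config.get("srun_wait_seconds")
    if wait_seconds is None and multi_instance:
        wait_seconds = 3600  # 1-hour default for multi-instance runs
    return [f"--wait={wait_seconds}"] if wait_seconds is not None else []
```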
Introduce the emergent_tts dataset package with prepare/generate/score helpers and default configs to run EmergentTTS evaluation via NeMo-Skills.

Co-authored-by: Cursor <cursoragent@cursor.com>
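
As a rough illustration of what the prepare side of such a dataset package does, here is a sketch; the repo id, field names, and output layout are placeholders rather than the actual emergent_tts package contents:

```python
import json
from pathlib import Path

from datasets import load_dataset

def prepare(output_dir: str, repo_id: str = "org/EmergentTTS-Eval") -> Path:
    """Download the eval set and flatten it into a JSONL manifest (sketch only)."""
    output_path = Path(output_dir) / "test.jsonl"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    dataset = load_dataset(repo_id, split="test")
    with output_path.open("w") as f:
        for example in dataset:
            # Keep only the fields later generation/scoring steps would need.
            f.write(json.dumps({"id": example.get("id"), "text": example.get("text")}) + "\n")
    return output_path
```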
Install google-genai for EmergentTTS-Eval, run scoring from the dataset base dir so relative paths resolve, and avoid shipping large local caches/data. Document EmergentTTS-Eval usage in nv_tts guide.

Co-authored-by: Cursor <cursoragent@cursor.com>
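
The "run scoring from the dataset base dir" point is a working-directory concern. A hedged sketch of it, where the script name and flag are illustrative rather than the real EmergentTTS-Eval CLI:

```python
import subprocess
from pathlib import Path

def run_scoring(base_dir: str, predictions: str) -> None:
    # With cwd set to the dataset base dir, the scorer's relative paths
    # (prompt files, cached judgments, etc.) resolve the way its authors intended.
    subprocess.run(
        ["python", "score.py", "--predictions", predictions],
        cwd=Path(base_dir),
        check=True,
    )
```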
Document dataset preparation (HF_TOKEN) and evaluation workflow, including cloning and patching EmergentTTS-Eval for NVIDIA Inference API judging.

Co-authored-by: Cursor <cursoragent@cursor.com>
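
On the HF_TOKEN point, the authentication step presumably looks like the standard Hugging Face login; this is a sketch of that assumption, not a copy of the guide:

```python
import os
from huggingface_hub import login

# Gated dataset assets require an authenticated session before preparation runs.
login(token=os.environ["HF_TOKEN"])
```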
@rfejgin changed the title from "[draft, please ignore] TTS eval" to "[draft] TTS eval" on Feb 12, 2026
