[MagpieTTS][bugfix] reset kv cache for longform inference and add missing utmosv2 score #15385

Closed
XuesongYang wants to merge 2 commits intoNVIDIA-NeMo:mainfrom
XuesongYang:xueyang/pr-bugfix-stale-kvcache

Conversation

@XuesongYang
Collaborator

Summary

Two inference bugfixes for MagpieTTS.

1. Reset KV cache at start of longform inference batch

generate_long_form_speech never resets the decoder KV cache. When the inference script
processes multiple datasets sequentially (e.g., a non-longform dataset followed by a longform
dataset), the prior generate_speech call leaves use_cache=True with populated tensors.
The longform path then inherits this stale cache, causing a RuntimeError ("Sizes of tensors must match") in the torch.cat that concatenates cached and new keys/values during self-attention.

Fix: call reset_cache(use_cache=self.model.use_kv_cache_for_inference) at the start of each
longform batch in _run_longform_inference, matching the pattern used by infer_batch.
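The failure mode and the fix can be sketched with a toy decoder. All names below are illustrative stand-ins, not the actual MagpieTTS API; only the reset_cache(use_cache=...) call mirrors the pattern this PR adds to _run_longform_inference:

```python
# Toy sketch of the stale-KV-cache bug and the fix (hypothetical names).
# Lists stand in for tensors; the "+" concatenation stands in for
# torch.cat([self.cache['self_k'], k], dim=1) in transformer_2501.py.

class ToyDecoder:
    def __init__(self):
        self.use_cache = False
        self.cache = {"self_k": None}

    def reset_cache(self, use_cache):
        # Clear any populated entries and record the caching mode --
        # the call infer_batch already makes and longform was missing.
        self.use_cache = use_cache
        self.cache = {"self_k": None}

    def step(self, k_new):
        # k_new: this step's keys. With a stale cache from a *different*
        # dataset, the real torch.cat raises the size-mismatch error;
        # here the stale entries would silently corrupt the sequence.
        if self.use_cache:
            if self.cache["self_k"] is not None:
                k_new = self.cache["self_k"] + k_new
            self.cache["self_k"] = k_new
        return k_new

dec = ToyDecoder()

# A non-longform generate_speech call populates the cache...
dec.reset_cache(use_cache=True)
dec.step([1] * 6)          # cache now holds 6 positions

# ...and the fix makes each longform batch start from a clean cache
# instead of inheriting those 6 stale positions:
dec.reset_cache(use_cache=True)
out = dec.step([2] * 64)
print(len(out))            # 64: only this batch's keys, no stale prefix
```

Without the second reset_cache call, the step would see 6 cached positions plus 64 new ones, which is exactly the "Expected size 6 but got size 64" mismatch in the traceback below.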

Error Details:

[NeMo I 2026-02-11 03:24:13 inference:317] Using longform inference path
[NeMo I 2026-02-11 03:24:13 inference:459] Cleaning up old generated files in: /results/moe16_sinkhorn_top1_valLoss5.0469_step2625132_epoch524_decoder-MoE_16x1_d3072_sinkhorn_Temp0.7_Topk80_Cfg_True_2.5_Prior_True_0.1_5_0_None_None_LT_False_MaskGit_3_None_None_EOS_argmax_or_multinomial_any_IgnoreFST_False_SV_titanet_libritts_seen/audio/repeat_0
[NeMo I 2026-02-11 03:24:14 inference:602] Processing batch 1/6 (longform)
[NeMo I 2026-02-11 03:24:15 magpietts:4621] Longform decoding timestep 0
Traceback (most recent call last):
  File "/code/examples/tts/magpietts_inference.py", line 668, in <module>
    main()
  File "/code/examples/tts/magpietts_inference.py", line 638, in main
    cer, ssim = run_inference_and_evaluation(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/examples/tts/magpietts_inference.py", line 257, in run_inference_and_evaluation
    rtf_metrics_list, _, codec_file_paths = runner.run_inference_on_dataset(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/modules/magpietts_inference/inference.py", line 318, in run_inference_on_dataset
    return self._run_longform_inference(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/modules/magpietts_inference/inference.py", line 646, in _run_longform_inference
    output = self.model.generate_long_form_speech(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/models/magpietts.py", line 4650, in generate_long_form_speech
    all_code_logits, attn_probs, dec_out = self._run_longform_forward_with_cfg(
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/models/magpietts.py", line 4321, in _run_longform_forward_with_cfg
    combined_logits, attn_probs, dec_out, _ = self.forward(
                                              ^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/models/magpietts.py", line 1262, in forward
    decoder_out = self.decoder(
                  ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/modules/transformer_2501.py", line 826, in forward
    out_dict = layer(x, x_mask, _cond, _cond_mask, attn_prior=_attn_prior)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/modules/transformer_2501.py", line 577, in forward
    x_, s_attn_prob = self.self_attention(query=self.norm_self(x), query_mask=x_mask)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/modules/transformer_2501.py", line 300, in forward
    y, attn_prob = self.attn_naive(query, query_mask, memory, memory_mask, attn_prior)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/modules/transformer_2501.py", line 222, in attn_naive
    q, k, v, mask = self.compute_qkv_and_mask(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/nemo/collections/tts/modules/transformer_2501.py", line 358, in compute_qkv_and_mask
    k = torch.cat([self.cache['self_k'], k], dim=1)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 6 but got size 64 for tensor number 1 in the list.

2. Save filewise utmosv2 score in evaluation output

The utmosv2 metric was computed per file but not included in the saved filewise metrics
JSON, so downstream visualization (box plots) could not display MOS scores.

Fix: add 'utmosv2' to filewise_metrics_keys_to_save in evaluate_generated_audio.py.
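The effect of the change can be sketched as follows. The key list mirrors the review hunk and the PR description, but the surrounding record-filtering code is an illustrative assumption, not the actual evaluate_generated_audio.py implementation:

```python
import json

# Sketch (hypothetical surrounding code): only whitelisted keys from each
# per-file metrics record are written to the filewise metrics JSON.
filewise_metrics_keys_to_save = [
    'gt_audio_filepath',
    'pred_audio_filepath',
    'context_audio_filepath',
    'utmosv2',   # the key this PR adds, so MOS reaches the saved JSON
]

# One per-file record as the evaluator might assemble it (values made up).
record = {
    'gt_audio_filepath': '/data/gt.wav',
    'pred_audio_filepath': '/results/pred.wav',
    'context_audio_filepath': '/data/ctx.wav',
    'utmosv2': 3.94,
    'internal_debug_state': object(),   # not JSON-serializable; filtered out
}

saved = {k: record[k] for k in filewise_metrics_keys_to_save if k in record}
print(json.dumps(saved, indent=2))
```

With 'utmosv2' absent from the list, the score was computed but dropped at this filtering step, which is why the downstream box plots had nothing to display.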

Error Details:
[Screenshot: 2026-02-11 at 10:22 AM]

Commits:

1. …nt stale cache from prior batch or datasets
   Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
2. … display MOS.
   Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Copilot AI (Contributor) left a comment:

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Review thread on evaluate_generated_audio.py:

    'gt_audio_filepath',
    'pred_audio_filepath',
    'context_audio_filepath',
    'utmosv2',
Collaborator:

This will be added in #15381, please remove from yours

Collaborator (Author):

I've reviewed the other PR and don't anticipate any conflicts during a rebase. I suggest we avoid reverting the commit here. Instead, let's simply merge whichever PR is ready first, and then rebase the remaining one.

@blisc
Collaborator

blisc commented Feb 12, 2026

@subhankar-ghosh please review

@blisc blisc marked this pull request as draft February 18, 2026 20:13
@blisc
Collaborator

blisc commented Feb 18, 2026

Drafting since we plan to add this to #15375

@XuesongYang
Collaborator Author

> Drafting since we plan to add this to #15375

let's close this PR and move our discussion to that PR.
