Accurate Starting Timestamps for Zipformer2-Pruned-transducer streaming models #1874
Comments:
The audio file is accessible if needed.
These models are really intended for transcription; the alignment is just a side effect and not really something we optimize for. If you are mainly interested in sentence-level alignment, it may be possible to just ignore any frames that coincide with the start and end of the utterance.
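A minimal sketch of this suggestion, assuming timestamps are in seconds and the audio duration is known; the function name, the 40 ms frame width, and the boundary rule are hypothetical choices, not anything confirmed in this thread:

```python
def sentence_span(timestamps, audio_dur_s, frame_s=0.04):
    """Estimate a sentence-level (start, end) from per-token timestamps,
    ignoring tokens pinned to the very first or very last output frame,
    which transducer decoding often places unreliably."""
    inner = [t for t in timestamps if 0.0 < t < audio_dur_s - frame_s]
    if not inner:
        # Every token sat on a boundary frame; fall back to the full clip.
        return 0.0, audio_dur_s
    return inner[0], inner[-1]
```

With the 960h output below, this would drop the first token (stamped at 0.00 s) and the tokens at or beyond the final frame, yielding a tighter sentence span.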
Thank you so much, Dan! Will update here shortly.
Sure, that might be worth trying.
Hi guys,
Recently we have been trying to get the starting timestamp of each non-blank token from a streaming Zipformer2 pruned transducer, but the times are oddly inaccurate.
We tested with a 16 kHz audio clip that is 2.84 s long.
We tested both our own trained models and the officially released pretrained model trained on LibriSpeech 960h. The timestamps from these well-trained models look like this:
```json
{
  "text": " HOW LONG DOES IT TAKE TO GET A BUKIT BATOK WEST BY TAXI",
  "tokens": [" HOW", " LONG", " DOES", " IT", " TAKE", " TO", " GET", " A", " B", "U", "K", "IT", " BA", "TO", "K", " WE", "S", "T", " BY", " T", "A", "X", "I"],
  "timestamps": [0.00, 0.32, 0.52, 0.68, 0.84, 0.96, 1.08, 1.20, 1.36, 1.40, 1.48, 1.52, 1.72, 1.84, 1.92, 2.12, 2.28, 2.32, 2.56, 2.68, 2.76, 2.80, 3.04],
  "ys_probs": [-0.005738, -0.004155, -0.698301, -0.005920, -0.004257, -0.758217, -0.001242, -1.111435, -0.001471, -0.209558, -0.005508, -0.009295, -0.005622, -0.004448, -0.362340, -0.000596, -0.000699, -0.417795, -0.000842, -0.001755, -0.001307, -0.012808, -0.274420],
  "lm_probs": [],
  "context_scores": [],
  "segment": 0,
  "start_time": 0.00,
  "is_final": false
}
```
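Two observations about this output can be checked mechanically. The timestamps are all multiples of 0.04 s, which is consistent with 10 ms feature frames and a 4x encoder subsampling factor (both are assumptions about the recipe, not stated in this issue), and the final token is stamped at 3.04 s even though the clip is only 2.84 s long. A small sketch, with hypothetical names:

```python
SUBSAMPLING = 4        # assumed encoder subsampling factor
FRAME_SHIFT_S = 0.01   # assumed fbank frame shift (10 ms)

def frame_to_seconds(frame_idx: int) -> float:
    """Convert an encoder output frame index to seconds,
    assuming the subsampling and frame shift above."""
    return frame_idx * SUBSAMPLING * FRAME_SHIFT_S

def late_tokens(timestamps, audio_dur_s):
    """Return timestamps that fall beyond the end of the audio,
    a symptom of the delayed emission typical of transducers."""
    return [t for t in timestamps if t > audio_dur_s]
```

For the output above, `late_tokens(timestamps, 2.84)` flags the final token at 3.04 s, so at least the tail of the alignment cannot be physically correct.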
However, when we test a less well-trained model, e.g. one trained on LibriSpeech 100h, the timestamps look more accurate, even though the transcription contains errors:
```json
{
  "text": " ALONG AS IT TO GET IT LOOKED BUT WEST BY TEXY",
  "tokens": [" A", "LO", "NG", " AS", " IT", " TO", " GET", " IT", " LOOK", "ED", " BUT", " WE", "ST", " BY", " T", "E", "X", "Y"],
  "timestamps": [0.16, 0.24, 0.28, 0.44, 0.56, 0.84, 0.96, 1.08, 1.20, 1.40, 1.56, 2.00, 2.12, 2.24, 2.48, 2.52, 2.56, 2.76],
  "ys_probs": [-0.098348, -0.030746, -0.019178, -0.684411, -0.811485, -0.696930, -0.107058, -1.197881, -1.005888, -0.265035, -0.968925, -0.300151, -0.342162, -0.085697, -0.258904, -0.035624, -0.046419, -0.402710],
  "lm_probs": [],
  "context_scores": [],
  "segment": 0,
  "start_time": 0.00,
  "is_final": false
}
```
I have looked through the related issue k2-fsa/sherpa#52. Is this just an inherent issue with the transducer architecture?