
Accurate Starting Timestamps for Zipformer2-Pruned-transducer streaming models #1874

Open
pengyizhou opened this issue Feb 5, 2025 · 4 comments

Comments

@pengyizhou

Hi guys,
Recently, we have been trying to get starting timestamps for each of non-blank tokens using Zipformer2-Pruned-transducer, but found the time is weirdly inaccurate.
We tested on a 16000Hz audio with length of 2.84s (shown as follows)

[waveform image of the 2.84 s test audio]

We tested both our own trained models and the officially released pretrained model trained on LibriSpeech 960h, but the timestamps from these well-trained models lag noticeably (the last token below is even stamped at 3.04 s, past the end of the 2.84 s audio):
{
  "text": " HOW LONG DOES IT TAKE TO GET A BUKIT BATOK WEST BY TAXI",
  "tokens": [" HOW", " LONG", " DOES", " IT", " TAKE", " TO", " GET", " A", " B", "U", "K", "IT", " BA", "TO", "K", " WE", "S", "T", " BY", " T", "A", "X", "I"],
  "timestamps": [0.00, 0.32, 0.52, 0.68, 0.84, 0.96, 1.08, 1.20, 1.36, 1.40, 1.48, 1.52, 1.72, 1.84, 1.92, 2.12, 2.28, 2.32, 2.56, 2.68, 2.76, 2.80, 3.04],
  "ys_probs": [-0.005738, -0.004155, -0.698301, -0.005920, -0.004257, -0.758217, -0.001242, -1.111435, -0.001471, -0.209558, -0.005508, -0.009295, -0.005622, -0.004448, -0.362340, -0.000596, -0.000699, -0.417795, -0.000842, -0.001755, -0.001307, -0.012808, -0.274420],
  "lm_probs": [],
  "context_scores": [],
  "segment": 0,
  "start_time": 0.00,
  "is_final": false
}

However, when tested on a less well-trained model, e.g. one trained on LibriSpeech 100h, the timestamps look more accurate, even though the transcription contains errors:
{
  "text": " ALONG AS IT TO GET IT LOOKED BUT WEST BY TEXY",
  "tokens": [" A", "LO", "NG", " AS", " IT", " TO", " GET", " IT", " LOOK", "ED", " BUT", " WE", "ST", " BY", " T", "E", "X", "Y"],
  "timestamps": [0.16, 0.24, 0.28, 0.44, 0.56, 0.84, 0.96, 1.08, 1.20, 1.40, 1.56, 2.00, 2.12, 2.24, 2.48, 2.52, 2.56, 2.76],
  "ys_probs": [-0.098348, -0.030746, -0.019178, -0.684411, -0.811485, -0.696930, -0.107058, -1.197881, -1.005888, -0.265035, -0.968925, -0.300151, -0.342162, -0.085697, -0.258904, -0.035624, -0.046419, -0.402710],
  "lm_probs": [],
  "context_scores": [],
  "segment": 0,
  "start_time": 0.00,
  "is_final": false
}

I looked through the related issue k2-fsa/sherpa#52; is this just an inherent issue with the transducer architecture?
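For reference, the timestamps above are all multiples of 0.04 s, consistent with decoding at the encoder's subsampled frame rate. A minimal sketch of the frame-to-time conversion, assuming a 10 ms feature frame shift and a subsampling factor of 4 (both are assumptions, not values read from the model config):

```python
# Hypothetical sketch: mapping an encoder output frame index to seconds.
FRAME_SHIFT_S = 0.01  # assumed 10 ms fbank frame shift
SUBSAMPLING = 4       # assumed encoder subsampling factor (25 Hz output)

def frame_to_time(frame_index: int) -> float:
    """Timestamp (seconds) of an encoder output frame."""
    return round(frame_index * SUBSAMPLING * FRAME_SHIFT_S, 2)

# A token emitted on encoder frame 8 is stamped 0.32 s. Note that a
# transducer is free to emit a token on a *later* frame than the one where
# the sound occurs, so these stamps are upper bounds on true start times.
print(frame_to_time(8))  # 0.32
```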

@pengyizhou
Author

The audio file is accessible if needed.

@danpovey
Collaborator

danpovey commented Feb 8, 2025

These models are really intended for transcription; the alignment is just a side effect and not something we optimize for. If you are mainly interested in sentence-level alignment, it may be possible to simply ignore any frames that coincide with the start and end of the utterance.
It's also possible, perhaps likely, that models trained with CTC and/or CR-CTC have more accurate alignments. Our most recent example scripts have this (I think you'll find them in RESULTS.md or README.md).
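To illustrate why a CTC head can give tighter timestamps: token start times can be read directly off the CTC best path (the per-frame argmax over output units), since CTC tends to emit tokens near where they are spoken. The path below is invented for illustration; the blank id of 0 and the 0.04 s frame duration are assumptions:

```python
# Hedged sketch: recover (token_id, start_time) pairs from a CTC best path.
def ctc_starts(best_path, frame_dur=0.04, blank=0):
    """A token starts on the first frame of each non-blank run."""
    starts = []
    prev = blank
    for i, unit in enumerate(best_path):
        if unit != blank and unit != prev:
            starts.append((unit, round(i * frame_dur, 2)))
        prev = unit
    return starts

# Toy per-frame argmax sequence (0 = blank):
path = [0, 0, 5, 5, 0, 7, 7, 7, 0, 0, 9]
print(ctc_starts(path))  # [(5, 0.08), (7, 0.2), (9, 0.4)]
```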

@pengyizhou
Author

> These models are really intended for transcription, the alignment is just a side effect and it's not really something we optimize for. If you are mainly interested in the sentence-level alignment it may be possible to just ignore any frames that coincide with the start and end of the utterance. It's also possible, perhaps likely, that models trained with CTC and/or CR-CTC have more accurate alignments. Our most recent example scripts (I think you'll find them in the RESULTS.md or README.md) have this.

Thank you so much, Dan!
We are now moving from the pruned transducer to architectures with CR-CTC.
Since we already have a well-trained Zipformer-transducer model, we are trying to determine whether we can add a CTC head on top of it (freezing the pretrained parameters first, then unfreezing after several epochs) and tune the CTC head with the CR loss directly, to reduce overall training time.
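The staged schedule described above could be sketched as follows; the warm-up length and the parameter-group names are assumptions for illustration, not icefall options:

```python
# Hypothetical sketch of the freeze-then-unfreeze fine-tuning schedule:
# only the new CTC head trains during warm-up, then everything is tuned.
FREEZE_EPOCHS = 3  # assumed warm-up length before unfreezing

def trainable_groups(epoch: int) -> set:
    """Parameter groups that should receive gradients at `epoch`."""
    groups = {"ctc_head"}  # the newly added CTC projection always trains
    if epoch >= FREEZE_EPOCHS:
        # After warm-up, unfreeze the pretrained transducer as well.
        groups |= {"encoder", "decoder", "joiner"}
    return groups

print(trainable_groups(0))  # {'ctc_head'}
```

In a PyTorch training loop this would translate to toggling `requires_grad` on each module's parameters at the start of every epoch.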

Will update here shortly.

@danpovey
Collaborator

danpovey commented Feb 8, 2025

Sure, that might be worth trying.
