Accurate Starting Timestamps for Zipformer2-Pruned-transducer streaming models #1874
Comments:
The audio file is accessible if needed.
These models are really intended for transcription; the alignment is just a side effect and not really something we optimize for. If you are mainly interested in sentence-level alignment, it may be possible to just ignore any frames that coincide with the start and end of the utterance.
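A minimal sketch of this suggestion, assuming timestamps are in seconds and the audio duration is known; the function name, the 40 ms frame width, and the boundary rule are hypothetical choices, not anything confirmed in this thread:

```python
def sentence_span(timestamps, audio_dur_s, frame_s=0.04):
    """Estimate a sentence-level (start, end) from per-token timestamps,
    ignoring tokens pinned to the very first or very last output frame,
    which transducer decoding often places unreliably."""
    inner = [t for t in timestamps if 0.0 < t < audio_dur_s - frame_s]
    if not inner:
        # Every token sat on a boundary frame; fall back to the full clip.
        return 0.0, audio_dur_s
    return inner[0], inner[-1]
```

With the 960h output below, this would drop the first token (stamped at 0.00 s) and the tokens at or beyond the final frame, yielding a tighter sentence span.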
Thank you so much, Dan! Will update here shortly.
Sure, that might be worth trying.
Hi guys,
Recently we have been trying to get the starting timestamp of each non-blank token from a streaming Zipformer2 pruned transducer, but the times are oddly inaccurate.
We tested with a 16 kHz audio clip that is 2.84 s long.
We tested both our own trained models and the officially released pretrained model trained on LibriSpeech 960h. The timestamps from these well-trained models look like this:
```json
{
  "text": " HOW LONG DOES IT TAKE TO GET A BUKIT BATOK WEST BY TAXI",
  "tokens": [" HOW", " LONG", " DOES", " IT", " TAKE", " TO", " GET", " A", " B", "U", "K", "IT", " BA", "TO", "K", " WE", "S", "T", " BY", " T", "A", "X", "I"],
  "timestamps": [0.00, 0.32, 0.52, 0.68, 0.84, 0.96, 1.08, 1.20, 1.36, 1.40, 1.48, 1.52, 1.72, 1.84, 1.92, 2.12, 2.28, 2.32, 2.56, 2.68, 2.76, 2.80, 3.04],
  "ys_probs": [-0.005738, -0.004155, -0.698301, -0.005920, -0.004257, -0.758217, -0.001242, -1.111435, -0.001471, -0.209558, -0.005508, -0.009295, -0.005622, -0.004448, -0.362340, -0.000596, -0.000699, -0.417795, -0.000842, -0.001755, -0.001307, -0.012808, -0.274420],
  "lm_probs": [],
  "context_scores": [],
  "segment": 0,
  "start_time": 0.00,
  "is_final": false
}
```
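Two observations about this output can be checked mechanically. The timestamps are all multiples of 0.04 s, which is consistent with 10 ms feature frames and a 4x encoder subsampling factor (both are assumptions about the recipe, not stated in this issue), and the final token is stamped at 3.04 s even though the clip is only 2.84 s long. A small sketch, with hypothetical names:

```python
SUBSAMPLING = 4        # assumed encoder subsampling factor
FRAME_SHIFT_S = 0.01   # assumed fbank frame shift (10 ms)

def frame_to_seconds(frame_idx: int) -> float:
    """Convert an encoder output frame index to seconds,
    assuming the subsampling and frame shift above."""
    return frame_idx * SUBSAMPLING * FRAME_SHIFT_S

def late_tokens(timestamps, audio_dur_s):
    """Return timestamps that fall beyond the end of the audio,
    a symptom of the delayed emission typical of transducers."""
    return [t for t in timestamps if t > audio_dur_s]
```

For the output above, `late_tokens(timestamps, 2.84)` flags the final token at 3.04 s, so at least the tail of the alignment cannot be physically correct.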
However, when we test a less well-trained model, e.g. one trained on LibriSpeech 100h, the timestamps look more accurate, even though the transcription contains errors:
```json
{
  "text": " ALONG AS IT TO GET IT LOOKED BUT WEST BY TEXY",
  "tokens": [" A", "LO", "NG", " AS", " IT", " TO", " GET", " IT", " LOOK", "ED", " BUT", " WE", "ST", " BY", " T", "E", "X", "Y"],
  "timestamps": [0.16, 0.24, 0.28, 0.44, 0.56, 0.84, 0.96, 1.08, 1.20, 1.40, 1.56, 2.00, 2.12, 2.24, 2.48, 2.52, 2.56, 2.76],
  "ys_probs": [-0.098348, -0.030746, -0.019178, -0.684411, -0.811485, -0.696930, -0.107058, -1.197881, -1.005888, -0.265035, -0.968925, -0.300151, -0.342162, -0.085697, -0.258904, -0.035624, -0.046419, -0.402710],
  "lm_probs": [],
  "context_scores": [],
  "segment": 0,
  "start_time": 0.00,
  "is_final": false
}
```
I have looked through the related issue k2-fsa/sherpa#52. Is this just an inherent issue with the transducer architecture?