Get audio data in real-time #122

Open
t41372 opened this issue Jul 2, 2024 · 5 comments

Comments

@t41372

t41372 commented Jul 2, 2024

Is there a way to get the audio data while the speech is active before it ends? I want to get the audio data when the speech starts, stream it to the back end in real-time and stop streaming when it ends. It seems like the onFrameProcessed callback only has a probability property.
Thanks

@rahulbansal16

Which model will you use in the backend to transcribe it? Does the word error rate increase when you transcribe partial audio that way?

@JettScythe

Bumping, as this is exactly what I need. I already have instances of Whisper available for transcription / translation on the back end, but reducing response latency means getting chunks transcribed as they appear. I suspect a reasonable "chunk" is one sentence.

@Oudwins

Oudwins commented Jan 21, 2025

Bumping this as well; I'm also interested. I tried to implement it manually, but when I concatenate all the frames from onFrameProcessed (as a test), the resulting audio is quite bad. It seems like this package, under the hood, merges frames with a sliding-window approach, so the frames overlap and simply concatenating them produces echo and generally poor audio.

@ricky0123
Owner

This is something I would like to add better support for in the future, but it will be a significant addition and I'm still not sure what form I want the API to take. In the meantime, have you tried writing separate code to stream audio to your servers and just using the VAD callbacks for voice activity signals?
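
For illustration, here is a minimal sketch of that suggestion, assuming the audio is captured by separate code and the VAD callbacks are used only as on/off signals. `SpeechGate` and `sink` are hypothetical names, not part of this library's API; only the `onSpeechStart` / `onSpeechEnd` method names mirror the callbacks mentioned in this thread.

```typescript
// Hypothetical helper: forwards audio chunks to a sink only while speech is active.
// The chunks come from your own capture code, not from the VAD library.
type Sink = (chunk: Float32Array) => void;

class SpeechGate {
  private open = false;
  private readonly sink: Sink;

  constructor(sink: Sink) {
    this.sink = sink;
  }

  // Call with every chunk produced by your own audio capture path.
  push(chunk: Float32Array): void {
    if (this.open) this.sink(chunk);
  }

  // Wire these two to the VAD's speech start / speech end callbacks.
  onSpeechStart(): void {
    this.open = true;
  }

  onSpeechEnd(): void {
    this.open = false;
  }
}
```

Note that a plain gate like this drops whatever was spoken before the VAD fired, so the first syllable may be clipped unless some pre-speech audio is buffered as well.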

@Oudwins

Oudwins commented Jan 22, 2025

@ricky0123 Yes, I tried to implement streaming using the VAD callbacks, with an algorithm that I think could work quite well for the library:

  • While idle, keep track of the most recent padding frames
  • On onSpeechStart, start keeping track of speech frames
  • Once the speech exceeds the minimum number of frames, flush the padding and saved speech frames to the server and start streaming all new frames to the server
  • On onSpeechEnd, stop streaming new frames to the server
  • On onVADMisfire, empty the buffered speech frames

But while implementing this, I ran into the issue described in #186.
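
The steps in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SpeechStreamer`, `send`, `padFrames`, and `minSpeechFrames` are hypothetical names, the frames passed to `onFrame` are assumed to be non-overlapping, and flushing a short utterance on `onSpeechEnd` is an added assumption that the list does not cover.

```typescript
// Hypothetical sketch of the buffering algorithm; not the library's API.
type Frame = Float32Array;
type Send = (frames: Frame[]) => void;

class SpeechStreamer {
  private padding: Frame[] = []; // rolling window of pre-speech frames
  private pending: Frame[] = []; // speech frames held until speech is confirmed
  private state: "idle" | "pending" | "streaming" = "idle";
  private readonly send: Send;
  private readonly padFrames: number;
  private readonly minSpeechFrames: number;

  constructor(send: Send, padFrames = 3, minSpeechFrames = 3) {
    this.send = send;
    this.padFrames = padFrames;
    this.minSpeechFrames = minSpeechFrames;
  }

  // Call for every captured frame (assumed non-overlapping).
  onFrame(frame: Frame): void {
    if (this.state === "idle") {
      // Keep only the last `padFrames` frames as pre-speech padding.
      this.padding.push(frame);
      if (this.padding.length > this.padFrames) this.padding.shift();
    } else if (this.state === "pending") {
      this.pending.push(frame);
      if (this.pending.length >= this.minSpeechFrames) {
        // Speech confirmed: flush padding plus buffered speech, then stream live.
        this.flush();
        this.state = "streaming";
      }
    } else {
      this.send([frame]);
    }
  }

  onSpeechStart(): void {
    if (this.state === "idle") this.state = "pending";
  }

  onSpeechEnd(): void {
    // Assumption: if the VAD reports a real speech end before the local
    // threshold was reached, send what was buffered rather than dropping it.
    if (this.state === "pending") this.flush();
    this.pending = [];
    this.state = "idle";
  }

  onVADMisfire(): void {
    // False start: discard buffered speech frames and go back to idle.
    this.pending = [];
    this.state = "idle";
  }

  private flush(): void {
    this.send([...this.padding, ...this.pending]);
    this.padding = [];
    this.pending = [];
  }
}
```

The `send` callback is where a WebSocket or similar transport would go. This sketch does not address the frame overlap problem mentioned earlier in the thread; it only covers the buffering logic.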
