SpeakStream: Streaming Text-to-Speech with Interleaved Data

In this work, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. The model is trained with a next-step prediction loss on force-aligned, interleaved text-speech data. During inference, SpeakStream generates speech incrementally while absorbing streaming text, making it suitable for cascaded conversational AI agents in which an LLM streams text to a TTS system. Our experiments show that SpeakStream matches the quality of non-streaming TTS while adding streaming capability.
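The interleaved training data described above can be pictured with a small sketch. This is an illustration only and is not taken from the SpeakStream code: it assumes word-level forced alignment, discrete speech tokens, and a fixed number of words per chunk, and the names `AlignedWord`, `interleave`, `next_step_pairs`, and `words_per_chunk` are hypothetical. The actual tokenization and chunking used by the model may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedWord:
    """One transcript word plus the speech tokens force-aligned to it (hypothetical format)."""
    text: str
    speech_tokens: List[int]


def interleave(words: List[AlignedWord], words_per_chunk: int = 2) -> List[str]:
    """Build one sequence: a chunk of text tokens, then the speech tokens aligned to it, repeated."""
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")
    sequence: List[str] = []
    for start in range(0, len(words), words_per_chunk):
        chunk = words[start:start + words_per_chunk]
        # Text tokens for this chunk come first ...
        sequence.extend(f"<t:{w.text}>" for w in chunk)
        # ... followed by the speech tokens aligned to those same words.
        sequence.extend(f"<s:{tok}>" for w in chunk for tok in w.speech_tokens)
    return sequence


def next_step_pairs(sequence: List[str]) -> List[Tuple[str, str]]:
    """Pair each token with its successor, the targets a next-step prediction loss would use."""
    return list(zip(sequence[:-1], sequence[1:]))


if __name__ == "__main__":
    words = [
        AlignedWord("hello", [1, 2]),
        AlignedWord("world", [3]),
        AlignedWord("again", [4, 5]),
    ]
    print(interleave(words, words_per_chunk=2))
```

Because each speech chunk follows only the text it is aligned to, a decoder-only model trained on sequences like this can, in principle, begin emitting speech for early words before later text has arrived.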

Generated Samples

The repository provides examples of generated speech from models trained on the LJSpeech dataset.

License

This repository is released under the terms in LICENSE.
All generated speech samples are licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Citations

@article{bai2025speakstream,
  title={SpeakStream: Streaming Text-to-Speech with Interleaved Data},
  author={Bai, He and Gu, Zijin and Likhomanenko, Tatiana and Jaitly, Navdeep},
  journal={arXiv preprint arXiv:2505.19206},
  year={2025}
}
