apple/speakstream-demo

SpeakStream: Streaming Text-to-Speech with Interleaved Data

Arxiv OpenReview

In this work, we present SpeakStream, a streaming text-to-speech (TTS) system that generates audio incrementally from streaming text using a decoder-only architecture. The model is trained with a next-step prediction loss on force-aligned, interleaved text-speech data. During inference, SpeakStream generates speech incrementally while absorbing streaming text, making it suitable for cascaded conversational AI agents in which an LLM streams text to a TTS system. Our experiments show that SpeakStream matches the quality of non-streaming TTS while supporting streaming.
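
To make the idea of force-aligned, interleaved text-speech data concrete, here is a minimal sketch of how such a training sequence could be assembled. It is an illustration under assumptions, not the data pipeline used for SpeakStream: the `AlignedWord` type, the `interleave` function, the `words_per_chunk` parameter, and the tagged-tuple sequence format are all hypothetical, and speech is assumed to be already discretized into one token per frame.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class AlignedWord:
    """A word with its forced-alignment span (hypothetical format)."""
    text: str
    start_frame: int  # first speech frame of the word (inclusive)
    end_frame: int    # last speech frame of the word (exclusive)


Token = Tuple[str, Union[str, int]]


def interleave(words: List[AlignedWord],
               speech_tokens: List[int],
               words_per_chunk: int = 2) -> List[Token]:
    """Build one sequence that alternates text chunks and speech chunks.

    Each chunk of words is followed by the speech tokens that run from the
    end of the previous chunk up to the start of the next chunk, so pauses
    between words stay attached to the preceding text.
    """
    sequence: List[Token] = []
    cursor = 0  # index of the first speech frame not yet emitted
    for i in range(0, len(words), words_per_chunk):
        chunk = words[i:i + words_per_chunk]
        sequence.extend(("text", w.text) for w in chunk)

        next_index = i + words_per_chunk
        if next_index < len(words):
            end = words[next_index].start_frame
        else:
            end = len(speech_tokens)  # last chunk takes the remaining audio
        sequence.extend(("speech", t) for t in speech_tokens[cursor:end])
        cursor = end
    return sequence


words = [
    AlignedWord("hello", 0, 3),
    AlignedWord("world", 3, 5),
    AlignedWord("again", 6, 8),
]
speech_tokens = list(range(100, 108))  # 8 placeholder speech token ids
sequence = interleave(words, speech_tokens, words_per_chunk=2)
```

A decoder-only model trained with next-step prediction on sequences of this shape sees text shortly before the speech it describes, which is what would let it start producing audio before the full sentence has arrived. The chunk size here is an arbitrary choice for illustration.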

Figures: model architecture; latency results.

Generated Samples

The repository provides examples of generated speech from models trained on the LJSpeech dataset.

License

Citations

@article{bai2025speakstream,
  title={SpeakStream: Streaming Text-to-Speech with Interleaved Data},
  author={Bai, He and Gu, Zijin and Likhomanenko, Tatiana and Jaitly, Navdeep},
  journal={arXiv preprint arXiv:2505.19206},
  year={2025}
}
