A curated list of papers and resources related to speech and audio generation. This project is just getting started and still needs a lot of work, so feel free to contribute!
- Voicebox: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (2023-06, Meta) [Unofficial Code, Blog/Demo]
- VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023-01, Microsoft) [Unofficial code]
- YourTTS: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone (ICML 2022) [Code]
- FastSpeech 2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (ICLR 2021) [Unofficial Code] A non-autoregressive text-to-speech (TTS) model designed to better address the one-to-many mapping problem in TTS while outperforming autoregressive models in voice quality.
- YourTTS: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone (ICML 2022) [Code]
Note that many audio generation models can also generate speech.
- Seamless: Seamless: Multilingual Expressive and Streaming Speech Translation (2023-12, Meta) [Official Code & Model]
- Translatotron 3: Translatotron 3: Speech to Speech Translation with Monolingual Data (2023-05, Google) [Demo]
- VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023-03, Microsoft) [Demo, Unofficial Code]
- STACL: STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework (ACL 2019) [Demo]
- CoVoST 2: CoVoST 2 and Massively Multilingual Speech Translation (Interspeech 2021, Meta) [Official Code & Data]
- AudioPaLM: AudioPaLM: A Large Language Model That Can Speak and Listen (2023-06, Google) [Demo] A large language model that fuses text-based and speech-based language models, PaLM-2 and AudioLM, into a unified multimodal architecture that can process and generate both text and speech, with applications including speech recognition and speech-to-speech translation.
- AudioLM: AudioLM: a language modeling approach to audio generation (TASLP 2023, Google) [Code, Blog, Demo] AudioLM learns to generate natural and coherent continuations given short prompts.
- BERT-VITS2: A TTS tool that performs well on Chinese speech synthesis.
- Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit. The goal of Amphion is to offer a platform for studying the conversion of any input into audio (TTS, SVS, VC, SVC, TTA, TTM). [Paper, Video (Chinese)]
- SpeechBrain: A PyTorch-based Speech Toolkit.
- ESPnet: An End-to-End Speech Processing Toolkit.