Audio & Speech Generation Paper List

A curated list of papers and resources related to speech and audio generation. This project is just getting started and still needs a lot of work, so feel free to contribute!

Papers

Text to Speech (TTS)

Voicebox: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (2023-06, Meta) [Unofficial Code, Blog/Demo]
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023-01, Microsoft) [Unofficial Code]
YourTTS: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone (ICML 2022) [Code]
FastSpeech 2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (ICLR 2021) [Unofficial Code] A non-autoregressive text-to-speech (TTS) model designed to address the one-to-many mapping problem in TTS more effectively, while outperforming autoregressive models in voice quality.

Voice Conversion (VC)

YourTTS: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone (ICML 2022) [Code]

Audio Generation and Text to Audio (TTA)

Note that many audio generation models can also generate speech.

Singing Voice Synthesis (SVS)

Speech to Speech Translation (S2ST/STST)

Seamless: Seamless: Multilingual Expressive and Streaming Speech Translation (2023-12, Meta) [Official Code & Model]
Translatotron 3: Translatotron 3: Speech to Speech Translation with Monolingual Data (2023-05, Google) [Demo]
VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023-03, Microsoft) [Demo, Unofficial Code]

Streaming & Simultaneous Translation

STACL: STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework (ACL 2019) [Demo]

Speech Translation Dataset

CoVoST 2: CoVoST 2 and Massively Multilingual Speech Translation (Interspeech 2021, Meta) [Official Code & Data]

Text to Music (TTM)

Large Language Model (LLM)

AudioPaLM: AudioPaLM: A Large Language Model That Can Speak and Listen (2023-06, Google) [Demo] A large language model that fuses a text-based and a speech-based language model, PaLM-2 and AudioLM, into a unified multimodal architecture that can process and generate both text and speech, with applications including speech recognition and speech-to-speech translation.
AudioLM: AudioLM: A Language Modeling Approach to Audio Generation (TASLP 2023, Google) [Code, Blog, Demo] AudioLM learns to generate natural and coherent continuations given short prompts.

Software/Libraries

Speech Synthesis

BERT-VITS2: A TTS tool that performs well on Chinese speech synthesis.
Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit. The goal of Amphion is to offer a platform for studying the conversion of any input into audio. (TTS, SVS, VC, SVC, TTA, TTM) [Paper, Video (Chinese)]
SpeechBrain: A PyTorch-based Speech Toolkit.
ESPnet: An End-to-End Speech Processing Toolkit.
