Audio & Speech Generation Paper List

A curated list of papers and resources related to Speech & Audio Generation. This project is in its early stages and many sections are still incomplete, so contributions are welcome!

Paper

Text to Speech (TTS)

Voicebox: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (2023-06, Meta) [Unofficial code, Blog/Demo]
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023-01, Microsoft) [Unofficial code]
YourTTS: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone (ICML 2022) [Code]
FastSpeech2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (ICLR 2021) [Unofficial Code] A non-autoregressive text-to-speech (TTS) model designed to more effectively address the one-to-many mapping challenge in TTS, while outperforming autoregressive models in terms of voice quality.

Voice Conversion (VC)

YourTTS: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone (ICML 2022) [Code]

Audio Generation and Text to Audio (TTA)

Note that many audio generation models can also generate speech.

Singing Voice Synthesis (SVS)

Text to Music (TTM)

Large Language Model (LLM)

AudioPaLM: AudioPaLM: A Large Language Model That Can Speak and Listen (2023-06, Google) [Demo] A large language model which fuses text-based and speech-based language models, PaLM-2 and AudioLM, into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
AudioLM: AudioLM: a language modeling approach to audio generation (TASLP 2023, Google) [Code, Blog, Demo] AudioLM learns to generate natural and coherent continuations given short prompts.

Software/Libraries

Speech Synthesis

BERT-VITS2: A TTS tool that performs well on Chinese speech synthesis.
Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit. The goal of Amphion is to offer a platform for studying the conversion of any input into audio. (TTS, SVS, VC, SVC, TTA, TTM) [Paper, Video (Chinese)]
SpeechBrain: A PyTorch-based Speech Toolkit.
ESPnet: An End-to-End Speech Processing Toolkit.

Haulyn5/Audio-Speech-Generation-Paper-List