Commit 0ffda74: update link for vocoders

yerfor committed May 13, 2022 · 1 parent 8e4249e
Showing 5 changed files with 110 additions and 26 deletions.
19 changes: 10 additions & 9 deletions README-zh.md
@@ -18,9 +18,9 @@

## Environments

-```
+```bash
conda create -n synta python=3.7
-source activate synta
+conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0
@@ -29,7 +29,6 @@ pip install -r requirements.txt
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install force alignment tools
```

## Run SyntaSpeech!
@@ -40,19 +39,19 @@ bash mfa_usr/install_mfa.sh # install force alignment tools

#### Dataset Preparation

-You can directly use our processed [LJSpeech](https://drive.google.com/file/d/1WfErAxKqMluQU3vupWS6VB6NdehXwCKM/view?usp=sharing) and [Biaobei](https://drive.google.com/file/d/1-ApEbBrW5kfF0jM18EmW7DCsll-c1ROp/view?usp=sharing) datasets. Download them from the given Google Drive links and unzip them into the `data/binary/` folder.
+You can directly use our processed [LJSpeech](https://drive.google.com/file/d/1WfErAxKqMluQU3vupWS6VB6NdehXwCKM/view?usp=sharing) and [Biaobei](https://drive.google.com/file/d/1n_7NaGCiyieG5TTsPznI1tpHE9q3x9yt/view?usp=sharing) datasets. Download them from the given Google Drive links and unzip them into the `data/binary/` folder.

As for LibriTTS, you can download the raw dataset and process it with our `data_gen` modules. Detailed instructions can be found in [docs/prepare_data](docs/prepare_data.md).
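
As a rough sketch of that processing step (the script path and config name below follow the NATSpeech-style `data_gen` layout and are assumptions, not commands taken from this commit; `docs/prepare_data.md` is authoritative):

```bash
export PYTHONPATH=./
# assumed script and config paths, following the NATSpeech-style data_gen layout
python data_gen/tts/runs/align_and_binarize.py --config egs/tts/libritts/synta.yaml
```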

#### Vocoder Preparation

-We provide pre-trained vocoder models for the three datasets. Specifically, HiFi-GAN for [LJSpeech]() and [Biaobei](), and ParallelWaveGAN for [LibriTTS](). Download and unzip them into the `checkpoints/` folder.
+We provide pre-trained vocoder models for the three datasets. Specifically, HiFi-GAN for [LJSpeech](https://drive.google.com/file/d/1D8ABD4fa7TK6t_ymzzhtxsWHPhg7OXcG/view?usp=sharing) and [Biaobei](https://drive.google.com/file/d/1onZbPA7rjR1ibmyV1Z-7G22j2Nekiic5/view?usp=sharing), and ParallelWaveGAN for [LibriTTS](https://drive.google.com/file/d/1AziBns4R6UDtrAWaIBRm5hWg9io38EWh/view?usp=sharing). Download and unzip them into the `checkpoints/` folder.

### 2. Start Training!

Then you can train SyntaSpeech on the three datasets.

-```
+```bash
cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
@@ -62,23 +61,25 @@ CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml -

### 3. Tensorboard

-```
+```bash
tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta
```

### 4. Model Inference

-```
+```bash
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference in LibriTTS
```

## Audio Demos

-Audio samples can be found on our [demo page](https://syntaspeech.github.io/).
+Audio samples in the paper can be found on our [demo page](https://syntaspeech.github.io/).

We also provide a [HuggingFace demo page](https://huggingface.co/spaces/NATSpeech/PortaSpeech) for LJSpeech. Feel free to try your own sentences there!

## Citation

17 changes: 9 additions & 8 deletions README.md
@@ -18,9 +18,9 @@ Our SyntaSpeech is built on the basis of [PortaSpeech](https://github.com/NATSp

## Environments

-```
+```bash
conda create -n synta python=3.7
-source activate synta
+conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0
@@ -29,7 +29,6 @@ pip install -r requirements.txt
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install force alignment tools
```

## Run SyntaSpeech!
@@ -46,13 +45,13 @@ As for LibriTTS, you can download the raw datasets and process them with our `da

#### Vocoder Preparation

-We provide pre-trained vocoder models for the three datasets. Specifically, HiFi-GAN for [LJSpeech]() and [Biaobei](), and ParallelWaveGAN for [LibriTTS](). Download and unzip them into the `checkpoints/` folder.
+We provide pre-trained vocoder models for the three datasets. Specifically, HiFi-GAN for [LJSpeech](https://drive.google.com/file/d/1D8ABD4fa7TK6t_ymzzhtxsWHPhg7OXcG/view?usp=sharing) and [Biaobei](https://drive.google.com/file/d/1onZbPA7rjR1ibmyV1Z-7G22j2Nekiic5/view?usp=sharing), and ParallelWaveGAN for [LibriTTS](https://drive.google.com/file/d/1AziBns4R6UDtrAWaIBRm5hWg9io38EWh/view?usp=sharing). Download and unzip them into the `checkpoints/` folder.
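
For example, a minimal sketch of the unpacking step (the archive names are placeholders; use whatever the Google Drive links download as):

```bash
mkdir -p checkpoints
# placeholder archive names standing in for the Google Drive downloads
unzip hifi_lj.zip -d checkpoints/
unzip hifi_biaobei.zip -d checkpoints/
unzip pwg_libritts.zip -d checkpoints/
```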

### 2. Training Example

Then you can train SyntaSpeech on the three datasets.

-```
+```bash
cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
@@ -62,23 +61,25 @@ CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml -

### 3. Tensorboard

-```
+```bash
tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta
```

### 4. Inference Example

-```
+```bash
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference in LibriTTS
```

## Audio Demos

-Audio samples can be found on our [demo page](https://syntaspeech.github.io/).
+Audio samples in the paper can be found on our [demo page](https://syntaspeech.github.io/).

We also provide a [HuggingFace demo page](https://huggingface.co/spaces/NATSpeech/PortaSpeech) for LJSpeech. Feel free to try your own sentences there!

## Citation

14 changes: 10 additions & 4 deletions inference/tts/base_tts_infer.py
@@ -62,7 +62,8 @@ def preprocess_input(self, inp):
        ph_token = self.ph_encoder.encode(ph)
        spk_id = self.spk_map[spk_name]
        item = {'item_name': item_name, 'text': txt, 'ph': ph, 'spk_id': spk_id,
-                'ph_token': ph_token, 'word_token': word_token, 'ph2word': ph2word}
+                'ph_token': ph_token, 'word_token': word_token, 'ph2word': ph2word,
+                'ph_words': ph_gb_word, 'words': word}
        item['ph_len'] = len(item['ph_token'])
        return item

@@ -105,9 +106,14 @@ def example_run(cls):
        from utils.audio.io import save_wav

        set_hparams()
-        inp = {
-            'text': 'the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.'
-        }
+        if hp['ds_name'] in ['lj', 'libritts']:
+            inp = {
+                'text': 'the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.'
+            }
+        elif hp['ds_name'] in ['biaobei']:
+            inp = {
+                'text': '如果我想你三遍,天上乌云就散一片。'
+            }
        infer_ins = cls(hp)
        out = infer_ins.infer_once(inp)
        os.makedirs('infer_out', exist_ok=True)
10 changes: 5 additions & 5 deletions inference/tts/gradio/gradio_settings.yaml
@@ -1,12 +1,12 @@
-title: 'NATSpeech/PortaSpeech'
+title: 'yerfor/SyntaSpeech'
 description: |
-  Gradio demo for NATSpeech/PortaSpeech. To use it, simply add your audio, or click one of the examples to load them. Note: This space is running on CPU, inference times will be higher.
+  Gradio demo for yerfor/SyntaSpeech. To use it, simply add your audio, or click one of the examples to load them. Note: This space is running on CPU, inference times will be higher.
 article: |
-  Link to <a href='https://github.com/NATSpeech/NATSpeech/blob/main/docs/portaspeech.md' style='color:blue;' target='_blank\'>Github REPO</a>
+  Link to <a href='https://github.com/yerfor/SyntaSpeech' style='color:blue;' target='_blank\'>Github REPO</a>
 example_inputs:
   - |-
     the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.
   - |-
     produced the block books, which were the immediate predecessors of the true printed book,
-inference_cls: inference.tts.ps_flow.PortaSpeechFlowInfer
-exp_name: ps_normal_exp
+inference_cls: inference.tts.synta.SyntaSpeechInfer
+exp_name: lj_synta
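
These settings are presumably consumed by a NATSpeech-style Gradio launcher; a hedged launch sketch, assuming the `inference/tts/gradio/infer.py` entry point carried over from NATSpeech:

```bash
# assumed entry point, inherited from the NATSpeech gradio layout
python inference/tts/gradio/infer.py
```
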
76 changes: 76 additions & 0 deletions inference/tts/synta.py
@@ -0,0 +1,76 @@
import torch

from inference.tts.base_tts_infer import BaseTTSInfer
from modules.tts.syntaspeech.syntaspeech import SyntaSpeech
from modules.tts.syntaspeech.syntactic_graph_buider import Sentence2GraphParser
from utils.commons.ckpt_utils import load_ckpt
from utils.commons.hparams import hparams


class SyntaSpeechInfer(BaseTTSInfer):
    def __init__(self, hparams, device=None):
        super().__init__(hparams, device)
        # pick the syntactic parser that matches the dataset's language
        if hparams['ds_name'] in ['biaobei']:
            self.syntactic_graph_builder = Sentence2GraphParser(language='zh')
        elif hparams['ds_name'] in ['lj', 'ljspeech', 'libritts']:  # accept both 'lj' and 'ljspeech', matching example_run in base_tts_infer
            self.syntactic_graph_builder = Sentence2GraphParser(language='en')

    def build_model(self):
        ph_dict_size = len(self.ph_encoder)
        word_dict_size = len(self.word_encoder)
        model = SyntaSpeech(ph_dict_size, word_dict_size, self.hparams)
        load_ckpt(model, hparams['work_dir'], 'model')
        model.to(self.device)
        with torch.no_grad():
            model.store_inverse_all()  # pre-compute inverses of the flow's invertible layers for faster inference
        model.eval()
        return model

    def input_to_batch(self, item):
        item_names = [item['item_name']]
        text = [item['text']]
        ph = [item['ph']]
        txt_tokens = torch.LongTensor(item['ph_token'])[None, :].to(self.device)
        txt_lengths = torch.LongTensor([txt_tokens.shape[1]]).to(self.device)
        word_tokens = torch.LongTensor(item['word_token'])[None, :].to(self.device)
        word_lengths = torch.LongTensor([word_tokens.shape[1]]).to(self.device)
        ph2word = torch.LongTensor(item['ph2word'])[None, :].to(self.device)
        # wrap the int id in a list so LongTensor builds a 1-element tensor
        spk_ids = torch.LongTensor([item['spk_id']])[None, :].to(self.device)
        # build the syntactic graph for the input sentence
        dgl_graph, etypes = self.syntactic_graph_builder.parse(
            item['text'], words=item['words'].split(" "), ph_words=item['ph_words'].split(" "))
        dgl_graph = dgl_graph.to(self.device)
        etypes = etypes.to(self.device)
        batch = {
            'item_name': item_names,
            'text': text,
            'ph': ph,
            'txt_tokens': txt_tokens,
            'txt_lengths': txt_lengths,
            'word_tokens': word_tokens,
            'word_lengths': word_lengths,
            'ph2word': ph2word,
            'spk_ids': spk_ids,
            'graph_lst': [dgl_graph],
            'etypes_lst': [etypes]
        }
        return batch

    def forward_model(self, inp):
        sample = self.input_to_batch(inp)
        with torch.no_grad():
            output = self.model(
                sample['txt_tokens'],
                sample['word_tokens'],
                ph2word=sample['ph2word'],
                word_len=sample['word_lengths'].max(),
                infer=True,
                forward_post_glow=True,
                spk_id=sample.get('spk_ids'),
                graph_lst=sample['graph_lst'],
                etypes_lst=sample['etypes_lst']
            )
        mel_out = output['mel_out']
        wav_out = self.run_vocoder(mel_out)  # vocoder: mel-spectrogram -> waveform
        wav_out = wav_out.cpu().numpy()
        return wav_out[0]


if __name__ == '__main__':
    SyntaSpeechInfer.example_run()
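
Since `example_run` calls `set_hparams()`, which parses `--config` and `--exp_name` from the command line, the new class can be exercised directly; a hedged invocation sketch mirroring the README's commands (the exact flags are assumptions):

```bash
# assumed flags, mirroring the README's inference commands; output lands in infer_out/
CUDA_VISIBLE_DEVICES=0 python inference/tts/synta.py --config egs/tts/lj/synta.yaml --exp_name lj_synta
```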
