
Question About h_bert in paper and bert_dur & d_en in implementation. #307

littlestronomer opened this issue Jan 25, 2025 · 0 comments


In the paper, h_bert is described as the output of the Prosodic Text Encoder module. In the code, however, I believe this corresponds to the variable d_en, which is not the variable that is fed into the other components.

train_second.py, lines 309–310:

```python
bert_dur = model.bert(texts, attention_mask=(~text_mask).int())
d_en = model.bert_encoder(bert_dur).transpose(-1, -2)
```

In the implementation, bert_dur plays the role of h_bert from the paper. But as noted above, the paper defines h_bert as the Prosodic Text Encoder's output, which I believe is d_en in the implementation:

train_second.py, lines 321–328:

```python
s_preds = sampler(noise=torch.randn_like(s_trg).unsqueeze(1).to(device),
                  embedding=bert_dur,
                  embedding_scale=1,
                  features=ref,  # reference from the same speaker as the embedding
                  embedding_mask_proba=0.1,
                  num_steps=num_steps).squeeze(1)
loss_diff = model.diffusion(s_trg.unsqueeze(1), embedding=bert_dur, features=ref).mean()  # EDM loss
loss_sty = F.l1_loss(s_preds, s_trg.detach())  # style reconstruction loss
```
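To make the distinction concrete, here is a minimal shape sketch of the two variables. This is a standalone illustration, not the repository's code: the dimensions (768 for the PL-BERT hidden size, 512 for the encoder channels) are hypothetical placeholders, and a plain linear projection stands in for model.bert_encoder.

```python
import numpy as np

# Hypothetical dimensions, for illustration only.
batch, seq_len, bert_channels, enc_channels = 2, 50, 768, 512

# bert_dur: the raw PL-BERT output, shape (batch, seq_len, bert_channels).
# This is the tensor passed as `embedding=` to the sampler and diffusion above.
bert_dur = np.random.randn(batch, seq_len, bert_channels)

# d_en: bert_encoder projects to enc_channels, then transpose(-1, -2)
# yields (batch, enc_channels, seq_len). A random linear projection stands
# in for model.bert_encoder here.
proj = np.random.randn(bert_channels, enc_channels)
d_en = np.swapaxes(bert_dur @ proj, -1, -2)

print(bert_dur.shape)  # (2, 50, 768)
print(d_en.shape)      # (2, 512, 50)
```

So the question reduces to: if h_bert is the Prosodic Text Encoder's output, why does the diffusion receive bert_dur (the raw BERT features) rather than d_en (the encoder's output)?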

Am I missing something?
