
Question About h_bert in paper and bert_dur & d_en in implementation. #307

littlestronomer opened this issue Jan 25, 2025 · 0 comments


In the paper, h_bert is described as the output of the Prosodic Text Encoder module. In the code, however, I believe this corresponds to the variable d_en, which is not the variable that is fed into the other components.

train_second.py, lines 309–310:

```python
bert_dur = model.bert(texts, attention_mask=(~text_mask).int())
d_en = model.bert_encoder(bert_dur).transpose(-1, -2)
```

In the implementation, bert_dur plays the role of h_bert from the paper. But as noted above, the paper defines h_bert as the Prosodic Text Encoder's output, which I believe is d_en in the implementation:

train_second.py, lines 321–328:

```python
s_preds = sampler(noise=torch.randn_like(s_trg).unsqueeze(1).to(device),
                  embedding=bert_dur,
                  embedding_scale=1,
                  features=ref,  # reference from the same speaker as the embedding
                  embedding_mask_proba=0.1,
                  num_steps=num_steps).squeeze(1)
loss_diff = model.diffusion(s_trg.unsqueeze(1), embedding=bert_dur, features=ref).mean()  # EDM loss
loss_sty = F.l1_loss(s_preds, s_trg.detach())  # style reconstruction loss
```
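To make the distinction concrete, here is a minimal shape sketch of the two variables. This is a standalone illustration, not the repository's code: the dimensions (768 for the PL-BERT hidden size, 512 for the encoder channels) are hypothetical placeholders, and a plain linear projection stands in for model.bert_encoder.

```python
import numpy as np

# Hypothetical dimensions, for illustration only.
batch, seq_len, bert_channels, enc_channels = 2, 50, 768, 512

# bert_dur: the raw PL-BERT output, shape (batch, seq_len, bert_channels).
# This is the tensor passed as `embedding=` to the sampler and diffusion above.
bert_dur = np.random.randn(batch, seq_len, bert_channels)

# d_en: bert_encoder projects to enc_channels, then transpose(-1, -2)
# yields (batch, enc_channels, seq_len). A random linear projection stands
# in for model.bert_encoder here.
proj = np.random.randn(bert_channels, enc_channels)
d_en = np.swapaxes(bert_dur @ proj, -1, -2)

print(bert_dur.shape)  # (2, 50, 768)
print(d_en.shape)      # (2, 512, 50)
```

So the question reduces to: if h_bert is the Prosodic Text Encoder's output, why does the diffusion receive bert_dur (the raw BERT features) rather than d_en (the encoder's output)?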

Am I missing something?
