Attention Mask for T5 / T2V Attention #686

Open
karan-dalal opened this issue Jan 24, 2025 · 0 comments

System Info / 系統信息

N/A

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

N/A

Expected behavior / 期待表现

Hi. I have a few questions about how text is integrated into the T2V model.

T5 Model

When producing the text embeddings, you first pad the tokens to self.max_length and then forward them through the T5 encoder:

outputs = self.transformer(input_ids=tokens)

Why is no attention mask included? As written, the T5 encoder can attend to the padding tokens during encoding.

It would be better to forward the tokenizer's attention mask as well:
outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
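
For reference, here is a minimal sketch of what I mean, using the Hugging Face transformers API directly (the model name and max length are placeholders, not the exact values used in this repo):

```python
# Minimal sketch: encode text with T5 while masking out the padding tokens.
# "t5-base" and max_length=226 are placeholders, not this repo's settings.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

inputs = tokenizer(
    ["a cat playing piano"],
    padding="max_length",   # pad to a fixed length, as the pipeline does
    max_length=226,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = encoder(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # prevents attending to padding
    )
text_embeds = outputs.last_hidden_state  # (batch, max_length, hidden_dim)
```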

Attention in DiT

When training the DiT, you concatenate the text embeddings with the video embeddings and pass them through attention, but the attention mask is set to all ones:

kwargs["input_ids"] = kwargs["position_ids"] = kwargs["attention_mask"] = torch.ones((1, 1)).to(x.dtype)

Because the text is right-padded, attention is computed over [Text, Pad Tokens, Video Embedding]. Why is no attention mask used here?
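
To make the concern concrete, here is a rough sketch (with made-up shapes and variable names, not the repo's actual code) of how the tokenizer's padding mask could be extended over the concatenated [text, video] sequence and passed to attention:

```python
# Rough sketch with made-up shapes: build a key-padding mask for the joint
# [text, video] sequence and pass it to scaled_dot_product_attention.
import torch
import torch.nn.functional as F

B, n_text, n_video, n_heads, head_dim = 2, 226, 1024, 8, 64

# True = real text token, False = padding (from the T5 tokenizer's attention_mask).
text_mask = torch.zeros(B, n_text, dtype=torch.bool)
text_mask[:, :20] = True  # pretend the first 20 tokens are real text

# Video tokens are never padding, so they are always attendable.
video_mask = torch.ones(B, n_video, dtype=torch.bool)
key_mask = torch.cat([text_mask, video_mask], dim=1)  # (B, n_text + n_video)

# Broadcast to (B, heads, query_len, key_len): every query may attend only
# to non-padded keys (True = attend, for a boolean mask in SDPA).
attn_mask = key_mask[:, None, None, :]

L = n_text + n_video
q = torch.randn(B, n_heads, L, head_dim)
k = torch.randn(B, n_heads, L, head_dim)
v = torch.randn(B, n_heads, L, head_dim)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```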

@zRzRzRzRzRzRzR self-assigned this Jan 24, 2025