System Info
N/A
Information
Reproduction
N/A
Expected behavior
Hi. I have a few questions about the way text is integrated in the T2V model.
T5 Model
When producing the text embeddings, you pad the tokens to `self.max_length` first, then forward through the T5 encoder (CogVideo/sat/sgm/modules/encoders/modules.py, line 276 in bbe909d).
Why don't you include an attention mask? As written, the T5 encoder attends to the padded tokens during encoding, so the pad positions leak into the text embeddings.
The better thing to do would be:

```python
outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
```
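For reference, a minimal self-contained sketch of what the masked forward could look like with Hugging Face transformers. The checkpoint name and the `max_length` value here are illustrative, not taken from the repo:

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Illustrative checkpoint; substitute whatever encoder the repo loads.
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
model = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

inputs = tokenizer(
    ["a cat playing piano"],
    padding="max_length",
    max_length=226,          # stands in for self.max_length in the repo
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # pad positions masked out
    )

# Optionally zero the embeddings at padded positions so downstream
# cross-attention sees exact zeros there as well.
emb = outputs.last_hidden_state * inputs["attention_mask"].unsqueeze(-1)
```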
Attention in DiT
When training the DiT, you concat the text to the video embedding and pass it through attention, but your attention mask is set to all ones (CogVideo/sat/dit_video_concat.py, line 855 in bbe909d).
Because your text is right-padded, you are doing attention over `[Text, Pad Tokens, Video Embedding]`. Why do you not use an attention mask here?
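For illustration, here is one way such a key-padding mask for the concatenated sequence could be built and applied with PyTorch's `scaled_dot_product_attention`. All shapes and tensor names are hypothetical, not the repo's code:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: batch, text length, video length, head dim, heads.
B, T_txt, T_vid, D, H = 2, 226, 1024, 64, 8

text_tokens = torch.randn(B, T_txt, H * D)
video_tokens = torch.randn(B, T_vid, H * D)

# True = real token, False = pad. Here the first 50 text tokens are real.
text_mask = torch.zeros(B, T_txt, dtype=torch.bool)
text_mask[:, :50] = True

x = torch.cat([text_tokens, video_tokens], dim=1)           # [B, T, H*D]
key_mask = torch.cat(
    [text_mask, torch.ones(B, T_vid, dtype=torch.bool)], dim=1
)                                                           # [B, T]

q = k = v = x.view(B, -1, H, D).transpose(1, 2)             # [B, H, T, D]

# Broadcast the key-padding mask to [B, 1, 1, T]: a query may attend to a
# key only where the mask is True, so padded text is never attended to.
attn_mask = key_mask[:, None, None, :]
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
out = out.transpose(1, 2).reshape(B, -1, H * D)
```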