numerical difference for SDPA between non-dtensor vs dtensor, when math attention and fp16 are used #317

Open
@tianyu-l

Description

Higher loss (9.5602 vs. 9.3164) was observed for the dtensor case after 10 steps on the llama2 debug model. This happens even without applying rotary embedding, which rules out the complex-number multiplication issue mentioned in #267.
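
For illustration, here is a minimal single-rank probe (not the torchtitan training run) that compares fp16 SDPA outputs under the math backend for plain tensors versus the same tensors wrapped as replicated DTensors. The gloo process group, world-size-1 mesh, and Replicate placement are assumptions made only to keep the snippet self-contained.

```python
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed._tensor import DTensor, DeviceMesh, Replicate
from torch.nn.attention import SDPBackend, sdpa_kernel

# Single-rank process group and 1-device mesh, just to construct DTensors locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
mesh = DeviceMesh(device, [0])

torch.manual_seed(0)
# fp16 inputs mirror the dtype reported in the issue.
q, k, v = (torch.randn(2, 8, 64, 32, dtype=torch.float16, device=device) for _ in range(3))

with sdpa_kernel(SDPBackend.MATH):  # force the math backend, as in the issue
    out_plain = F.scaled_dot_product_attention(q, k, v)
    dq, dk, dv = (DTensor.from_local(t, mesh, [Replicate()]) for t in (q, k, v))
    out_dtensor = F.scaled_dot_product_attention(dq, dk, dv).to_local()

# Inspect the element-wise gap between the two code paths.
print("max abs diff:", (out_plain - out_dtensor).abs().max().item())
dist.destroy_process_group()
```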

Note: to run math attention with dtensor, one needs to set _allow_implicit_replication to True (because SDPA generates a non-dtensor attention mask when is_causal=True).
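
As a hedged sketch of this note: with is_causal=True the math decomposition materializes a plain (non-dtensor) causal mask, so mixing it with dtensor inputs needs implicit replication. The snippet assumes the experimental implicit_replication() context manager is the user-facing way to flip _allow_implicit_replication; its import path may differ across PyTorch versions.

```python
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed._tensor import DTensor, DeviceMesh, Replicate
from torch.distributed._tensor.experimental import implicit_replication
from torch.nn.attention import SDPBackend, sdpa_kernel

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
mesh = DeviceMesh(device, [0])

q, k, v = (
    DTensor.from_local(
        torch.randn(2, 8, 64, 32, dtype=torch.float16, device=device),
        mesh,
        [Replicate()],
    )
    for _ in range(3)
)

# is_causal=True makes the math decomposition build a plain causal mask; the
# implicit replication context lets that non-dtensor mask mix with dtensor q/k/v.
with sdpa_kernel(SDPBackend.MATH), implicit_replication():
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.to_local().shape)
dist.destroy_process_group()
```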

This issue doesn't seem to be urgent, as math attention is only a fallback option for flash attention and memory-efficient attention.

Labels: bug (Something isn't working)