Add option to build CUDA backend without Flash attention

@slaren Honestly, I think Flash Attention should be an optional feature in ggml: it doesn't bring significant performance improvements, and the binary size has increased considerably. Compilation time has grown too. Even though I only compile for my GPU architecture, it still takes 20 minutes on an i5-12400. This is not related to this PR, but it would be good to take it into account.

_Originally posted by @FSSRepo in https://github.com/ggml-org/llama.cpp/issues/11867#issuecomment-2665873267_
            

Add option to build CUDA backend without Flash attention #11946
