
Support for FlashAttention in Llama2 #584

Merged (3 commits) on Dec 12, 2023
Conversation

@wszczurekhabana (Contributor) commented Dec 6, 2023

What does this PR do?

This PR introduces support for the FusedSDPA operator (Flash Attention) in Llama2.

Below are preliminary results from this change:

| Model | BS | Max input tokens | Max new tokens | Default performance [tokens/second] | Performance with FusedSDPA [tokens/second] | Throughput improvement over default | Default memory allocated [GB] | Memory allocated with FusedSDPA [GB] | Memory reduction over default [GB] |
|---|---|---|---|---|---|---|---|---|---|
| 7B 1x | 1 | 16 | 4096 | 124.12 | 124.130 | 1.000 | 14.71 | 14.69 | 0.02 |
| | 4 | 16 | 4096 | 354.27 | 354.500 | 1.001 | 20.81 | 20.71 | 0.10 |
| 13B 1x | 1 | 16 | 4096 | 68.81 | 68.840 | 1.000 | 27.57 | 27.54 | 0.03 |
| | 4 | 16 | 4096 | 203.63 | 203.790 | 1.001 | 37.05 | 36.95 | 0.10 |
| 70B 8x | 1 | 16 | 100 | 55.55 | 56.440 | 1.016 | 16.68 | 16.66 | 0.02 |
| | 40 | 16 | 100 | 1875.95 | 1893.266 | 1.009 | 18.55 | 17.21 | 1.34 |
| | 1 | 16 | 2048 | 60.23 | 59.617 | 0.990 | 16.76 | 16.74 | 0.02 |
| | 40 | 16 | 2048 | 1685.81 | 1686.918 | 1.001 | 21.25 | 20.27 | 0.98 |
| | 60 | 16 | 2048 | 2206.59 | 2208.290 | 1.001 | 24.67 | 22.09 | 2.58 |
| | 1 | 16 | 4096 | 59.77 | 59.044 | 0.988 | 16.86 | 16.82 | 0.04 |
| | 40 | 16 | 4096 | 1366.84 | 1368.552 | 1.001 | 24.99 | 23.50 | 1.49 |
| | 60 | 16 | 4096 | 1689.34 | 1689.680 | 1.000 | 30.09 | 26.92 | 3.17 |

This feature is turned off by default.
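
For context, the sketch below (not the actual diff of this PR) shows what switching between the default attention path and the fused kernel can look like. It assumes `FusedSDPA` from `habana_frameworks.torch.hpex.kernels` exposes an `apply()` interface mirroring `torch.nn.functional.scaled_dot_product_attention`; check the Habana documentation for the exact signature.

```python
import torch.nn.functional as F

try:
    from habana_frameworks.torch.hpex.kernels import FusedSDPA  # HPU-only kernel
except ImportError:
    FusedSDPA = None


def sdpa(query, key, value, attn_mask=None, use_flash_attention=False):
    """Scaled dot-product attention with an optional fused (Flash Attention) path."""
    if use_flash_attention and FusedSDPA is not None:
        # Assumed interface: FusedSDPA.apply(q, k, v, attn_mask, dropout_p, is_causal),
        # mirroring torch.nn.functional.scaled_dot_product_attention. The fused kernel
        # computes softmax(QK^T / sqrt(d)) @ V without materializing the full
        # attention matrix.
        return FusedSDPA.apply(query, key, value, attn_mask, 0.0, False)
    # Default path, used when the flag is off (matches the PR's default behavior).
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```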

@wszczurekhabana (Contributor, Author) commented:

Hi @regisss, could you take a look at this PR and trigger CI for it?

@puneeshkhanna (Contributor) commented Dec 6, 2023

@wszczurekhabana - Changes look good to me. I just left a minor comment about the help text and about passing the parameter in create_custom_forward() in modeling_llama.py (see the sketch below).
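
For readers unfamiliar with that pattern, here is a hypothetical sketch of how an extra flag such as use_flash_attention could be threaded through the create_custom_forward() closure used for gradient checkpointing in modeling_llama.py; the argument names and layer signature are assumptions, not this PR's actual code.

```python
from torch.utils.checkpoint import checkpoint


def run_layer_with_checkpointing(decoder_layer, hidden_states, attention_mask,
                                 use_flash_attention=False):
    # Hypothetical sketch: the decoder layer is assumed to accept a
    # `use_flash_attention` keyword argument.
    def create_custom_forward(module):
        def custom_forward(*inputs):
            # Non-tensor options are captured from the enclosing scope, since
            # torch.utils.checkpoint.checkpoint only forwards positional inputs.
            return module(*inputs, use_flash_attention=use_flash_attention)

        return custom_forward

    return checkpoint(create_custom_forward(decoder_layer),
                      hidden_states, attention_mask)
```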

@mandy-li added the run-test (Run CI for PRs from external contributors) label on Dec 6, 2023
@regisss (Collaborator) commented Dec 6, 2023

Is this the same as #583?
cc @mandy-li

@mandy-li (Collaborator) commented Dec 6, 2023

> Is this the same as #583? cc @mandy-li

@regisss, this PR focuses on inference, but there is one overlapping file. This PR can go first; once it is merged, I will update my PR to contain only the FT-related code changes. Thanks.

@mandy-li (Collaborator) commented Dec 7, 2023

LGTM

@regisss added and removed the run-test (Run CI for PRs from external contributors) label on Dec 8, 2023
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss (Collaborator) left a review:

LGTM!

The code style check failed. Could you run the following please?

pip install --upgrade ruff
make style

@regisss (Collaborator) commented Dec 11, 2023

@wszczurekhabana There is a merge conflict to resolve since #589 was merged.

@regisss added and removed the run-test (Run CI for PRs from external contributors) label on Dec 12, 2023
@regisss (Collaborator) left a review:

LGTM
