Refactoring LLama Attention and mlp layers #589

bgoldberg-habana · 2023-12-08T11:30:29Z

Module for scope linearAllreduce.

This change reduces memory consumption and enables better optimizations in Synapse when running Llama 70B with DeepSpeed.

Change-Id: I3a30a09d6d61aece7ce605bb672e1485d3fbe1cc
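The PR text does not show the module itself, so the following is only a rough, plain-Python illustration of what scoping a linear all-reduce layer could mean. It assumes the matmul, the all-reduce, and the bias add are exposed as separate steps. Every name here (`StagedRowParallelLinear`, `all_reduce_sum`, `post_all_reduce`) is hypothetical and is not taken from the PR, DeepSpeed, or Synapse.

```python
# Illustrative only: a plain-Python stand-in for a row-parallel linear layer
# whose matmul, all-reduce, and bias add are separate steps.
# All names are hypothetical and not taken from the PR.

def matvec(weight, x):
    """Return weight @ x for a list-of-rows weight and a list vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class StagedRowParallelLinear:
    def __init__(self, weight_shard, bias=None):
        self.weight_shard = weight_shard  # this rank's slice of the input columns
        self.bias = bias                  # full bias, applied once after the reduce

    def forward(self, x_shard):
        # Stage 1: local partial product only, no communication.
        return matvec(self.weight_shard, x_shard)

    def post_all_reduce(self, reduced):
        # Stage 3: add the bias once, after the partial results were summed.
        if self.bias is None:
            return reduced
        return [r + b for r, b in zip(reduced, self.bias)]


def all_reduce_sum(partials):
    # Stage 2: stand-in for a collective all-reduce across ranks.
    return [sum(vals) for vals in zip(*partials)]


# Full layer: y = W @ x + b, with a 2x4 weight split by columns across 2 ranks.
W = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -1.0, 2.0]]
b = [0.1, -0.2]
x = [1.0, 2.0, 3.0, 4.0]

ranks = [
    StagedRowParallelLinear([row[:2] for row in W], bias=b),
    StagedRowParallelLinear([row[2:] for row in W], bias=b),
]
partials = [ranks[0].forward(x[:2]), ranks[1].forward(x[2:])]
reduced = all_reduce_sum(partials)
y_staged = ranks[0].post_all_reduce(reduced)

# Unsharded reference for comparison.
y_reference = [v + bias for v, bias in zip(matvec(W, x), b)]
```

In this sketch the staged result matches the unsharded one. Keeping the stages separate lets a caller decide where the communication step sits relative to the surrounding attention or MLP computation.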

HuggingFaceDocBuilderDev · 2023-12-08T11:36:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

optimum/habana/transformers/models/llama/modeling_llama.py

optimum/habana/transformers/models/modeling_all_models.py

MrGeva · 2023-12-10T18:34:14Z

LGTM

regisss

I just left a last comment that will be addressed quickly.

Also, do you have numbers showing how much memory this saves?

optimum/habana/transformers/models/llama/modeling_llama.py

bgoldberg-habana · 2023-12-11T09:01:42Z

Command line:
ENABLE_SYNAPSE_QUANTIZATION=false USE_DEFAULT_QUANT_PARAM=true UPDATE_GRAPH_OUTPUT_MME=false ENABLE_CALC_DYNAMIC_RANGE=false ENABLE_EXPERIMENTAL_FLAGS=true deepspeed --num_gpus 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --kv_cache_fp8 --batch_size 50 --fp8 --reuse_cache --trim_logits --n_iterations 5 --attn_softmax_bf16 --limit_hpu_graphs --max_new_tokens 2048 --max_input_tokens 2048

Note that I'm already running on 1.14, but I don't think the numbers changed much from 1.13.

With this change:
Throughput (including tokenization) = 1581.191910099665 tokens/second
Number of HPU graphs = 333
Memory allocated = 19.07 GB
Max memory allocated = 49.15 GB
Total memory available = 94.62 GB
Graph compilation duration = 524.4125659640013 seconds

Reference (without this change):
Throughput (including tokenization) = 1257.5571168775869 tokens/second
Number of HPU graphs = 333
Memory allocated = 27.33 GB
Max memory allocated = 87.02 GB
Total memory available = 94.62 GB
Graph compilation duration = 542.6321858290012 seconds
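To make the comparison easier to read, here is a small sketch that computes the relative differences between the two runs. The figures are copied from the output above; the dictionary keys and the `pct_change` helper are names chosen for this example.

```python
# Figures copied from the two runs reported above.
with_change = {
    "throughput_tok_s": 1581.191910099665,
    "mem_alloc_gb": 19.07,
    "max_mem_gb": 49.15,
    "compile_s": 524.4125659640013,
}
reference = {
    "throughput_tok_s": 1257.5571168775869,
    "mem_alloc_gb": 27.33,
    "max_mem_gb": 87.02,
    "compile_s": 542.6321858290012,
}


def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


deltas = {key: pct_change(with_change[key], reference[key]) for key in reference}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

For this single run, throughput is about 25.7% higher, max memory allocated is about 43.5% lower, memory allocated is about 30.2% lower, and graph compilation time is about 3.4% lower.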

Refactoring LLama Attention and mlp layers

cfea517

Module for scope linearAllreduce this change allows better memory consumption and better optimizations in synapse Change-Id: I3a30a09d6d61aece7ce605bb672e1485d3fbe1cc

bgoldberg-habana requested review from mandy-li and libinta as code owners December 8, 2023 11:30

bgoldberg-habana requested a review from a user December 8, 2023 11:30

bgoldberg-habana requested a review from regisss as a code owner December 8, 2023 11:30

regisss reviewed Dec 8, 2023

optimum/habana/transformers/models/llama/modeling_llama.py

optimum/habana/transformers/models/llama/modeling_llama.py

optimum/habana/transformers/models/modeling_all_models.py

fix CR comments

0d082b2

bgoldberg-habana requested a review from regisss December 10, 2023 18:36

regisss reviewed Dec 10, 2023

optimum/habana/transformers/models/llama/modeling_llama.py

fix cr comments

d3782ae

bgoldberg-habana requested a review from regisss December 11, 2023 09:13

regisss merged commit afea217 into main Dec 11, 2023
9 checks passed

regisss deleted the scope branch December 11, 2023 13:46

regisss mentioned this pull request Dec 11, 2023

Support for FlashAttention in Llama2 #584

Merged

schoi-habana mentioned this pull request Mar 29, 2024

Update Mixtral-8x7B Optimization #836

Closed
