
Benchmark Layer Skip #1

Open · wants to merge 6 commits into main

Conversation

@mostafaelhoushi commented Oct 28, 2024

I followed this blog and implemented early exit self-speculation here.

@mostafaelhoushi (Author)

I ran the benchmark for a large number of models, comparing 2-model speculative decoding with self-speculative decoding, and put the results here (note there is a separate sheet for each task):
https://docs.google.com/spreadsheets/d/1YASFEJl5WPmiXbtW-5-PA5nVqtXlZM9YifaAuv49hhI/edit?usp=sharing

Cc @gante
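
For context, the 2-model baseline in those sheets is the standard assisted-generation path in transformers, where a smaller assistant model drafts tokens that the main model verifies via the assistant_model argument. A minimal sketch, assuming an illustrative Llama-2 70B / 7B pairing rather than the exact benchmark configuration:

```python
# Sketch of 2-model speculative decoding: the large main model verifies
# draft tokens produced by a smaller assistant model.
# The checkpoints below are illustrative, not the benchmark's exact pairs.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

inputs = tokenizer("Speculative decoding is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```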

@gante (Owner) commented Nov 4, 2024

Perfect, thank you for opening the PR 💛 I'll merge this one as soon as the transformers PR is merged (on main, early_exit doesn't exist as an arg to generate)

@@ -52,7 +53,7 @@ def run_model(args, processor_cls, model_cls, run_prediction_loop):
     tokenizer = processor_cls.from_pretrained(args.model)

     if args.max_gpu_memory is None:  # fails if it doesn't fit in a GPU
-        max_memory = {0: "100GiB", "cpu": "0GiB"}
+        max_memory = {0: "100GiB", "cpu": "50GiB"}
@mostafaelhoushi (Author)

n00b question: why do we need to modify max_memory?
For Llama2 70B run on 8 A100 GPUs, I had to replace this line with:
max_memory = None
to get it to work.

@gante (Owner)

haha for no good reason: long ago, when I wrote this script, my setup crashed if a) the model didn't fully fit on the GPU OR b) I didn't set a limit slightly below the device capacity -- e.g. 20GiB on a RTX3090 (24GB)

It can and should be replaced.

@mostafaelhoushi (Author)

Haha, OK. I submitted a commit that keeps max_memory set to None if it is not specified by the user.
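
A minimal sketch of what that commit does, assuming the script forwards max_memory to from_pretrained with device_map="auto"; variable names mirror the diff above, and the exact code in the PR may differ:

```python
# Only cap memory when the user explicitly asks for it; otherwise leave
# max_memory as None so device_map="auto" can spread the model across all
# available devices (this is what made Llama2 70B on 8 A100s work above).
if args.max_gpu_memory is not None:
    max_memory = {0: f"{args.max_gpu_memory}GiB", "cpu": "50GiB"}
else:
    max_memory = None

model = model_cls.from_pretrained(
    args.model, device_map="auto", max_memory=max_memory
)
```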

@mostafaelhoushi (Author)

Now that huggingface/transformers#34240 has been merged, I think this PR is ready to merge.
I have also updated the early_exit argument to assistant_early_exit.
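
For reference, a minimal sketch of the renamed argument in use, assuming the API added in huggingface/transformers#34240; the checkpoint and early-exit layer index below are illustrative:

```python
# Self-speculative decoding: generate() drafts tokens with only the first
# `assistant_early_exit` layers of the model and verifies them with the full
# model, so no separate assistant checkpoint is needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # illustrative LayerSkip-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

inputs = tokenizer("Speculative decoding is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```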

@mostafaelhoushi (Author)

Just noticed there are conflicts. Working on resolving them now.

@mostafaelhoushi (Author)

I re-ran the commands to verify that they work and fixed some further errors that popped up.

@mostafaelhoushi (Author)

Hi @gante. Just a gentle ping in case you would like to merge the PR.
