
Benchmark Layer Skip #1

Open · wants to merge 6 commits into main

Conversation

@mostafaelhoushi commented Oct 28, 2024

I followed this blog and implemented early exit self-speculation here.

@mostafaelhoushi (Author)

I ran the benchmark for a large number of models, comparing 2-model speculative decoding with self-speculative decoding, and put the results here (note there is a separate sheet for each task):
https://docs.google.com/spreadsheets/d/1YASFEJl5WPmiXbtW-5-PA5nVqtXlZM9YifaAuv49hhI/edit?usp=sharing

Cc @gante
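
For context, the 2-model baseline in those sheets is the standard assisted-generation path in transformers, where a smaller assistant model drafts tokens that the main model verifies via the assistant_model argument. A minimal sketch, assuming an illustrative Llama-2 70B / 7B pairing rather than the exact benchmark configuration:

```python
# Sketch of 2-model speculative decoding: the large main model verifies
# draft tokens produced by a smaller assistant model.
# The checkpoints below are illustrative, not the benchmark's exact pairs.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

inputs = tokenizer("Speculative decoding is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```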

@gante (Owner) commented Nov 4, 2024

Perfect, thank you for opening the PR 💛 I'll merge this one as soon as the transformers PR is merged (on main, early_exit doesn't exist as an arg to generate)

@@ -52,7 +53,7 @@ def run_model(args, processor_cls, model_cls, run_prediction_loop):
     tokenizer = processor_cls.from_pretrained(args.model)

     if args.max_gpu_memory is None:  # fails if it doesn't fit in a GPU
-        max_memory = {0: "100GiB", "cpu": "0GiB"}
+        max_memory = {0: "100GiB", "cpu": "50GiB"}
@mostafaelhoushi (Author)

n00b question: why do we need to modify max_memory?
For Llama2 70B run on 8 A100 GPUs, I had to replace this line with:
max_memory = None
to get it to work.

@gante (Owner)

haha for no good reason: long ago, when I wrote this script, my setup crashed if a) the model didn't fully fit on the GPU OR b) I didn't set a limit slightly below the device capacity -- e.g. 20GiB on a RTX3090 (24GB)

It can and should be replaced.

@mostafaelhoushi (Author)

Haha, OK. I submitted a commit that keeps max_memory set to None if it is not specified by the user.
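
A minimal sketch of what that commit does, assuming the script forwards max_memory to from_pretrained with device_map="auto"; variable names mirror the diff above, and the exact code in the PR may differ:

```python
# Only cap memory when the user explicitly asks for it; otherwise leave
# max_memory as None so device_map="auto" can spread the model across all
# available devices (this is what made Llama2 70B on 8 A100s work above).
if args.max_gpu_memory is not None:
    max_memory = {0: f"{args.max_gpu_memory}GiB", "cpu": "50GiB"}
else:
    max_memory = None

model = model_cls.from_pretrained(
    args.model, device_map="auto", max_memory=max_memory
)
```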

@mostafaelhoushi (Author)

Now that huggingface/transformers#34240 has been merged, I think this PR is ready to merge.
I have also updated the early_exit argument to assistant_early_exit.
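
For reference, a minimal sketch of the renamed argument in use, assuming the API added in huggingface/transformers#34240; the checkpoint and early-exit layer index below are illustrative:

```python
# Self-speculative decoding: generate() drafts tokens with only the first
# `assistant_early_exit` layers of the model and verifies them with the full
# model, so no separate assistant checkpoint is needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # illustrative LayerSkip-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

inputs = tokenizer("Speculative decoding is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```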

@mostafaelhoushi (Author)

Just noticed there are conflicts. Working on resolving them now.

@mostafaelhoushi (Author)

I re-ran the commands to verify that they work and fixed some further errors that popped up.

@mostafaelhoushi (Author)

Hi @gante. Just a gentle ping in case you would like to merge the PR.
