Update PyTorch Llama3 70B recipe to calculate metrics from profile #29

Open · bhavya01 wants to merge 1 commit into main

Conversation

@bhavya01 (Collaborator) commented on Feb 14, 2025:

I have only updated the step time calculation in the [flash_attention_minibatch_v6e](https://github.com/pytorch-tpu/transformers/compare/flash_attention_minibatch_v6e) branch, with commit pytorch-tpu/transformers@b185651.

The output of the script looks as follows:

[worker 0] Training completed. Do not forget to share your model on huggingface.co/models =)
[worker 0] 
[worker 0] 
[worker 0] Parsing /tmp/home/profile/plugins/profile/2025_02_14_00_13_42/127.0.0.1_9012.xplane.pb
[worker 0] Plane ID: 2, Name: /device:TPU:0
[worker 0]   Line ID: 2, Name: XLA Modules
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.904149665406 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.901553873328 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905248442828 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.9037382375 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.90603396 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.903193477172 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905565544234 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.902798507172 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.90552500525 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.902875035156 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905281515078 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.902723529422 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905252619422 s
[worker 0] Got 13 iterations
100%|██████████| 20/20 [04:39<00:00, 13.99s/it]
[worker 0] [INFO|modelcard.py:450] 2025-02-14 22:28:15,027 >> Dropping the following result as it does not have all the necessary fields:
[worker 0] {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'dataset': {'name': 'wikitext wikitext-103-raw-v1', 'type': 'wikitext', 'args': 'wikitext-103-raw-v1'}}
[worker 0] {'train_runtime': 37.7539, 'train_samples_per_second': 1.377, 'train_steps_per_second': 0.344, 'train_loss': 9.57159881591797, 'epoch': 0.01}
[worker 0] ***** train metrics *****
[worker 0]   epoch                    =     0.0057
[worker 0]   total_flos               =  4688805GF
[worker 0]   train_loss               =     9.5716
[worker 0]   train_runtime            = 0:00:37.75
[worker 0]   train_samples_per_second =      1.377
[worker 0]   train_steps_per_second   =      0.344
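
For reference, a minimal sketch of this kind of parsing is shown below: it reads the `xplane.pb` file and collects the durations of the `XLA Modules` events, which correspond to the per-step times in the log above. This is not the recipe's actual implementation; the `xplane_pb2` import path, the helper name `xla_module_step_times`, and the median statistic printed at the end are illustrative assumptions.

```python
import statistics

# Assumed import path: the XSpace/XPlane protobuf bindings shipped with
# TensorFlow's profiler protos. Adjust if your environment exposes them
# under a different module.
from tensorflow.core.profiler.protobuf import xplane_pb2


def xla_module_step_times(xplane_path: str) -> list[float]:
    """Return the duration in seconds of every 'XLA Modules' event in the profile."""
    xspace = xplane_pb2.XSpace()
    with open(xplane_path, "rb") as f:
        xspace.ParseFromString(f.read())

    durations_s = []
    for plane in xspace.planes:      # e.g. '/device:TPU:0'
        for line in plane.lines:     # e.g. 'XLA Modules'
            if line.name != "XLA Modules":
                continue
            for event in line.events:
                # xplane durations are recorded in picoseconds.
                durations_s.append(event.duration_ps / 1e12)
    return durations_s


if __name__ == "__main__":
    times = xla_module_step_times(
        "/tmp/home/profile/plugins/profile/2025_02_14_00_13_42/127.0.0.1_9012.xplane.pb"
    )
    print(f"Got {len(times)} iterations")
    print(f"Median step time: {statistics.median(times):.6f} s")
```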

@bhavya01 bhavya01 requested a review from zpcore February 14, 2025 22:44
@bhavya01 bhavya01 self-assigned this Feb 14, 2025
@ManfeiBai (Contributor) commented:

Very cool work, @bhavya01! Do you plan to add this new metric to Mixtral8_7B too?

@bhavya01 (Collaborator, Author) replied:

> Very cool work, @bhavya01! Do you plan to add this new metric to Mixtral8_7B too?

Yes, I will do it for Mixtral too.

@@ -1,6 +1,5 @@
 # Base package containing nightly PyTorch/XLA
-ARG BASE_IMAGE=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm
-FROM ${BASE_IMAGE}
+FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_cxx11_20250211
A collaborator commented on this change:

Did you try running with the 20250211 base image on the full pod? Context: pytorch/xla#8683

@bhavya01 (Collaborator, Author) replied:

Good catch! Let me try running on the full pod as well.

Comment on lines +31 to +32:

> (`us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-xla/llama3-70b:feb14build`).
> The docker image uses torch and torch_xla nightly build from 02/11/2024

A reviewer commented:

Could we create a label for the image currently used in testing and rotate that label between versions? That would avoid possible human error and remove the need to change the version each time.
