Update PyTorch Llama3 70B recipe to calculate metrics from profile #29

Open · bhavya01 wants to merge 1 commit into main

Conversation

@bhavya01 (Collaborator) commented on Feb 14, 2025:

I have only updated the step time calculation in the [flash_attention_minibatch_v6e](https://github.com/pytorch-tpu/transformers/compare/flash_attention_minibatch_v6e) branch, with commit pytorch-tpu/transformers@b185651.

The output of the script looks as follows:

[worker 0] Training completed. Do not forget to share your model on huggingface.co/models =)
[worker 0] 
[worker 0] 
[worker 0] Parsing /tmp/home/profile/plugins/profile/2025_02_14_00_13_42/127.0.0.1_9012.xplane.pb
[worker 0] Plane ID: 2, Name: /device:TPU:0
[worker 0]   Line ID: 2, Name: XLA Modules
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.904149665406 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.901553873328 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905248442828 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.9037382375 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.90603396 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.903193477172 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905565544234 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.902798507172 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.90552500525 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.902875035156 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905281515078 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.902723529422 s
[worker 0]     Event Metadata Name: SyncTensorsGraph.66628(14940262714490726846), ID: 25941, Duration: 2.905252619422 s
[worker 0] Got 13 iterations
100%|██████████| 20/20 [04:39<00:00, 13.99s/it]
[worker 0] [INFO|modelcard.py:450] 2025-02-14 22:28:15,027 >> Dropping the following result as it does not have all the necessary fields:
[worker 0] {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'dataset': {'name': 'wikitext wikitext-103-raw-v1', 'type': 'wikitext', 'args': 'wikitext-103-raw-v1'}}
[worker 0] {'train_runtime': 37.7539, 'train_samples_per_second': 1.377, 'train_steps_per_second': 0.344, 'train_loss': 9.57159881591797, 'epoch': 0.01}
[worker 0] ***** train metrics *****
[worker 0]   epoch                    =     0.0057
[worker 0]   total_flos               =  4688805GF
[worker 0]   train_loss               =     9.5716
[worker 0]   train_runtime            = 0:00:37.75
[worker 0]   train_samples_per_second =      1.377
[worker 0]   train_steps_per_second   =      0.344
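
For reference, a minimal sketch of this kind of parsing is shown below: it reads the `xplane.pb` file and collects the durations of the `XLA Modules` events, which correspond to the per-step times in the log above. This is not the recipe's actual implementation; the `xplane_pb2` import path, the helper name `xla_module_step_times`, and the median statistic printed at the end are illustrative assumptions.

```python
import statistics

# Assumed import path: the XSpace/XPlane protobuf bindings shipped with
# TensorFlow's profiler protos. Adjust if your environment exposes them
# under a different module.
from tensorflow.core.profiler.protobuf import xplane_pb2


def xla_module_step_times(xplane_path: str) -> list[float]:
    """Return the duration in seconds of every 'XLA Modules' event in the profile."""
    xspace = xplane_pb2.XSpace()
    with open(xplane_path, "rb") as f:
        xspace.ParseFromString(f.read())

    durations_s = []
    for plane in xspace.planes:      # e.g. '/device:TPU:0'
        for line in plane.lines:     # e.g. 'XLA Modules'
            if line.name != "XLA Modules":
                continue
            for event in line.events:
                # xplane durations are recorded in picoseconds.
                durations_s.append(event.duration_ps / 1e12)
    return durations_s


if __name__ == "__main__":
    times = xla_module_step_times(
        "/tmp/home/profile/plugins/profile/2025_02_14_00_13_42/127.0.0.1_9012.xplane.pb"
    )
    print(f"Got {len(times)} iterations")
    print(f"Median step time: {statistics.median(times):.6f} s")
```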

@bhavya01 bhavya01 requested a review from zpcore February 14, 2025 22:44
@bhavya01 bhavya01 self-assigned this Feb 14, 2025
@ManfeiBai (Contributor) commented:

Very cool work, @bhavya01! Do you plan to add this new metric to Mixtral8_7B too?

@bhavya01 (Collaborator, Author) replied:

> Very cool work, @bhavya01! Do you plan to add this new metric to Mixtral8_7B too?

Yes, I will do it for Mixtral too.

@@ -1,6 +1,5 @@
 # Base package containing nightly PyTorch/XLA
-ARG BASE_IMAGE=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm
-FROM ${BASE_IMAGE}
+FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_cxx11_20250211
A collaborator commented on this change:

Did you try running with the 20250211 base image on the full pod? Context: pytorch/xla#8683

@bhavya01 (Collaborator, Author) replied:

Good catch! Let me try running on the full pod as well.

Comment on lines +31 to +32:

> (`us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-xla/llama3-70b:feb14build`).
> The docker image uses torch and torch_xla nightly build from 02/11/2024

A reviewer commented:

Could we create a label for the image currently used in testing and rotate that label between versions? That would avoid possible human error and remove the need to change the version each time.
