Add MaxText Llama 3.1 70B training with GCS recipe

raymondzouu · raymondzouu · commit f8dcddc6ca96 · 2025-04-10T00:15:58.000Z
diff --git a/training/trillium/Llama3.1-70B-MaxText-with-Storage/README.md b/training/trillium/Llama3.1-70B-MaxText-with-Storage/README.md
@@ -0,0 +1,184 @@
+# Instructions for training Llama3.1-70B-MaxText on TPU Trillium (v6e-256) with Google Cloud Storage (GCS)
+
+## GCS Bucket setup
+1. Create one bucket that holds the dataset for data loading and one bucket for writing checkpoints. To create regional buckets with hierarchical namespace (HNS) enabled, use the following commands:
+```
+# Set variables
+export DATASET_BUCKET="your-dataloading-bucket-name"
+export CHECKPOINT_BUCKET="your-checkpoint-bucket-name"
+export REGION="us-central1"
+
+# Create dataset bucket
+gcloud storage buckets create gs://${DATASET_BUCKET} --location=${REGION}  --default-storage-class=Standard --enable-hierarchical-namespace --uniform-bucket-level-access
+
+# Create checkpoint bucket  
+gcloud storage buckets create gs://${CHECKPOINT_BUCKET} --location=${REGION}  --default-storage-class=Standard --enable-hierarchical-namespace --uniform-bucket-level-access
+```
+Replace the following values:  
+- `<DATASET_BUCKET>`: the name of your Cloud Storage bucket with the training dataset. Do not include the gs:// prefix  
+- `<CHECKPOINT_BUCKET>`: the name of your Cloud Storage bucket where checkpoints will be written. Do not include the gs:// prefix
+- `<DATASET_STORAGE_NAME>`: name of the XPK storage for dataset bucket  
+- `<CHECKPOINT_STORAGE_NAME>`: name of the XPK storage for checkpoint bucket  
+- `<REGION>`: the region where your cluster is located ([available locations](https://cloud.google.com/storage/docs/locations#location-r))
+
+2. Follow these [instructions](https://github.com/AI-Hypercomputer/maxtext/blob/b93beba652db6b3f4e6c82dc48a83b03229f5d3a/getting_started/Data_Input_Pipeline.md#tfds-pipeline) to download the AllenAI C4 dataset, which is used in this recipe.
+Then follow these [instructions](https://github.com/google/array_record/tree/main/beam) to convert the dataset into the ArrayRecord format.
+
+## XPK setup
+1. Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK.
+2. GCSFuse lets you mount Cloud Storage buckets as local file systems, so applications can read and write objects in your bucket using standard file system semantics. [XPK storage](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#storage) adds a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) to the cluster for each bucket.
+Use the commands below to create XPK storage resources for both the dataset and checkpoint buckets so that they can be mounted into the MaxText workload with GCSFuse.
+```
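+export PROJECT="your-gcp-project" # Update
+export CLUSTER="your-xpk-cluster" # Update
+export ZONE="your-cluster-zone" # Update
+export OUTPUT_DIR="checkpointing-mount-point" # Update
+export RECIPE_REPO="path-to-this-recipe-repo" # Update
+export DATASET_STORAGE_NAME="your-dataset-storage-name" # Update
+export CHECKPOINT_STORAGE_NAME="your-checkpoint-storage-name" # Update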
+cd ~/xpk
+
+python3 xpk.py storage attach $DATASET_STORAGE_NAME type=gcsfuse project=$PROJECT cluster=$CLUSTER zone=$ZONE mountpoint=/tmp/dataset readonly=false bucket=$DATASET_BUCKET size=64 automount=false manifest=$RECIPE_REPO/tpu-recipes/training/trillium/Llama3.1-70B-MaxText-with-Storage/dataset_pvc.yaml
+
+python3 xpk.py storage attach $CHECKPOINT_STORAGE_NAME type=gcsfuse project=$PROJECT cluster=$CLUSTER zone=$ZONE mountpoint=/tmp/ckpt readonly=false bucket=$CHECKPOINT_BUCKET size=64 automount=false manifest=$RECIPE_REPO/tpu-recipes/training/trillium/Llama3.1-70B-MaxText-with-Storage/checkpoint_pvc.yaml
+```
+Use a separate manifest file for each bucket: `dataset_pvc.yaml` for the dataset bucket and `checkpoint_pvc.yaml` for the checkpoint bucket, both from this repo.
+Creating the buckets and the XPK storage resources is a one-time setup.
+
+## Prep for MaxText
+
+### Install MaxText and Build Docker Image
+Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install MaxText and build the Docker image.
+
+In step 2, use the jax-stable-stack image containing JAX 0.5.2:
+```
+BASE_IMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.5.2-rev1
+bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable_stack BASEIMAGE=${BASE_IMAGE}
+```
+
+## Run MaxText Llama3.1-70B workloads on GKE
+
+### Starting workload
+
+From the MaxText root directory, start your Llama3.1-70B workload.
+
+Run MaxText Llama 3.1 70B with synthetic data and no checkpointing:
+```
+python3 benchmarks/benchmark_runner.py xpk \
+    project=$PROJECT \
+    zone=$ZONE \
+    device_type=v6e-256 \
+    num_slices=1  \
+    cluster_name=$CLUSTER \
+    base_output_directory=$OUTPUT_DIR \
+    model_name="llama3_1_70b_8192_synthetic" \
+    num_steps=100 \
+    base_docker_image=maxtext_base_image
+```
+
+Run MaxText Llama 3.1 70B with checkpointing and loading real data from GCS:
+```
+python3 benchmarks/benchmark_runner.py xpk \
+    project=$PROJECT \
+    zone=$ZONE \
+    device_type=v6e-256 \
+    num_slices=1  \
+    cluster_name=${CLUSTER} \
+    base_output_directory=/tmp/ckpt \
+    model_name="llama3_1_70b_8192_rd_ckpt_grain" \
+    num_steps=100 \
+    base_docker_image=maxtext_base_image \
+    xpk_storage=$DATASET_STORAGE_NAME xpk_storage=$CHECKPOINT_STORAGE_NAME
+```
+
+If you would like to run on multiple slices of v6e-256, change the value of the `num_slices` argument.
+
+### Workload Details
+
+For reference, here are the `llama3_1_70b_8192_synthetic` and `llama3_1_70b_8192_rd_ckpt_grain` workload details:
+
+```
+  MaxTextModel(
+        model_name="llama3_1-70b-8192",
+        model_type="llama3.1-70b",
+        tuning_params={
+            "per_device_batch_size": 4,
+            "ici_fsdp_parallelism": -1,
+            "remat_policy": "custom",
+            "decoder_layer_input": "offload",
+            "query_proj": "offload",
+            "key_proj": "offload",
+            "value_proj": "offload",
+            "max_target_length": 8192,
+            "attention": "flash",
+            "use_iota_embed": True,
+            "dataset_path": "gs://max-datasets-rogue",
+            "dataset_type": "synthetic",
+            "enable_checkpointing": False,
+            "sa_block_q": 2048,
+            "sa_block_kv": 2048,
+            "sa_block_kv_compute": 2048,
+            "sa_block_q_dkv": 2048,
+            "sa_block_kv_dkv": 2048,
+            "sa_block_kv_dkv_compute": 2048,
+            "sa_block_q_dq": 2048,
+            "sa_block_kv_dq": 2048,
+            "sa_use_fused_bwd_kernel": True,
+            "profiler": "xplane",
+            "skip_first_n_steps_for_profiler": 10,
+            "profiler_steps": 5,
+        },
+        xla_flags=(
+            xla_flags_library.DENSE_VMEM_LIMIT_FLAG
+            + xla_flags_library.LAYOUT_FOR_ALL_REDUCE_SCATTER
+            + xla_flags_library.DATA_PARALLEL_OVERLAP
+            + xla_flags_library.CF_FOR_ALL_GATHER
+            + xla_flags_library.HOST_OFFLOAD_FLAGS
+        ),
+    )
+
+
+    MaxTextModel(
+        model_name="llama3_1_70b_8192_rd_ckpt_grain",
+        model_type="llama3.1-70b",
+        tuning_params={
+            "per_device_batch_size": 2,
+            "ici_fsdp_parallelism": -1,
+            "remat_policy": "custom",
+            "decoder_layer_input": "offload",
+            "query_proj": "offload",
+            "key_proj": "offload",
+            "value_proj": "offload",
+            "max_target_length": 8192,
+            "attention": "flash",
+            "use_iota_embed": True,
+            "dataset_path": "/tmp/dataset",
+            "dataset_type": "grain",
+            "grain_train_files": "/tmp/dataset/array-record/c4/en/3.0.1/c4-train.array_record*",
+            "grain_worker_count": 24,
+            "enable_checkpointing": True,
+            "async_checkpointing": True,
+            "checkpoint_period": 20,
+            "sa_block_q": 2048,
+            "sa_block_kv": 2048,
+            "sa_block_kv_compute": 2048,
+            "sa_block_q_dkv": 2048,
+            "sa_block_kv_dkv": 2048,
+            "sa_block_kv_dkv_compute": 2048,
+            "sa_block_q_dq": 2048,
+            "sa_block_kv_dq": 2048,
+            "sa_use_fused_bwd_kernel": True,
+        },
+        xla_flags=(
+            xla_flags_library.DENSE_VMEM_LIMIT_FLAG
+            + xla_flags_library.LAYOUT_FOR_ALL_REDUCE_SCATTER
+            + xla_flags_library.DATA_PARALLEL_OVERLAP
+            + xla_flags_library.CF_FOR_ALL_GATHER
+            + xla_flags_library.HOST_OFFLOAD_FLAGS
+            + xla_flags_library.ENABLE_SPARSECORE_OFFLOADING_FOR_ALL_REDUCE 
+            +  " --xla_tpu_iova_dma_chunk_size_bytes=104857"
+        ),
+  )
+```
+
+The equivalent workload code can be found in the [maxtext_trillium_model_configs.py](https://github.com/AI-Hypercomputer/maxtext/blob/1e4d513ad70dd4074d975a9f7936295008d4b900/benchmarks/maxtext_trillium_model_configs.py#L1103-L1146) file within the MaxText repository.
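
The two workload configs in the README differ in `per_device_batch_size` (4 for synthetic data, 2 for real data with checkpointing), which changes how many tokens each training step consumes. A minimal sketch of that arithmetic, assuming a v6e-256 slice exposes 256 devices and that the global batch is simply `per_device_batch_size` times the device count; the function name `tokens_per_step` is hypothetical and not part of MaxText:

```python
def tokens_per_step(per_device_batch_size: int,
                    max_target_length: int,
                    devices_per_slice: int = 256,  # assumption: one device per v6e chip
                    num_slices: int = 1) -> int:
    """Tokens consumed by one training step across all devices."""
    global_batch = per_device_batch_size * devices_per_slice * num_slices
    return global_batch * max_target_length


# Values taken from the two tuning_params blocks in the README.
synthetic_tokens = tokens_per_step(per_device_batch_size=4, max_target_length=8192)
real_data_tokens = tokens_per_step(per_device_batch_size=2, max_target_length=8192)

print(f"synthetic: {synthetic_tokens:,} tokens/step")
print(f"real data: {real_data_tokens:,} tokens/step")
```

Under these assumptions, doubling `num_slices` for the real-data workload brings its tokens per step up to the single-slice synthetic figure.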
diff --git a/training/trillium/Llama3.1-70B-MaxText-with-Storage/checkpoint_pvc.yaml b/training/trillium/Llama3.1-70B-MaxText-with-Storage/checkpoint_pvc.yaml
@@ -0,0 +1,42 @@
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: checkpoint-bucket-pv
+spec:
+  accessModes:
+  - ReadWriteMany
+  capacity:
+    storage: 64Gi
+  persistentVolumeReclaimPolicy: Retain
+  storageClassName: gcsfuse-sc # dummy storage class
+  claimRef:
+    namespace: default
+    name: checkpoint-bucket-pvc
+  mountOptions:
+    - metadata-cache:ttl-secs:-1
+    - metadata-cache:negative-ttl-secs:0
+    - metadata-cache:stat-cache-max-size-mb:-1
+    - metadata-cache:type-cache-max-size-mb:-1
+    - file-cache:enable-parallel-downloads:false
+    - file-system:kernel-list-cache-ttl-secs:0
+    - write:enable-streaming-writes:true
+    - file-system:precondition-errors:false
+  csi:
+    driver: gcsfuse.csi.storage.gke.io
+    volumeHandle: your-trillium-chkpt # unique bucket name
+    volumeAttributes:
+      gcsfuseMetadataPrefetchOnMount: "true"
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: checkpoint-bucket-pvc
+  namespace: default
+spec:
+  accessModes:
+  - ReadWriteMany
+  resources:
+    requests:
+      storage: 64Gi
+  volumeName: checkpoint-bucket-pv
+  storageClassName: gcsfuse-sc # dummy storage class
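
A statically provisioned PersistentVolume binds only to the claim named in its `claimRef`, so the PV's `claimRef.namespace` and the PVC's `metadata.namespace` need to agree. A rough lint for that, sketched with the standard library only: it scans for `namespace:` keys line by line instead of parsing YAML, and both the helper name and the sample manifest text are illustrative assumptions.

```python
import re


def namespaces_in_manifest(text: str) -> set[str]:
    """Collect every value assigned to a `namespace:` key, ignoring trailing comments."""
    found = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing YAML comments
        match = re.match(r"\s*namespace:\s*(\S+)", line)
        if match:
            found.add(match.group(1))
    return found


# Hypothetical, trimmed-down PV + PVC pair used only for illustration.
SAMPLE = """\
kind: PersistentVolume
spec:
  claimRef:
    namespace: default
    name: checkpoint-bucket-pvc
---
kind: PersistentVolumeClaim
metadata:
  name: checkpoint-bucket-pvc
  namespace: default
"""

namespaces = namespaces_in_manifest(SAMPLE)
if len(namespaces) > 1:
    print(f"namespace mismatch: {sorted(namespaces)}")
else:
    print(f"consistent namespace: {sorted(namespaces)}")
```

Running the same function over a real manifest file (read with `pathlib.Path(...).read_text()`) flags any stray namespace value before the storage is attached.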
diff --git a/training/trillium/Llama3.1-70B-MaxText-with-Storage/dataset_pvc.yaml b/training/trillium/Llama3.1-70B-MaxText-with-Storage/dataset_pvc.yaml
@@ -0,0 +1,40 @@
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: dataset-bucket-pv
+spec:
+  accessModes:
+  - ReadWriteMany
+  capacity:
+    storage: 64Gi
+  persistentVolumeReclaimPolicy: Retain
+  storageClassName: gcsfuse-sc # dummy storage class
+  claimRef:
+    namespace: default
+    name: dataset-bucket-pvc
+  mountOptions:
+    - metadata-cache:ttl-secs:-1
+    - metadata-cache:stat-cache-max-size-mb:-1
+    - metadata-cache:type-cache-max-size-mb:-1
+    - file-cache:enable-parallel-downloads:false
+    - file-system:kernel-list-cache-ttl-secs:-1
+    - write:enable-streaming-writes:true
+  csi:
+    driver: gcsfuse.csi.storage.gke.io
+    volumeHandle: your-trillium-dataloading # unique bucket name
+    volumeAttributes:
+      gcsfuseMetadataPrefetchOnMount: "true"
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: dataset-bucket-pvc
+  namespace: default
+spec:
+  accessModes:
+  - ReadWriteMany
+  resources:
+    requests:
+      storage: 64Gi
+  volumeName: dataset-bucket-pv
+  storageClassName: gcsfuse-sc # dummy storage class
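
The dataset and checkpoint manifests share most of their GCSFuse `mountOptions` but differ in a few. A small sketch that makes the differences explicit by splitting each `section:key:value` string on its last colon; the helper names are hypothetical, and the option lists are copied from the two manifests in this commit:

```python
def parse_mount_options(options):
    """Map 'section:key' to its value for 'section:key:value' option strings."""
    parsed = {}
    for option in options:
        key, value = option.rsplit(":", 1)  # the value follows the last colon
        parsed[key] = value
    return parsed


def diff_options(left, right):
    """Return {key: (left_value, right_value)} for options that differ; None means absent."""
    a, b = parse_mount_options(left), parse_mount_options(right)
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


DATASET_OPTIONS = [
    "metadata-cache:ttl-secs:-1",
    "metadata-cache:stat-cache-max-size-mb:-1",
    "metadata-cache:type-cache-max-size-mb:-1",
    "file-cache:enable-parallel-downloads:false",
    "file-system:kernel-list-cache-ttl-secs:-1",
    "write:enable-streaming-writes:true",
]
CHECKPOINT_OPTIONS = [
    "metadata-cache:ttl-secs:-1",
    "metadata-cache:negative-ttl-secs:0",
    "metadata-cache:stat-cache-max-size-mb:-1",
    "metadata-cache:type-cache-max-size-mb:-1",
    "file-cache:enable-parallel-downloads:false",
    "file-system:kernel-list-cache-ttl-secs:0",
    "write:enable-streaming-writes:true",
    "file-system:precondition-errors:false",
]

for key, (dataset_value, checkpoint_value) in diff_options(DATASET_OPTIONS, CHECKPOINT_OPTIONS).items():
    print(f"{key}: dataset={dataset_value} checkpoint={checkpoint_value}")
```

The output lists only the three options that differ: the kernel list cache TTL, the negative metadata cache TTL, and the precondition-errors setting.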
diff --git a/training/trillium/Llama3.1-70B-MaxText-with-Storage/llama3-1-70B-real-data-and-ckpt-1xv6e-256.sh b/training/trillium/Llama3.1-70B-MaxText-with-Storage/llama3-1-70B-real-data-and-ckpt-1xv6e-256.sh
@@ -0,0 +1,11 @@
+python3 benchmarks/benchmark_runner.py xpk \
+    project=$PROJECT \
+    zone=$ZONE \
+    device_type=v6e-256 \
+    num_slices=1  \
+    cluster_name=${CLUSTER} \
+    base_output_directory=/tmp/ckpt \
+    model_name="llama3_1_70b_8192_rd_ckpt_grain" \
+    num_steps=100 \
+    base_docker_image=maxtext_base_image \
+    xpk_storage=$DATASET_STORAGE_NAME xpk_storage=$CHECKPOINT_STORAGE_NAME
diff --git a/training/trillium/Llama3.1-70B-MaxText-with-Storage/llama3-1-70B-synthetic-data-1xv6e-256.sh b/training/trillium/Llama3.1-70B-MaxText-with-Storage/llama3-1-70B-synthetic-data-1xv6e-256.sh
@@ -0,0 +1,10 @@
+python3 benchmarks/benchmark_runner.py xpk \
+    project=$PROJECT \
+    zone=$ZONE \
+    device_type=v6e-256 \
+    num_slices=1  \
+    cluster_name=$CLUSTER \
+    base_output_directory=$OUTPUT_DIR \
+    model_name="llama3_1_70b_8192_synthetic" \
+    num_steps=100 \
+    base_docker_image=maxtext_base_image
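
The synthetic-data script above reads `PROJECT`, `ZONE`, `CLUSTER`, and `OUTPUT_DIR` from the environment, and an unset variable expands to an empty string in the command. A minimal pre-flight check, sketched under the assumption that these four names are the only ones the script needs; the helper `missing_vars` is hypothetical:

```python
import os

# Variables referenced by the synthetic-data launch script.
REQUIRED_VARS = ("PROJECT", "ZONE", "CLUSTER", "OUTPUT_DIR")


def missing_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_vars()
if missing:
    print("Set these variables before launching:", ", ".join(missing))
else:
    print("All required variables are set.")
```

Passing an explicit `env` mapping makes the check easy to try without touching the real environment.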