Commit e90183f

RecML authors authored and committed

[RecML] fix DLRM benchmark script issue and README instruction.

1. Reduce the batch size from 135168 to 4224. The previous batch size was too large for the default dataset, which caused parsing issues.
2. Fix the script file path typo, and make the scripts executable before running them.
3. Downgrade the protobuf version to avoid an incompatibility error between the Python code and the protobuf library.

PiperOrigin-RevId: 775550565

1 parent 1f41fca commit e90183f
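As a quick sanity check on the numbers in the commit message: the old batch size is exactly 32 times the new one. Whether that factor relates to a per-shard example count in the `criteo_merge_balanced_4224` dataset is an assumption; the arithmetic itself is easy to verify:

```shell
# The old batch size (135168) is exactly 32x the new one (4224).
# Any connection between 4224 and the dataset's per-shard size is an assumption.
old=135168
new=4224
echo $(( old / new ))   # prints 32
echo $(( new * 32 ))    # prints 135168
```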

File tree

4 files changed: +7 -9 lines changed

recml/inference/benchmarks/DLRM_DCNv2/ckpt_load_and_eval.sh

Lines changed: 2 additions & 3 deletions

@@ -6,14 +6,14 @@ export XLA_FLAGS=
 
 export TPU_NAME=<TPU_NAME>
 export LEARNING_RATE=0.0034
-export BATCH_SIZE=135168
+export BATCH_SIZE=4224
 export EMBEDDING_SIZE=128
 export MODEL_DIR=/tmp/
 export FILE_PATTERN=gs://qinyiyan-vm/mlperf-dataset/criteo_merge_balanced_4224/train-*
 export NUM_STEPS=28000
 export CHECKPOINT_INTERVAL=1500
 export EVAL_INTERVAL=1500
-export EVAL_FILE_PATTER=gs://qinyiyan-vm/mlperf-dataset/criteo_merge_balanced_4224/eval-*
+export EVAL_FILE_PATTERN=gs://qinyiyan-vm/mlperf-dataset/criteo_merge_balanced_4224/eval-*
 export EVAL_STEPS=660
 export MODE=eval
 export EMBEDDING_THRESHOLD=21000
@@ -23,7 +23,6 @@ export RESTORE_CHECKPOINT=true
 
 
 python recml/inference/models/jax/DLRM_DCNv2/dlrm_main.py \
-
 --learning_rate=${LEARNING_RATE} \
 --batch_size=${BATCH_SIZE} \
 --embedding_size=${EMBEDDING_SIZE} \

recml/inference/benchmarks/DLRM_DCNv2/train_and_checkpoint.sh

Lines changed: 2 additions & 3 deletions

@@ -6,22 +6,21 @@ export XLA_FLAGS=
 
 export TPU_NAME=<TPU_NAME>
 export LEARNING_RATE=0.0034
-export BATCH_SIZE=135168
+export BATCH_SIZE=4224
 export EMBEDDING_SIZE=128
 export MODEL_DIR=/tmp/
 export FILE_PATTERN=gs://qinyiyan-vm/mlperf-dataset/criteo_merge_balanced_4224/train-*
 export NUM_STEPS=28000
 export CHECKPOINT_INTERVAL=1500
 export EVAL_INTERVAL=1500
-export EVAL_FILE_PATTER=gs://qinyiyan-vm/mlperf-dataset/criteo_merge_balanced_4224/eval-*
+export EVAL_FILE_PATTERN=gs://qinyiyan-vm/mlperf-dataset/criteo_merge_balanced_4224/eval-*
 export EVAL_STEPS=660
 export MODE=train
 export EMBEDDING_THRESHOLD=21000
 export LOGGING_INTERVAL=1500
 export RESTORE_CHECKPOINT=true
 
 python recml/inference/models/jax/DLRM_DCNv2/dlrm_main.py \
-
 --learning_rate=${LEARNING_RATE} \
 --batch_size=${BATCH_SIZE} \
 --embedding_size=${EMBEDDING_SIZE} \

recml/inference/benchmarks/README.md

Lines changed: 2 additions & 2 deletions

@@ -54,10 +54,10 @@ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --project ${PROJECT} --zone ${Z
 gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --project ${PROJECT} --zone ${ZONE} --worker=all --command="pip install -U tensorflow dm-tree flax google-metrax"
 ```
 
-#### Run workload
+#### Make script executable & Run workload
 
 Note: Please update the MODEL_NAME & TASK_NAME before running the below command
 
 ```
-gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --project ${PROJECT} --zone ${ZONE} --worker=all --command="TPU_NAME=${TPU_NAME} ./inference/benchmarks/<MODEL_NAME>/<TASK_NAME>"
+gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --project ${PROJECT} --zone ${ZONE} --worker=all --command="cd RecML && chmod +x ./recml/inference/benchmarks/<MODEL_NAME>/<TASK_NAME> && TPU_NAME=${TPU_NAME} ./recml/inference/benchmarks/<MODEL_NAME>/<TASK_NAME>"
 ```
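The `chmod +x` step added to the README command matters because a freshly cloned script may not carry the executable bit, in which case invoking it directly fails with "Permission denied". A minimal local sketch of that step, using a throwaway script rather than the repo's benchmark script:

```shell
# Sketch of the "make script executable" step, with a throwaway script
# standing in for the repo's benchmark script.
script=$(mktemp /tmp/demo_bench.XXXXXX)
printf '#!/bin/sh\necho "benchmark started"\n' > "$script"
chmod +x "$script"   # without this, "$script" fails with "Permission denied"
"$script"            # prints "benchmark started"
rm -f "$script"
```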

requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -63,7 +63,7 @@ platformdirs==4.3.7
 pluggy==1.5.0
 pre-commit==4.2.0
 promise==2.3
-protobuf==5.29.4
+protobuf==4.21.12
 psutil==7.0.0
 pyarrow==19.0.1
 pygments==2.19.1
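A quick guard that an environment picked up the 4.x protobuf pin rather than a 5.x release could look like the sketch below. The version string is hardcoded for illustration; in a real environment it would come from something like `pip show protobuf`:

```shell
# Sketch: check that the resolved protobuf version is a 4.x release, matching
# the pin above. The hardcoded version stands in for the installed one.
ver="4.21.12"
case "$ver" in
  4.*) echo "protobuf pin ok: $ver" ;;
  *)   echo "unexpected protobuf version: $ver" >&2; exit 1 ;;
esac
```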
