
workingloong edited this page Jul 29, 2020 · 15 revisions

# AllReduce Benchmark

## Minikube

- Batch size: 64
- Number of batches per task: 50
- Dataset: cifar10, image size (32, 32, 3)
- Worker resource: `cpu=0.3,memory=2048Mi,ephemeral-storage=1024Mi`

### Resnet50

Resnet50 is a computation-intensive model; it has 23,555,082 trainable parameters for cifar10.

| Workers | Computation : communication | Speed | Speedup ratio |
|---|---|---|---|
| 1 | 0% | 3.1 images/s | 1 |
| 2 | 10 : 1 | 5.65 images/s | 1.82 |
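As a consistency check, the speedup ratio column in these tables is simply each configuration's throughput divided by the single-worker throughput. A minimal sketch, using the numbers from the Resnet50 table above:

```python
def speedup(speed, baseline):
    """Speedup ratio relative to the single-worker baseline, rounded as in the tables."""
    return round(speed / baseline, 2)

# Resnet50 on Minikube: 1 worker -> 3.1 images/s, 2 workers -> 5.65 images/s
print(speedup(5.65, 3.1))  # -> 1.82
```

The same formula reproduces the speedup columns of the other tables on this page, e.g. 44.7 / 29 ≈ 1.54 for MobileNetV2 with 2 workers.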

### MobileNetV2

MobileNetV2 is a communication-intensive model; it has 2,236,682 trainable parameters.

| Workers | Computation : communication | Speed | Speedup ratio |
|---|---|---|---|
| 1 | - | 29 images/s | 1 |
| 2 | 10 : 3 | 44.7 images/s | 1.54 |
| 3 | 10 : 6 | 57.2 images/s | 1.97 |

## ASI

### CPU only

Worker resource: `cpu=4,memory=8192Mi,ephemeral-storage=1024Mi`

#### MobileNetV2

| Workers | Communication share | Speed | Speedup ratio |
|---|---|---|---|
| 1 | 0% | 353.6 images/s | 1 |
| 2 | 24% | 503 images/s | 1.42 |
| 4 | 44.7% | 680 images/s | 1.92 |
| 8 | 66.7% | 648 images/s | 1.83 |
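The scaling drop from 4 to 8 workers tracks the growing communication share: parallel efficiency (speedup divided by worker count) falls as allreduce takes a larger fraction of each step. A small check against the MobileNetV2 CPU numbers above:

```python
def efficiency(speedup, workers):
    """Parallel efficiency: measured speedup divided by worker count."""
    return round(speedup / workers, 2)

# MobileNetV2 on CPU: efficiency drops as the communication share grows
for workers, s in [(2, 1.42), (4, 1.92), (8, 1.83)]:
    print(workers, efficiency(s, workers))
# -> 2 0.71, 4 0.48, 8 0.23
```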

#### Resnet50

| Workers | Communication share | Speed | Speedup ratio |
|---|---|---|---|
| 1 | 0% | 26.7 images/s | 1 |
| 2 | 18% | 41 images/s | 1.57 |
| 4 | 25% | 68.4 images/s | 2.56 |
| 8 | 32% | 123 images/s | 4.61 |

### GPU

- Dataset: ImageNet, image size (256, 256, 3)
- Mini-batch size: 64
- One task per 16 mini-batches

#### MobileNetV2

1024 images per task (16 mini-batches × 64 images).

| Workers | Speed | Total task time | AllReduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 169 images/s | 6.06s | - | - | 5.59s |
| 2 | 246 images/s | 8.34s | 7.25s | 5.79s | 0.6s |
| 4 | 401 images/s | 10.20s | 8.9s | 5.78s | 0.71s |
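The speed column is consistent with throughput derived from the task size: with N workers each finishing a 1024-image task in the listed total task time, aggregate throughput is N × 1024 / time. A minimal check against the MobileNetV2 rows above (the 4-worker row matches to within rounding):

```python
def throughput(workers, total_task_time, images_per_task=1024):
    """Aggregate images/s: each worker processes images_per_task images per task."""
    return round(workers * images_per_task / total_task_time)

print(throughput(1, 6.06))  # -> 169
print(throughput(2, 8.34))  # -> 246
```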

#### Resnet50

| Workers | Speed | Total task time | AllReduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 168 images/s | 6.1s | - | - | 4.16s |
| 2 | 148 images/s | 13.76s | 10.36s | 5.04s | 1.35s |
| 4 | 228 images/s | 18s | 14.67s | 5.14s | 1.30s |

#### Compression model with Conv2DTranspose

| Workers | Speed | Total task time | AllReduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 109 images/s | 9.36s | - | - | 8.95s |
| 2 | 176 images/s | 11.65s | 1.47s | 9.36s | 0.42s |
| 4 | 328 images/s | 12.47s | 2.44s | 9.32s | 0.37s |