Replies: 1 comment
-
There was a problem with loss scaling when using multiple GPUs. Fixing that resolved the issue.
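For anyone hitting the same thing, below is a minimal sketch of the loss-scaling pattern the TensorFlow distributed-training guide recommends for custom training loops under `MirroredStrategy`. It is not necessarily the exact fix used here: the per-replica batch size is an assumed example, and the spatial averaging assumes a segmentation-style per-pixel loss.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# Assumed example: a per-replica batch of 8 on each available GPU.
PER_REPLICA_BATCH = 8
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    # Reduction.NONE keeps the per-pixel losses so we can scale them ourselves.
    loss_obj = tf.keras.losses.CategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

    def compute_loss(labels, predictions):
        # Per-pixel loss of shape [batch, height, width] for one-hot labels.
        per_pixel_loss = loss_obj(labels, predictions)
        # Average over the spatial dimensions -> one value per example.
        per_example_loss = tf.reduce_mean(per_pixel_loss, axis=[1, 2])
        # Divide by the *global* batch size, not the per-replica one, so that
        # gradients summed across replicas match the single-GPU run.
        return tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
```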
-
Hi Ross and community, I am working on distributed training and facing issues with model convergence, and I would like to know if you have any suggestions for improvement. Below is a summary.
I am using TensorFlow 2.5.0 and have implemented a semantic segmentation network: DeepLab v3+ with a ResNet101 backbone, trained with the Adam optimizer and categorical cross-entropy loss. I first developed the code for a single GPU, which reached a test accuracy (mean_iou) of 54% after 96 epochs. I then added tf.distribute.MirroredStrategy (single machine) to support multi-GPU training. Surprisingly, with 2 GPUs and 48 epochs of training, test mean_iou is only 27%, and with 4 GPUs and 24 epochs it is around 12%, on the same dataset.
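For context, here is a minimal sketch of the kind of MirroredStrategy + model.fit setup described above. The tiny model, batch size, and class count are placeholders for the real DeepLab v3+ / ResNet101 configuration.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one machine, all visible GPUs
PER_REPLICA_BATCH = 8                        # assumed value
GLOBAL_BATCH = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    # Trivial per-pixel classifier as a stand-in for DeepLab v3+ / ResNet101;
    # num_classes=21 is an assumed example value.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(21, 1, activation="softmax",
                               input_shape=(None, None, 3)),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.CategoricalCrossentropy())

# With model.fit, the dataset is batched at the *global* batch size and Keras
# splits each batch across replicas, handling the loss reduction internally.
# train_ds = train_ds.batch(GLOBAL_BATCH)
# model.fit(train_ds, epochs=96)
```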
Using the Reduction.AUTO / Reduction.NONE options, I am not scaling the loss by global_batch_size, on the understanding that TF takes care of it, and I am already normalizing the loss on each GPU. But I have had no luck with either option.
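When the loss is computed in a custom training loop rather than through model.fit, the usual pattern is Reduction.NONE plus an explicit division by the global batch size, because MirroredStrategy sums gradients across replicas. A sketch under the same assumptions as above (trivial stand-in model, assumed batch size):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH = 8 * strategy.num_replicas_in_sync  # assumed per-replica batch of 8

with strategy.scope():
    # Trivial stand-in for the real segmentation network.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    optimizer = tf.keras.optimizers.Adam(1e-4)
    loss_obj = tf.keras.losses.CategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

def train_step(images, labels):
    with tf.GradientTape() as tape:
        preds = model(images, training=True)
        per_example = loss_obj(labels, preds)
        # MirroredStrategy SUMS gradients across replicas, so each replica
        # must divide by the global batch size, not its local batch size.
        loss = tf.nn.compute_average_loss(per_example,
                                          global_batch_size=GLOBAL_BATCH)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_step(images, labels):
    per_replica_losses = strategy.run(train_step, args=(images, labels))
    # The per-replica losses are already scaled, so summing them gives the
    # global-batch average for logging.
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)
```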
I would appreciate any leads or suggestions for improving the test IoU in distributed training. Let me know if you need any additional details.
Thank you