Replies: 1 comment
-
There was a problem with loss scaling when using multiple GPUs. Fixing that resolved the issue.
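For anyone hitting the same thing, below is a minimal sketch of the loss-scaling pattern the TensorFlow distributed-training guide recommends for custom training loops under `MirroredStrategy`. It is not necessarily the exact fix used here: the per-replica batch size is an assumed example, and the spatial averaging assumes a segmentation-style per-pixel loss.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# Assumed example: a per-replica batch of 8 on each available GPU.
PER_REPLICA_BATCH = 8
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    # Reduction.NONE keeps the per-pixel losses so we can scale them ourselves.
    loss_obj = tf.keras.losses.CategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

    def compute_loss(labels, predictions):
        # Per-pixel loss of shape [batch, height, width] for one-hot labels.
        per_pixel_loss = loss_obj(labels, predictions)
        # Average over the spatial dimensions -> one value per example.
        per_example_loss = tf.reduce_mean(per_pixel_loss, axis=[1, 2])
        # Divide by the *global* batch size, not the per-replica one, so that
        # gradients summed across replicas match the single-GPU run.
        return tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
```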
-
Hi Ross and community, I am working on distributed training and facing issues with model convergence, and I would like to know if you have any suggestions for improvement. Below is a summary.
I am using TensorFlow 2.5.0 and have implemented a semantic segmentation network: DeepLab v3+ with a ResNet101 backbone, trained with the Adam optimizer and categorical cross-entropy loss. I first developed the code for a single GPU, which reached a test accuracy (mean_iou) of 54% after 96 epochs. I then added tf.distribute.MirroredStrategy (single machine) to support multi-GPU training. Surprisingly, with 2 GPUs and 48 epochs of training, test mean_iou is only 27%, and with 4 GPUs and 24 epochs it is around 12%, on the same dataset.
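For context, here is a minimal sketch of the kind of MirroredStrategy + model.fit setup described above. The tiny model, batch size, and class count are placeholders for the real DeepLab v3+ / ResNet101 configuration.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one machine, all visible GPUs
PER_REPLICA_BATCH = 8                        # assumed value
GLOBAL_BATCH = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    # Trivial per-pixel classifier as a stand-in for DeepLab v3+ / ResNet101;
    # num_classes=21 is an assumed example value.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(21, 1, activation="softmax",
                               input_shape=(None, None, 3)),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.CategoricalCrossentropy())

# With model.fit, the dataset is batched at the *global* batch size and Keras
# splits each batch across replicas, handling the loss reduction internally.
# train_ds = train_ds.batch(GLOBAL_BATCH)
# model.fit(train_ds, epochs=96)
```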
Using the Reduction.AUTO / Reduction.NONE options, I am not scaling the loss by global_batch_size, on the understanding that TF takes care of it, and I am already normalizing the loss on each GPU. But I have had no luck with either option.
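When the loss is computed in a custom training loop rather than through model.fit, the usual pattern is Reduction.NONE plus an explicit division by the global batch size, because MirroredStrategy sums gradients across replicas. A sketch under the same assumptions as above (trivial stand-in model, assumed batch size):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH = 8 * strategy.num_replicas_in_sync  # assumed per-replica batch of 8

with strategy.scope():
    # Trivial stand-in for the real segmentation network.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    optimizer = tf.keras.optimizers.Adam(1e-4)
    loss_obj = tf.keras.losses.CategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

def train_step(images, labels):
    with tf.GradientTape() as tape:
        preds = model(images, training=True)
        per_example = loss_obj(labels, preds)
        # MirroredStrategy SUMS gradients across replicas, so each replica
        # must divide by the global batch size, not its local batch size.
        loss = tf.nn.compute_average_loss(per_example,
                                          global_batch_size=GLOBAL_BATCH)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_step(images, labels):
    per_replica_losses = strategy.run(train_step, args=(images, labels))
    # The per-replica losses are already scaled, so summing them gives the
    # global-batch average for logging.
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)
```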
I would appreciate any leads or suggestions for improving the test IoU in distributed training. Let me know if you need any additional details.
Thank you