Replies: 1 comment
-
There was a problem with loss scaling when using multiple GPUs. Fixing that resolved the issue.
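For anyone hitting the same thing: under MirroredStrategy the per-replica gradients are summed, so dividing each replica's loss only by its local batch size effectively multiplies the gradient by the number of replicas. Below is a minimal sketch of the kind of scaling needed in a custom training loop (the batch size is a placeholder, and the loss shape assumes a segmentation-style per-pixel output):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

PER_REPLICA_BATCH_SIZE = 8  # placeholder value
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

with strategy.scope():
    # Reduction.NONE keeps per-example losses, so the averaging stays under our control.
    loss_obj = tf.keras.losses.CategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, predictions):
    # For segmentation outputs the unreduced loss is per pixel: shape [batch, H, W].
    per_pixel_loss = loss_obj(labels, predictions)
    # Collapse the spatial dimensions to get one loss value per example.
    per_example_loss = tf.reduce_mean(per_pixel_loss, axis=[1, 2])
    # Divide by the GLOBAL batch size, not the per-replica batch size.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
```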
-
Hi Ross and community, I am working on distributed training and facing issues with model convergence, and I would appreciate any suggestions for improvement. Below is a summary.
I am using TensorFlow 2.5.0 and have implemented a semantic segmentation network: DeepLab_v3_plus with a ResNet101 backbone, trained with the Adam optimizer and categorical cross-entropy loss. I first developed the code for a single GPU, which reached a test accuracy (mean_iou) of 54% after training for 96 epochs. I then added tf.distribute.MirroredStrategy (one machine) to support multi-GPU training. Surprisingly, with 2 GPUs trained for 48 epochs the test mean_iou is just 27%, and with 4 GPUs trained for 24 epochs it is around 12%, on the same dataset.
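Roughly, the multi-GPU setup follows the standard MirroredStrategy pattern (a sketch only; build_deeplab_v3_plus and make_dataset are placeholders for my actual model builder and input pipeline):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one machine, all visible GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

PER_REPLICA_BATCH_SIZE = 8  # placeholder value
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

# Model and optimizer are created inside the strategy scope so their
# variables are mirrored across the GPUs.
with strategy.scope():
    model = build_deeplab_v3_plus(backbone="resnet101")  # placeholder builder
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# The dataset is batched with the GLOBAL batch size; the strategy then
# splits each batch across the replicas.
train_ds = make_dataset().batch(GLOBAL_BATCH_SIZE)  # placeholder pipeline
dist_train_ds = strategy.experimental_distribute_dataset(train_ds)
```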
With the .AUTO/.NONE reduction options, I am not scaling the loss by global_batch_size, on the understanding that TF takes care of it and that I am already normalizing per GPU; but I have had no luck with either option.
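For reference, my custom training step follows the usual strategy.run pattern, roughly like this (sketch only; model, optimizer, strategy, and compute_loss stand in for my actual objects):

```python
@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        images, labels = inputs
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)
            loss = compute_loss(labels, predictions)
        grads = tape.gradient(loss, model.trainable_variables)
        # MirroredStrategy sums gradients across replicas before applying them,
        # which is why the loss normalization matters.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    # Aggregate the per-replica losses for logging.
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
```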
I would appreciate any lead or suggestion for improving test IoU in distributed training. Let me know if you need any additional details.
Thank you