What is the reason for this call to synchronize? #1388
-
I was wondering if this call to synchronize is necessary: https://github.com/rwightman/pytorch-image-models/blob/master/train.py#L753 ? I could not find it in, e.g., https://github.com/pytorch/examples/blob/main/imagenet/main.py and we also don't have it in ...
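For context, a minimal sketch of the pattern being asked about, an explicit CUDA sync at the end of every training step. This is not the actual timm training loop; `train_step`, `loss_fn`, etc. are placeholder names for illustration:

```python
import torch

def train_step(model, inputs, targets, loss_fn, optimizer):
    # Ordinary forward / backward / optimizer step.
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The call in question: blocks the host until all CUDA work queued so far
    # has finished before the step is considered done.
    torch.cuda.synchronize()
    return loss
```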
Replies: 1 comment 5 replies
-
@mitchellnw It's not necessary in this train script and could be removed, since the only use of the output in logging is via .item() (an implicit synchronization). There might be a possible issue with the loss reduction for distributed training being unreliable without it? That should be checked.
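A rough illustration of the two points above, assuming a standard PyTorch distributed setup; `log_step` and its arguments are made up for the sketch and are not timm's actual code:

```python
import torch
import torch.distributed as dist

def log_step(loss, world_size):
    # Averaging the loss across ranks enqueues an all_reduce on the current
    # CUDA stream; by itself it does not block the host.
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= world_size

    # .item() copies the value to the host and implicitly waits for the
    # preceding CUDA work on that tensor, so logging already synchronizes.
    return reduced.item()
```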
Removing it could slightly increase throughput, but you'd have to measure from the start to the end of an epoch rather than relying on the avg_meter, as removing it would potentially make the batch-to-batch timing unreliable...
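A sketch of what measuring from the start to the end of an epoch could look like if the per-step sync were removed; `timed_epoch` and `step_fn` are hypothetical names, not part of the train script:

```python
import time
import torch

def timed_epoch(loader, step_fn):
    # Sync only at the epoch boundaries; without the per-step synchronize,
    # per-batch timings no longer reflect when the GPU work actually finishes.
    torch.cuda.synchronize()
    start = time.time()
    n_samples = 0
    for inputs, targets in loader:
        step_fn(inputs, targets)
        n_samples += inputs.size(0)
    torch.cuda.synchronize()  # wait for all queued work before stopping the clock
    return n_samples / (time.time() - start)  # samples / sec
```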
It is necessary in the `bits_and_tpu` branch, where taking the output of the model and accumulating it in another device tensor appears to cause a race: https://github.com/rwightman/pytorch-image-models/blob/bits_and_tpu/train.py#L777-L787
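For illustration only, a rough sketch of the kind of accumulate-into-a-device-tensor pattern described above, with an explicit sync before the accumulated value is read; this is not the actual `bits_and_tpu` code, and the class name is invented:

```python
import torch

class LossAccumulator:
    """Illustrative accumulator that keeps a running sum on the device."""

    def __init__(self, device):
        self.total = torch.zeros((), device=device)
        self.count = 0

    def update(self, loss):
        # Accumulate into a separate device tensor without copying to the host.
        self.total += loss.detach()
        self.count += 1

    def compute(self):
        # Wait for all queued device work before reading the result; the
        # comment above reports a race in the bits_and_tpu branch without
        # a synchronization point like this.
        torch.cuda.synchronize()
        return (self.total / max(self.count, 1)).item()
```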