Long step time (0.1s) for ResNet-18 training on Imagenet #1542
-
Hi! I want to train a ResNet-18 on ImageNet using timm on a cluster with a V100 GPU and PyTorch 1.12.
Using 10 CPUs, with the data stored on an SSD and with Pillow replaced by Pillow-SIMD, as per the recommendations of this discussion, I obtain the following numbers at the beginning of my training:
Even though the data time appears quite low, I think that it actually is the data loading that is problematic: indeed, I know from timing it offline that the forward and backward passes of the ResNet-18 together take 0.02 s. I would like to know if this is expected and, if not, what I can do to accelerate the training.
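To see why the loader can still be the bottleneck even when the reported data time looks small, some throughput arithmetic is a useful sanity check. The 0.1 s step time and 0.02 s forward+backward time are from the post; the batch size of 128 is an assumption for illustration, since it isn't stated:

```python
# Hypothetical batch size; step_time and compute_time are from the post.
batch_size = 128
step_time = 0.1        # seconds per optimizer step (measured end to end)
compute_time = 0.02    # seconds for forward + backward alone (measured offline)

observed_throughput = batch_size / step_time       # images/sec actually achieved
compute_bound_limit = batch_size / compute_time    # images/sec if only the model mattered

# Fraction of each step spent outside the model: loading, augmentation,
# host-to-device copies. This overhead can dominate even when the logged
# "data time" looks low, because prefetching hides it from that counter.
overhead_fraction = 1 - compute_time / step_time

print(observed_throughput, compute_bound_limit, overhead_fraction)
```

With these numbers, the observed throughput sits far below the compute-bound limit, so roughly 80% of each step is spent outside the forward/backward pass.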
Replies: 2 comments 8 replies
-
@zaccharieramzi average data time still looks fairly high (the number in brackets is the average). Looks like there's a pretty slow start for the system, and the average throughput slowly increases towards approx 1200 im/sec. Not sure what sort of machine it is, but the persistent SSD disks on typical cloud instances aren't very fast. A resnet18, even in float32, should be a lot faster than that, yes. I use TFDS for imagenet in most shared drive / cloud scenarios and only use raw folder/file datasets on local machines.
-
@zaccharieramzi k, so dataloading is all good. I don't have a V100 handy, it broke :( but I have a Titan RTX, which is similar but a bit slower... So yeah, the numbers you see are normal, I think. I get 1000 im/sec with your settings and no loader bottleneck. Not sure if you intended to train with these settings, but if you want to increase throughput you should up the batch size, use AMP, and enable channels_last.
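The AMP + channels_last combination above can be sketched in plain PyTorch. This is a minimal illustration, not the timm training loop; the tiny `nn.Conv2d` stands in for a real model (with timm you would use `timm.create_model("resnet18")`), and it runs on CPU with bfloat16 here only so the sketch is self-contained (on a GPU you would use `device_type="cuda"`):

```python
import torch
import torch.nn as nn

# Stand-in for a real model such as timm.create_model("resnet18").
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# channels_last: move both the model weights and the inputs to NHWC
# memory layout, which speeds up convolutions on tensor-core GPUs.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(16, 3, 32, 32).to(memory_format=torch.channels_last)

# AMP: run the forward pass under autocast so eligible ops execute in a
# lower-precision dtype (float16 on CUDA, bfloat16 on CPU).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)
print(y.is_contiguous(memory_format=torch.channels_last))
```

On a GPU you would also wrap the backward pass with a `torch.cuda.amp.GradScaler` when using float16; larger batches mainly help by amortizing per-step overhead.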