training speed problem #1071
-
I tried to train ResNeXt-50 32x4d on 8x A100 using the scripts as provided. From the training log, I observed that training gets slower from batch to batch within each epoch. Do you know where the issue is? Could other hardware, such as the CPU, be the bottleneck?

Train: 0 [ 50/834 ( 6%)] Loss: 6.940 (6.94) Time: 0.334s, 4598.09/s (0.898s, 1709.76/s) LR: 1.000e-04 Data: 0.048 (0.231)
Train: 0 [ 100/834 ( 12%)] Loss: 6.939 (6.94) Time: 0.262s, 5861.05/s (2.308s, 665.63/s) LR: 1.000e-04 Data: 0.038 (1.417)
Train: 0 [ 150/834 ( 18%)] Loss: 6.940 (6.94) Time: 0.256s, 6009.04/s (3.347s, 458.94/s) LR: 1.000e-04 Data: 0.036 (1.927)
Train: 0 [ 200/834 ( 24%)] Loss: 6.926 (6.94) Time: 11.068s, 138.78/s (3.993s, 384.71/s) LR: 1.000e-04 Data: 0.038 (1.584)
Train: 0 [ 250/834 ( 30%)] Loss: 6.934 (6.94) Time: 0.257s, 5982.28/s (4.286s, 358.35/s) LR: 1.000e-04 Data: 0.031 (1.459)
Train: 0 [ 300/834 ( 36%)] Loss: 6.925 (6.94) Time: 0.258s, 5958.37/s (4.548s, 337.74/s) LR: 1.000e-04 Data: 0.037 (1.223)
Train: 0 [ 350/834 ( 42%)] Loss: 6.926 (6.93) Time: 22.307s, 68.86/s (4.716s, 325.72/s) LR: 1.000e-04 Data: 0.030 (1.055)
Train: 0 [ 400/834 ( 48%)] Loss: 6.923 (6.93) Time: 0.260s, 5903.79/s (4.831s, 317.92/s) LR: 1.000e-04 Data: 0.033 (1.003)
Train: 0 [ 450/834 ( 54%)] Loss: 6.930 (6.93) Time: 0.267s, 5750.61/s (4.895s, 313.80/s) LR: 1.000e-04 Data: 0.042 (0.940)
Train: 0 [ 500/834 ( 60%)] Loss: 6.924 (6.93) Time: 0.253s, 6072.91/s (4.987s, 308.01/s) LR: 1.000e-04 Data: 0.038 (0.878)
Train: 0 [ 550/834 ( 66%)] Loss: 6.927 (6.93) Time: 0.292s, 5269.00/s (5.098s, 301.29/s) LR: 1.000e-04 Data: 0.033 (0.859)
Train: 0 [ 600/834 ( 72%)] Loss: 6.923 (6.93) Time: 0.253s, 6066.72/s (5.119s, 300.08/s) LR: 1.000e-04 Data: 0.034 (0.909)
Train: 0 [ 650/834 ( 78%)] Loss: 6.927 (6.93) Time: 0.259s, 5922.92/s (5.138s, 298.97/s) LR: 1.000e-04 Data: 0.041 (1.092)
Train: 0 [ 700/834 ( 84%)] Loss: 6.927 (6.93) Time: 14.519s, 105.79/s (5.202s, 295.25/s) LR: 1.000e-04 Data: 14.272 (1.326)
Train: 0 [ 750/834 ( 90%)] Loss: 6.920 (6.93) Time: 5.617s, 273.48/s (5.242s, 292.99/s) LR: 1.000e-04 Data: 1.984 (1.522)
Train: 0 [ 800/834 ( 96%)] Loss: 6.921 (6.93) Time: 0.260s, 5896.42/s (5.285s, 290.62/s) LR: 1.000e-04 Data: 0.043 (1.484)
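
To make the pattern easier to see, here is a minimal sketch that pulls the per-step `Time` and `Data` values out of log lines like the ones above and flags the slow steps. It assumes that the first number after `Time:` and `Data:` is the value for the logged step and that the number in parentheses is a running average. `parse_line`, `find_stalls`, and the 1-second threshold are hypothetical names and choices for illustration, not part of the training script.

```python
import re

# Matches the step counter, the per-step Time, and the per-step Data value.
# Assumption: the parenthesised numbers are running averages.
LINE_RE = re.compile(
    r"Train:\s*(?P<epoch>\d+)\s*\[\s*(?P<step>\d+)/(?P<total>\d+).*?"
    r"Time:\s*(?P<time>[\d.]+)s,.*?"
    r"Data:\s*(?P<data>[\d.]+)\s*\((?P<data_avg>[\d.]+)\)"
)


def parse_line(line):
    """Return step, step time, and data time for one log line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "time": float(m["time"]),
        "data": float(m["data"]),
    }


def find_stalls(lines, threshold_s=1.0):
    """Collect steps slower than threshold_s, with their data-time share."""
    stalls = []
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["time"] < threshold_s:
            continue
        # Fraction of the step spent waiting on the data loader.
        rec["data_fraction"] = rec["data"] / rec["time"]
        stalls.append(rec)
    return stalls


# Three lines copied from the log above.
SAMPLE = [
    "Train: 0 [ 200/834 ( 24%)] Loss: 6.926 (6.94) Time: 11.068s, 138.78/s "
    "(3.993s, 384.71/s) LR: 1.000e-04 Data: 0.038 (1.584)",
    "Train: 0 [ 250/834 ( 30%)] Loss: 6.934 (6.94) Time: 0.257s, 5982.28/s "
    "(4.286s, 358.35/s) LR: 1.000e-04 Data: 0.031 (1.459)",
    "Train: 0 [ 700/834 ( 84%)] Loss: 6.927 (6.93) Time: 14.519s, 105.79/s "
    "(5.202s, 295.25/s) LR: 1.000e-04 Data: 14.272 (1.326)",
]

if __name__ == "__main__":
    for s in find_stalls(SAMPLE):
        print(f"step {s['step']}: {s['time']:.3f}s total, "
              f"{s['data_fraction']:.0%} in data loading")
```

On the sample lines, step 700 spends almost all of its 14.519 s in `Data`, while step 200 takes 11.068 s with only 0.038 s in `Data`. So, at least on the process that writes the log, some of the stalls do not show up as data-loading time for that step.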
@rwightman