Has anyone tried adopting ffcv in timm to accelerate training? #1161
-
@rwightman, thanks for your excellent work! I launched training with:

```sh
sh scripts/distributed_train.sh 8 /data/public/imagenet2012 --model searched-ema -b 256 --sched step --epochs 300 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 16 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.2 --drop-connect 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .064
```

and the logs look like this:

```
Train: 33 [ 0/625 ( 0%)] Loss: 2.353 (2.35) Time: 14.915s, 137.31/s (14.915s, 137.31/s) LR: 4.307e-02 Data: 14.237 (14.237)
Train: 33 [ 50/625 ( 8%)] Loss: 2.388 (2.37) Time: 0.800s, 2558.60/s (1.113s, 1840.25/s) LR: 4.307e-02 Data: 0.148 (0.405)
Train: 33 [ 100/625 ( 16%)] Loss: 2.408 (2.38) Time: 0.920s, 2225.21/s (0.988s, 2072.95/s) LR: 4.307e-02 Data: 0.130 (0.270)
Train: 33 [ 150/625 ( 24%)] Loss: 2.347 (2.37) Time: 0.866s, 2363.80/s (0.943s, 2172.79/s) LR: 4.307e-02 Data: 0.141 (0.225)
Train: 33 [ 200/625 ( 32%)] Loss: 2.401 (2.38) Time: 1.342s, 1525.82/s (0.928s, 2207.11/s) LR: 4.307e-02 Data: 0.130 (0.203)
Train: 33 [ 250/625 ( 40%)] Loss: 2.423 (2.39) Time: 0.831s, 2463.34/s (0.914s, 2240.56/s) LR: 4.307e-02 Data: 0.115 (0.188)
Train: 33 [ 300/625 ( 48%)] Loss: 2.366 (2.38) Time: 0.817s, 2508.20/s (0.900s, 2275.86/s) LR: 4.307e-02 Data: 0.129 (0.176)
Train: 33 [ 350/625 ( 56%)] Loss: 2.396 (2.39) Time: 0.852s, 2402.95/s (0.892s, 2295.01/s) LR: 4.307e-02 Data: 0.101 (0.168)
Train: 33 [ 400/625 ( 64%)] Loss: 2.526 (2.40) Time: 0.956s, 2141.70/s (0.885s, 2314.55/s) LR: 4.307e-02 Data: 0.137 (0.163)
Train: 33 [ 450/625 ( 72%)] Loss: 2.429 (2.40) Time: 0.799s, 2563.07/s (0.879s, 2330.38/s) LR: 4.307e-02 Data: 0.160 (0.159)
Train: 33 [ 500/625 ( 80%)] Loss: 2.447 (2.41) Time: 0.862s, 2376.46/s (0.875s, 2339.96/s) LR: 4.307e-02 Data: 0.128 (0.154)
Train: 33 [ 550/625 ( 88%)] Loss: 2.477 (2.41) Time: 0.792s, 2585.61/s (0.871s, 2350.02/s) LR: 4.307e-02 Data: 0.108 (0.150)
Train: 33 [ 600/625 ( 96%)] Loss: 2.409 (2.41) Time: 0.528s, 3881.56/s (0.866s, 2364.25/s) LR: 4.307e-02 Data: 0.064 (0.147)
Train: 33 [ 624/625 (100%)] Loss: 2.467 (2.42) Time: 0.430s, 4763.32/s (0.852s, 2405.14/s) LR: 4.307e-02 Data: 0.000 (0.143)
train one epoch time: 533.2229096889496
Distributing BatchNorm running means and vars
Test: [ 0/24] Time: 20.017 (20.017) Loss: 0.7583 (0.7583) Acc@1: 81.4453 (81.4453) Acc@5: 95.7031 (95.7031)
Test: [ 24/24] Time: 0.083 (1.281) Loss: 0.7417 (1.2107) Acc@1: 82.6651 (71.5980) Acc@5: 95.0472 (90.7900)
val one epoch time: 32.029253244400024
Test (EMA): [ 0/24] Time: 19.821 (19.821) Loss: 0.7158 (0.7158) Acc@1: 84.6191 (84.6191) Acc@5: 95.8008 (95.8008)
Test (EMA): [ 24/24] Time: 0.083 (1.270) Loss: 0.6978 (1.1468) Acc@1: 84.3160 (73.7880) Acc@5: 95.7547 (91.6260)
```

I found that the CPU load stays at 100% during training, so I suspect the bottleneck might lie in data loading.
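One way to test that hypothesis is to time the training DataLoader by itself (no model forward/backward) and compare its images/sec against the roughly 2400 img/s the epoch summary above reports. Below is a minimal sketch assuming a standard torchvision `ImageFolder` pipeline under the same data root; the `train` subfolder, transform choices, batch size, and worker count are only illustrative.

```python
import time

import torch
from torchvision import datasets, transforms

# Basic crop/flip only; RandAugment / random erasing are omitted, so this is a
# lower bound on the real per-image preprocessing cost.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# The "train" subfolder is an assumption about the dataset layout.
dataset = datasets.ImageFolder("/data/public/imagenet2012/train", transform=transform)
loader = torch.utils.data.DataLoader(
    dataset, batch_size=256, shuffle=True, num_workers=8, pin_memory=True
)

n_images = 0
start = time.time()
for i, (images, _targets) in enumerate(loader):
    if i < 5:               # let the worker processes warm up before timing
        start = time.time()
        continue
    n_images += images.size(0)
    if i >= 105:
        break

elapsed = time.time() - start
print(f"loader-only throughput: {n_images / elapsed:.1f} images/sec")
```

If the loader alone can't comfortably beat the training throughput, the pipeline is CPU/data-bound and faster decoding (Pillow-SIMD, FFCV) or fewer but better-placed workers will help more than anything on the GPU side.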
-
@Doraemonzm it can be a pain to keep up to date (due to breaking package dependency naming), but I always install Pillow-SIMD in my training environments; it significantly reduces CPU load when that is the constraint (and it often is, especially with the most recent GPUs like A100s). Many cloud A100 instances are rather underpowered in the CPU department (in my opinion), so keeping data preprocessing efficient matters, and that's why FFCV can have an impact.
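A quick way to confirm Pillow-SIMD is actually the active build is to look at the version string: Pillow-SIMD tags its releases with a `.postN` suffix, which stock Pillow doesn't use. A minimal sketch, relying on that version convention:

```python
# Minimal sketch: Pillow-SIMD releases are versioned with a ".postN" suffix
# (e.g. "9.0.0.post1"), so the version string shows which build is active.
import PIL

print("PIL version:", PIL.__version__)
if ".post" in PIL.__version__:
    print("Pillow-SIMD appears to be installed")
else:
    print("stock Pillow detected -- e.g. `pip uninstall pillow` then `pip install pillow-simd`")
```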
Don't overdo the `-j` arg: 8 train processes * 16 worker processes = 128 workers + 8 train processes. I doubt you have 136 physical cores, so you're just causing contention; for 8 GPUs on the same machine you likely want something between 4 and 8 workers per GPU. If `htop` or whatever sy…

You can always use FFCV with any timm model, but you lose all of the preprocessing / aug / reg features timm sets up (not in FFCV) and also bring in some issues like OpenCV's lack of anti-aliased downsampling, etc.
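For reference, a rough sketch of what an FFCV train loader feeding a timm model could look like, loosely following the public ffcv-imagenet example. The `.beton` path, batch size, and worker count are placeholders, the exact class names should be checked against the installed ffcv version, and note that this replaces timm's RandAugment / random-erasing pipeline with FFCV's much simpler augmentation.

```python
import numpy as np
import torch
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder
from ffcv.transforms import (
    ToTensor, ToDevice, ToTorchImage, NormalizeImage, RandomHorizontalFlip, Squeeze,
)

# Normalization constants in 0-255 scale, as FFCV's NormalizeImage expects.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]) * 255
IMAGENET_STD = np.array([0.229, 0.224, 0.225]) * 255

device = torch.device("cuda:0")  # in a DDP setup, this would be the local rank's device

# Decode + augment on the CPU workers, then move and normalize on the GPU.
image_pipeline = [
    RandomResizedCropRGBImageDecoder((224, 224)),
    RandomHorizontalFlip(),
    ToTensor(),
    ToDevice(device, non_blocking=True),
    ToTorchImage(),
    NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float16),
]
label_pipeline = [IntDecoder(), ToTensor(), Squeeze(), ToDevice(device, non_blocking=True)]

loader = Loader(
    "/path/to/imagenet_train.beton",  # placeholder; written beforehand with ffcv.writer.DatasetWriter
    batch_size=256,
    num_workers=6,                    # 4-8 workers per GPU, per the advice above
    order=OrderOption.QUASI_RANDOM,
    os_cache=False,
    drop_last=True,
    pipelines={"image": image_pipeline, "label": label_pipeline},
    distributed=True,                 # True when each process drives one GPU under torch.distributed
)

for images, labels in loader:
    ...  # images arrive on the GPU (fp16 here); feed them to any timm model as usual
```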