Has anyone tried adopting ffcv in timm to accelerate training? #1161
-
@rwightman, thanks for your excellent work! I launched training with:

```sh
sh scripts/distributed_train.sh 8 /data/public/imagenet2012 --model searched-ema -b 256 --sched step --epochs 300 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 16 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.2 --drop-connect 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .064
```

and the logs look like this:

```
Train: 33 [ 0/625 ( 0%)] Loss: 2.353 (2.35) Time: 14.915s, 137.31/s (14.915s, 137.31/s) LR: 4.307e-02 Data: 14.237 (14.237)
Train: 33 [ 50/625 ( 8%)] Loss: 2.388 (2.37) Time: 0.800s, 2558.60/s (1.113s, 1840.25/s) LR: 4.307e-02 Data: 0.148 (0.405)
Train: 33 [ 100/625 ( 16%)] Loss: 2.408 (2.38) Time: 0.920s, 2225.21/s (0.988s, 2072.95/s) LR: 4.307e-02 Data: 0.130 (0.270)
Train: 33 [ 150/625 ( 24%)] Loss: 2.347 (2.37) Time: 0.866s, 2363.80/s (0.943s, 2172.79/s) LR: 4.307e-02 Data: 0.141 (0.225)
Train: 33 [ 200/625 ( 32%)] Loss: 2.401 (2.38) Time: 1.342s, 1525.82/s (0.928s, 2207.11/s) LR: 4.307e-02 Data: 0.130 (0.203)
Train: 33 [ 250/625 ( 40%)] Loss: 2.423 (2.39) Time: 0.831s, 2463.34/s (0.914s, 2240.56/s) LR: 4.307e-02 Data: 0.115 (0.188)
Train: 33 [ 300/625 ( 48%)] Loss: 2.366 (2.38) Time: 0.817s, 2508.20/s (0.900s, 2275.86/s) LR: 4.307e-02 Data: 0.129 (0.176)
Train: 33 [ 350/625 ( 56%)] Loss: 2.396 (2.39) Time: 0.852s, 2402.95/s (0.892s, 2295.01/s) LR: 4.307e-02 Data: 0.101 (0.168)
Train: 33 [ 400/625 ( 64%)] Loss: 2.526 (2.40) Time: 0.956s, 2141.70/s (0.885s, 2314.55/s) LR: 4.307e-02 Data: 0.137 (0.163)
Train: 33 [ 450/625 ( 72%)] Loss: 2.429 (2.40) Time: 0.799s, 2563.07/s (0.879s, 2330.38/s) LR: 4.307e-02 Data: 0.160 (0.159)
Train: 33 [ 500/625 ( 80%)] Loss: 2.447 (2.41) Time: 0.862s, 2376.46/s (0.875s, 2339.96/s) LR: 4.307e-02 Data: 0.128 (0.154)
Train: 33 [ 550/625 ( 88%)] Loss: 2.477 (2.41) Time: 0.792s, 2585.61/s (0.871s, 2350.02/s) LR: 4.307e-02 Data: 0.108 (0.150)
Train: 33 [ 600/625 ( 96%)] Loss: 2.409 (2.41) Time: 0.528s, 3881.56/s (0.866s, 2364.25/s) LR: 4.307e-02 Data: 0.064 (0.147)
Train: 33 [ 624/625 (100%)] Loss: 2.467 (2.42) Time: 0.430s, 4763.32/s (0.852s, 2405.14/s) LR: 4.307e-02 Data: 0.000 (0.143)
train one epoch time: 533.2229096889496
Distributing BatchNorm running means and vars
Test: [ 0/24] Time: 20.017 (20.017) Loss: 0.7583 (0.7583) Acc@1: 81.4453 (81.4453) Acc@5: 95.7031 (95.7031)
Test: [ 24/24] Time: 0.083 (1.281) Loss: 0.7417 (1.2107) Acc@1: 82.6651 (71.5980) Acc@5: 95.0472 (90.7900)
val one epoch time: 32.029253244400024
Test (EMA): [ 0/24] Time: 19.821 (19.821) Loss: 0.7158 (0.7158) Acc@1: 84.6191 (84.6191) Acc@5: 95.8008 (95.8008)
Test (EMA): [ 24/24] Time: 0.083 (1.270) Loss: 0.6978 (1.1468) Acc@1: 84.3160 (73.7880) Acc@5: 95.7547 (91.6260)
```

I found that the CPU load stays at 100% during training, so I suspect the bottleneck might lie in data loading.
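One way to test that hypothesis is to time the training DataLoader by itself (no model forward/backward) and compare its images/sec against the roughly 2400 img/s the epoch summary above reports. Below is a minimal sketch assuming a standard torchvision `ImageFolder` pipeline under the same data root; the `train` subfolder, transform choices, batch size, and worker count are only illustrative.

```python
import time

import torch
from torchvision import datasets, transforms

# Basic crop/flip only; RandAugment / random erasing are omitted, so this is a
# lower bound on the real per-image preprocessing cost.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# The "train" subfolder is an assumption about the dataset layout.
dataset = datasets.ImageFolder("/data/public/imagenet2012/train", transform=transform)
loader = torch.utils.data.DataLoader(
    dataset, batch_size=256, shuffle=True, num_workers=8, pin_memory=True
)

n_images = 0
start = time.time()
for i, (images, _targets) in enumerate(loader):
    if i < 5:               # let the worker processes warm up before timing
        start = time.time()
        continue
    n_images += images.size(0)
    if i >= 105:
        break

elapsed = time.time() - start
print(f"loader-only throughput: {n_images / elapsed:.1f} images/sec")
```

If the loader alone can't comfortably beat the training throughput, the pipeline is CPU/data-bound and faster decoding (Pillow-SIMD, FFCV) or fewer but better-placed workers will help more than anything on the GPU side.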
-
@Doraemonzm it can be a pain to keep up to date (due to breaking package dependency naming), but I always install Pillow-SIMD in my training environments; it significantly reduces CPU load when that is the constraint (and it often is, especially with the most recent GPUs like A100s). Many cloud A100 instances are rather underpowered in the CPU department (in my opinion), so keeping data preprocessing efficient matters, and that's why FFCV can have an impact.
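A quick way to confirm Pillow-SIMD is actually the active build is to look at the version string: Pillow-SIMD tags its releases with a `.postN` suffix, which stock Pillow doesn't use. A minimal sketch, relying on that version convention:

```python
# Minimal sketch: Pillow-SIMD releases are versioned with a ".postN" suffix
# (e.g. "9.0.0.post1"), so the version string shows which build is active.
import PIL

print("PIL version:", PIL.__version__)
if ".post" in PIL.__version__:
    print("Pillow-SIMD appears to be installed")
else:
    print("stock Pillow detected -- e.g. `pip uninstall pillow` then `pip install pillow-simd`")
```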
Don't overdo the `-j` arg: 8 train processes * 16 worker processes = 128 workers + 8 train processes. I doubt you have 136 physical cores, so you're just causing contention; for 8 GPUs on the same machine you likely want something between 4 and 8 workers per GPU. If `htop` or whatever sy…

You can always use FFCV with any timm model, but you lose all of the preprocessing / aug / reg features timm sets up (not in FFCV) and also bring in some issues like OpenCV's lack of anti-aliased downsampling, etc.
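For reference, a rough sketch of what an FFCV train loader feeding a timm model could look like, loosely following the public ffcv-imagenet example. The `.beton` path, batch size, and worker count are placeholders, the exact class names should be checked against the installed ffcv version, and note that this replaces timm's RandAugment / random-erasing pipeline with FFCV's much simpler augmentation.

```python
import numpy as np
import torch
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder
from ffcv.transforms import (
    ToTensor, ToDevice, ToTorchImage, NormalizeImage, RandomHorizontalFlip, Squeeze,
)

# Normalization constants in 0-255 scale, as FFCV's NormalizeImage expects.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]) * 255
IMAGENET_STD = np.array([0.229, 0.224, 0.225]) * 255

device = torch.device("cuda:0")  # in a DDP setup, this would be the local rank's device

# Decode + augment on the CPU workers, then move and normalize on the GPU.
image_pipeline = [
    RandomResizedCropRGBImageDecoder((224, 224)),
    RandomHorizontalFlip(),
    ToTensor(),
    ToDevice(device, non_blocking=True),
    ToTorchImage(),
    NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float16),
]
label_pipeline = [IntDecoder(), ToTensor(), Squeeze(), ToDevice(device, non_blocking=True)]

loader = Loader(
    "/path/to/imagenet_train.beton",  # placeholder; written beforehand with ffcv.writer.DatasetWriter
    batch_size=256,
    num_workers=6,                    # 4-8 workers per GPU, per the advice above
    order=OrderOption.QUASI_RANDOM,
    os_cache=False,
    drop_last=True,
    pipelines={"image": image_pipeline, "label": label_pipeline},
    distributed=True,                 # True when each process drives one GPU under torch.distributed
)

for images, labels in loader:
    ...  # images arrive on the GPU (fp16 here); feed them to any timm model as usual
```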