Description
(I'm opening a new issue for this to start a conversation before submitting a pull request, so please let me know what you think.)
This adds new functionality to trainAsync(), so that Node.js users can utilize multiple GPUs and CPUs to train a single NeuralNetwork. This should significantly speed up training if you have a large neural net and/or a large training data set.
Is this a feature that we would want to merge into develop? [y/n]
Code
This branch, based on master, has a working example: massaroni/feature-parallel-training-m
This other branch is mergeable into develop, but develop is too unstable at this point to demo the multithreaded training. massaroni/feature-parallel-training
See the example in parallel-trainer-example.js. It basically just shows that the algorithm does converge.
See the main functionality in parallel-trainer.js
Documentation
trainAsync()
In parallel mode, trainAsync() can train a single net across multiple threads. This should speed up training for large nets, large training sets, or both.
Train a NeuralNetwork on 3 CPU threads:
const net = new brain.NeuralNetwork();
net
.trainAsync(data, {
parallel: {
threads: 3,
partitionSize: 1500, // optional. send a partition of 1500 items from the training set to each thread. Raise this number to get some overlap in the training data partitions.
epochs: 20000, // optional. limit each thread to 20,000 training runs
},
// ... and the usual training options
})
.then(res => {
// do something with my trained network
})
.catch(handleError);
Train a NeuralNetwork on 6 CPU threads and 2 GPU threads:
const net = new brain.NeuralNetwork();
net
.trainAsync(data, {
parallel: {
threads: {
NeuralNetwork: 6,
NeuralNetworkGPU: 2
}
},
// ... and the usual training options
})
.then(res => {
// do something with my trained network
})
.catch(handleError);
Train a single NeuralNetwork on 6 CPU threads and 2 GPU threads, and send 10x more training data to the GPUs because they can run through it faster:
const net = new brain.NeuralNetwork();
net
.trainAsync(data, {
parallel: {
threads: {
NeuralNetwork: {
threads: 6,
trainingDataSize: 2200
},
NeuralNetworkGPU: {
threads: 2,
trainingDataSize: 22000
}
}
},
// ... and the usual training options
})
.then(res => {
// do something with my trained network
})
.catch(handleError);
Roadmap
- support all other neural net types
- web workers, for multithreaded training in the browser
- distributed training (multiple machines) (async SGD w/stale gradient handling?)
Activity
mubaidr commented on Jul 15, 2019
Well, this sounds great! Just to update you: GPU support is already on the way, which will make brain.js super fast in both browser and Node.js environments, without requiring anything from the user's side.
Coming back to this implementation, I would love to hear how you are implementing this feature, theoretically, and whether it actually works (it should reduce the iterations or training time of the network).
In my quick tests it does not seem to help; in both cases the training iterations are more or less the same: https://repl.it/repls/WindyBossySymbol
Am I missing something?
massaroni commented on Jul 15, 2019
Thanks @mubaidr, this is based on parameter averaging and data parallelization. It's probably the most naive implementation possible, but that's a good start because it's easy to test, and it's running on a single machine anyway. The more sophisticated algorithms are mostly trying to deal with architectural challenges like I/O overhead and mismatching machines, so maybe we can still benefit from the naive implementation, on a single machine.
Basically, it splits the training data into partitions, one per thread, and each thread has a clone of the neural net. Each thread trains on its own partition, and then the trained nets are averaged together (mean average of corresponding weights in the nets). Then each thread is re-seeded with clones of the averaged net, rinse and repeat.
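A rough sketch of that averaging step, just to illustrate the idea (the names below are illustrative only; the real logic lives in parallel-trainer.js):
// Illustrative sketch of one parameter-averaging round (not the actual
// parallel-trainer.js code). Each worker trains a clone of the net on its own
// partition; the clones are then merged by averaging corresponding weights.
function averageNets(nets) {
  const jsons = nets.map(net => net.toJSON());
  const averaged = jsons[0]; // use the first clone as the template
  averaged.layers.forEach((layer, l) => {
    Object.keys(layer).forEach(nodeKey => {
      const node = layer[nodeKey];
      if (!node.weights) return; // input-layer nodes carry no weights
      node.bias = jsons.reduce((sum, j) => sum + j.layers[l][nodeKey].bias, 0) / jsons.length;
      Object.keys(node.weights).forEach(key => {
        node.weights[key] = jsons.reduce((sum, j) => sum + j.layers[l][nodeKey].weights[key], 0) / jsons.length;
      });
    });
  });
  return averaged; // each worker is re-seeded via net.fromJSON(averaged), then trains again
}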
I think overall we can expect that, compared to single-threaded training, this algorithm is always going to run through more total iterations. Ideally it should finish with fewer iterations per thread, so that training is faster. Along the way, each thread is converging in a slightly different direction, toward a local minimum in its assigned partition. If your data set has dramatic local minima, then you can configure the partitions to have some overlap, and I think that should help.
That said, the xor data is a poor use case for multithreaded training because it's so small and the local minima are pretty deep. There are only 4 training data points, so the Repl.it example with 8 cpu threads doesn't even have enough data for 1 training point per thread. I think that the only value in this example is just to show that it does converge at all.
I'd like to run some benchmarks on a large data set to quantify the performance gains. Do you have a favorite large example data set that I can test it on? My personal use case is too messy to publish here.
mubaidr commented on Jul 15, 2019
Interesting. Thanks for the explanation. 👍
Well, in that case I believe some image-data-based training or something similar would be helpful for testing this behavior: https://jsfiddle.net/8Lvynxz5/38/ I will do some testing in my free time.
Keep up the great work!
massaroni commented on Jul 15, 2019
Thanks @mubaidr, I found some image data sets on these sites, below. I'll pick a good one and run some benchmarks. Based on my schedule this week, I'll probably have an update about my findings in a few days.
deeplearning.net - Datasets
skymind - Open Datasets
Rocketblaster247 commented on Jul 15, 2019
Anything to make training faster!
massaroni commented on Aug 21, 2019
My findings:
It looks like the multithreaded trainer can get better performance than a single thread. The results are encouraging so far: with a low learning rate and good tuning, I'm getting ~10,000x better performance, which is very surprising, and with normal learning rates, I'm getting a roughly sub-linear performance gain with respect to thread count, as expected. This is by no means an exhaustive analysis, so I'm presenting it as an alpha and looking forward to getting some feedback. I could go nuts charting results in all different scenarios, but I think this is a good proof of concept so far, and I want to get some feedback about it first.
Methods:
In this benchmark, performance is measured both in wall clock time, and item-iterations per thread (data size * iterations). Compared to single-threaded training, I expected multithreaded training to get a lower wall clock time, more total iterations, and fewer item-iterations per thread. I expected the performance boost to be roughly linear, proportional to the number of threads.
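For reference, the metric works out like this (an illustrative helper, not the actual benchmark code):
// "Item-iterations per thread" measures how much work a single thread performs:
// the size of its training partition times the number of iterations it runs.
function itemIterationsPerThread(partitionSize, iterations) {
  return partitionSize * iterations;
}
// e.g. a thread training on a 1,500-item partition for 20,000 iterations
// performs 1500 * 20000 = 30,000,000 item-iterations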
This feature is designed for a large data set or a large net, or both. So at first, I started testing on the MNIST handwritten digit database, which is rather large and requires a large net to process it. However, my cycle time was too slow, so I had to find a smaller data set to work with. I ended up doing a simple math function approximation, because it was easy to generate the data on the fly and quick to test that it would converge on a single thread. Then I dialed down the learning rate, simply to necessitate more training iterations so that we can show a more robust comparison. Note that I did commit the code that reads in the MNIST database, so that's still included in this branch.
Conclusions:
With a low learning rate, I'm getting multiple orders of magnitude better performance. With more common learning rates, I'm getting a modest linear performance boost, as expected.
Multithreaded training requires new tuning considerations, and with bad tuning, it could be slower than a single thread or not converge at all.
New tuning considerations:
When your data set is partitioned, each partition should have enough data for a net to train on it in isolation and still converge. If you don't have enough redundancy in your data set, then you can configure the partitions to overlap. This also suggests that there's a limit to the number of threads you can utilize, depending on the data set and partitioning.
You want to limit the iterations for trainer threads to a very low number, so that they can merge their results frequently enough. I had good results with a per-thread iteration limit of 10 or less.
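Putting both considerations together, a configuration along these lines could be a reasonable starting point (the values are illustrative, and this assumes the epochs option above acts as the per-thread limit between merges):
const net = new brain.NeuralNetwork();
net
  .trainAsync(data, {
    parallel: {
      threads: 4,
      // give each thread more than data.length / 4 items so the partitions overlap
      partitionSize: Math.ceil((data.length / 4) * 1.5),
      // keep per-thread runs short so the threads merge (average) frequently
      epochs: 10,
    },
    // ... and the usual training options
  })
  .then(res => {
    // do something with my trained network
  })
  .catch(handleError);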
You can run the benchmarks yourself with this script (one version per branch):
benchmark/index.js (based on master)
benchmark/index.js (based on develop)
and you should get results like the ones summarized in my findings above.
Thoughts?
robertleeplummerjr commented on Aug 22, 2019
I think this is simply fantastic! I'm trying to focus my efforts on getting GPU support and the new API for network composition finished and tested. I say continue, and when your end is more polished and mine is as well, we'll converge?
Curious: in brain we use "iterations"; would you be opposed to changing "epochs" to match? That is, of course, if they are synonymous.
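For illustration, the rename would look something like this (hypothetical, since the option name is still up for discussion):
const net = new brain.NeuralNetwork();
net.trainAsync(data, {
  parallel: {
    threads: 3,
    partitionSize: 1500,
    iterations: 20000, // proposed name, matching brain.js terminology, in place of `epochs`
  },
});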
robertleeplummerjr commented on Aug 22, 2019
The number of typos I had to correct from my own typing had me believing my phone to be possessed...
Joinfield commented on Aug 22, 2019
Just amazing! Especially the multi-threaded GPU training!
mubaidr commented on May 10, 2020
It's in a beta state and expected to be released very soon.
We do have plans to implement this too. It might help users with powerful CPUs but no GPU.
goferito commented on May 10, 2020
If I understood correctly, it even allows using both the CPU and GPU, right? That would be really cool. Thanks a lot, guys, for doing this.
mubaidr commented on May 10, 2020
Yes, exactly. But how much it will actually affect performance when using the GPU (the GPU is already many times faster than the CPU) remains to be seen.
bor8 commented on Jul 11, 2020
I would like to donate something if, in my case, I could use four cores at the same time instead of one! Is this also intended for LSTM (not LSTMTimeStep)?
blackforest-t commented on Oct 2, 2020
I'd like to know about that too.
unicorn-style commented on Nov 29, 2021
Problems: 1 thread is faster than more than 1 thread.
VM with 4 vCPUs on ESXi, Node 16.
I had to change the module type to CommonJS to get it working.
Trying my dataset and my script, a single thread is faster than when you set the configuration with the parallel options.
Any thoughts?
////// Benchmark Results //////
LSTMTimeStep, 4 threads, 0.00005 LR: runtime = 2002 seconds, item iterations per thread = 80675, error = 0.00019999808332483684, test error = 0.01930622798077797
LSTMTimeStep, 3 threads, 0.00005 LR: runtime = 2357 seconds, item iterations per thread = 108120, error = 0.0001999364159483876, test error = 0.01775701457418095
LSTMTimeStep, 2 threads, 0.00005 LR: runtime = 716 seconds, item iterations per thread = 53340, error = 0.0001997283394060407, test error = 0.01856114064901209
LSTMTimeStep, single thread, 0.00005 LR: runtime = 3973 seconds, item iterations per thread = 95746200, error = 0.00019995401834603388, test error = 0.014280172646737196
2 threads, 0.0001 LR, overlapping partitions: runtime = 0 seconds, item iterations per thread = 19200, error = 0.004623700066624818, test error = 0.006383759797960875
4 threads, 0.0001 LR, overlapping partitions: runtime = 0 seconds, item iterations per thread = 15500, error = 0.003654747346571941, test error = 0.007451617616469838
Single thread, 0.001 LR: runtime = 0 seconds, item iterations per thread = 0, error = 0.003913228313435196, test error = 0.007906918111712248
2 threads, 0.001 LR, overlapping partitions: runtime = 0 seconds, item iterations per thread = 8000, error = 0.004012907401541029, test error = 0.006790354935259165
Single thread, 0.01 LR: runtime = 0 seconds, item iterations per thread = 0, error = 0.0005127324836436953, test error = 0.014539566666571174
2 threads, 0.01 LR, overlapping partitions: runtime = 0 seconds, item iterations per thread = 3200, error = 0.00024371662900630114, test error = 0.008307541957539067
richiedevs commented on Jul 5, 2022
I'd love multithreaded training!
imkane commented on Nov 27, 2022
Can't wait for this to be available for LSTMTimeStep 😁
shestakov-vladyslav commented on Feb 16, 2023
Desired
gokaybiz commented on Nov 7, 2023
Same here in 2023