
New Feature: Multithreaded Training #417

Description

massaroni (Contributor)


(I'm opening a new issue for this to start a conversation before submitting a pull request, so please let me know what you think.)

This adds new functionality to trainAsync(), so that Node.js users can utilize multiple GPUs and CPUs to train a single NeuralNetwork. This should significantly speed up training if you have a large neural net and/or a large training data set.

Is this a feature that we would want to merge into develop? [y/n]

Code

This branch, based on master, has a working example: massaroni/feature-parallel-training-m
This other branch is mergeable into develop, but develop is too unstable at this point to demo the multithreaded training. massaroni/feature-parallel-training

See the example in parallel-trainer-example.js. It basically just shows that the algorithm does converge.
See the main functionality in parallel-trainer.js.

Documentation

trainAsync(), in parallel mode, can train a single net on multiple threads. This should speed up training for large nets, large training sets, or both.

Train a NeuralNetwork on 3 CPU threads.

  const net = new brain.NeuralNetwork();
  net
    .trainAsync(data, {
      parallel: {
        threads: 3,
        partitionSize: 1500, // optional. send a partition of 1500 items from the training set to each thread.  Raise this number to get some overlap in the training data partitions.
        epochs: 20000, // optional. limit each thread to 20,000 training runs
      },
      // ... and the usual training options
    })
    .then(res => {
      // do something with my trained network
    })
    .catch(handleError);

Train a NeuralNetwork on 6 CPU threads and 2 GPU threads.

  const net = new brain.NeuralNetwork();
  net
    .trainAsync(data, {
      parallel: {
        threads: {
          NeuralNetwork: 6,
          NeuralNetworkGPU: 2
        }
      },
      // ... and the usual training options
    })
    .then(res => {
      // do something with my trained network
    })
    .catch(handleError);

Train a single NeuralNetwork on 6 CPU threads and 2 GPU threads, and send 10x more training data to the GPUs because they can run through it faster.

  const net = new brain.NeuralNetwork();
  net
    .trainAsync(data, {
      parallel: {
        threads: {
          NeuralNetwork: {
            threads: 6,
            trainingDataSize: 2200
          },
          NeuralNetworkGPU: {
            threads: 2,
            trainingDataSize: 22000
          }
        }
      },
      // ... and the usual training options
    })
    .then(res => {
      // do something with my trained network
    })
    .catch(handleError);

Roadmap

  • support all other neural net types
  • web workers, for multithreaded training in the browser
  • distributed training (multiple machines) (async SGD w/stale gradient handling?)

Activity

mubaidr (Contributor) commented on Jul 15, 2019

Well, this sounds great! Just to update you: GPU support is already on the way, which will make brain.js super fast in both browser and Node.js environments, without requiring anything from the user side.

Coming back to this implementation, I would love to hear how you are implementing this feature, theoretically, and whether it actually works (it should reduce the iterations or training time of the network).

In my quick tests this does not seem to help; in both cases the training iterations are more or less the same: https://repl.it/repls/WindyBossySymbol

output:

iterations: 100, training error: 0.25818311882582456
iterations: 200, training error: 0.25800706443927357
iterations: 300, training error: 0.2578366663269325
iterations: 400, training error: 0.2576723716144711
iterations: 500, training error: 0.2575143128648252
iterations: 600, training error: 0.2573622537363729
iterations: 700, training error: 0.2572154941867293
iterations: 800, training error: 0.25707281095860957
iterations: 900, training error: 0.25693213970593565
iterations: 1000, training error: 0.25679010187229867
iterations: 1100, training error: 0.2566410622835251
iterations: 1200, training error: 0.2564750828459159
iterations: 1300, training error: 0.2562730074033759
iterations: 1400, training error: 0.2559948550585196
iterations: 1500, training error: 0.2555482658486705
iterations: 1600, training error: 0.2547033046119731
iterations: 1700, training error: 0.25288047943778835
iterations: 1800, training error: 0.24880152404029698
iterations: 1900, training error: 0.24041909038012865
iterations: 2000, training error: 0.22613661489458114
iterations: 2100, training error: 0.2085759876262676
iterations: 2200, training error: 0.19321029472878642
iterations: 2300, training error: 0.1805968646341638
iterations: 2400, training error: 0.1681289670890776
iterations: 2500, training error: 0.15277506975308214
iterations: 2600, training error: 0.13133523941899217
iterations: 2700, training error: 0.10298459101445756
iterations: 2800, training error: 0.07308997065739647
iterations: 2900, training error: 0.04928476756839584
iterations: 3000, training error: 0.03366073295542782
iterations: 3100, training error: 0.02403211479609829
iterations: 3200, training error: 0.018004526605472707
iterations: 3300, training error: 0.014065349103011215
iterations: 3400, training error: 0.011368400954631375
iterations: 3500, training error: 0.009442459782012247
iterations: 3600, training error: 0.008016518676355791
iterations: 3700, training error: 0.006928103610264663
iterations: 3800, training error: 0.006075717862402963
iterations: 3900, training error: 0.0053935231070073725
{ error: 0.004998328924858023, iterations: 3969 }
normal: 5792.452ms
iterations: 100, training error: 0.2583618614108776
iterations: 200, training error: 0.2581863411353922
iterations: 300, training error: 0.258014347597987
iterations: 400, training error: 0.25784683062415414
iterations: 500, training error: 0.2576844676886525
iterations: 600, training error: 0.257527535550679
iterations: 700, training error: 0.2573757396917138
iterations: 800, training error: 0.2572280633955959
iterations: 900, training error: 0.2570824604326649
iterations: 1000, training error: 0.2569354179554155
iterations: 1100, training error: 0.2567806937857542
iterations: 1200, training error: 0.2566068921395175
iterations: 1300, training error: 0.25639189843085575
iterations: 1400, training error: 0.2560892586667549
iterations: 1500, training error: 0.25559406375235505
iterations: 1600, training error: 0.2546562652210833
iterations: 1700, training error: 0.2526846248860013
iterations: 1800, training error: 0.24843476345960558
iterations: 1900, training error: 0.2398532284194126
iterations: 2000, training error: 0.225266258589784
iterations: 2100, training error: 0.2075494480327098
iterations: 2200, training error: 0.19200110866068304
iterations: 2300, training error: 0.17873487036086755
iterations: 2400, training error: 0.16502797542351472
iterations: 2500, training error: 0.1476880848635918
iterations: 2600, training error: 0.12383270000032548
iterations: 2700, training error: 0.0943299946069216
iterations: 2800, training error: 0.06581099531372298
iterations: 2900, training error: 0.04443505464640163
iterations: 3000, training error: 0.03069075213928414
iterations: 3100, training error: 0.022194948138988882
iterations: 3200, training error: 0.01681723219724285
iterations: 3300, training error: 0.01325963734705865
iterations: 3400, training error: 0.010796844203295422
iterations: 3500, training error: 0.009021366138514925
iterations: 3600, training error: 0.0076962426081793045
iterations: 3700, training error: 0.006677933082800236
iterations: 3800, training error: 0.005875889767725823
iterations: 3900, training error: 0.005230826511272005
{ error: 0.0049966083428647536, iterations: 3942 }
parallel: 5756.476ms

Am I missing something?

massaroni (Contributor, Author) commented on Jul 15, 2019

Thanks @mubaidr, this is based on parameter averaging and data parallelization. It's probably the most naive implementation possible, but that's a good start because it's easy to test, and it's running on a single machine anyway. The more sophisticated algorithms are mostly trying to deal with architectural challenges like I/O overhead and mismatching machines, so maybe we can still benefit from the naive implementation, on a single machine.

Basically, it splits the training data into partitions, one per thread, and each thread has a clone of the neural net. Each thread trains on its own partition, and then the trained nets are averaged together (mean average of corresponding weights in the nets). Then each thread is re-seeded with clones of the averaged net, rinse and repeat.
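
For anyone skimming the thread, here is a minimal sketch of that merge step: mean-averaging the corresponding weights from each thread's trained clone. The flat-array weight layout and function name are assumptions for illustration only, not the branch's actual code or brain.js's serialization format.

  // Hypothetical illustration of parameter averaging: each thread hands back a
  // flat array of weights, and the merged net takes the element-wise mean.
  function averageWeights(weightSets) {
    const threads = weightSets.length;
    return weightSets[0].map(
      (_, i) => weightSets.reduce((sum, ws) => sum + ws[i], 0) / threads
    );
  }

  // One training round, as described above:
  //   1. clone the current net, one copy per thread
  //   2. train each clone on its own data partition
  //   3. average the clones' weights with averageWeights()
  //   4. re-seed every thread with the averaged net, rinse and repeat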

I think overall we can expect that compared to single threaded training, this algorithm is always going to run through more total iterations. Ideally it should finish with fewer iterations per thread, so that training is faster. Along the way, each thread is converging in a slightly different direction, toward a local minimum in its assigned partition. If your data set has dramatic local minima, then you can configure the partitions to have some overlap, and I think that should help.
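
To make the overlap idea concrete, here is a hedged sketch of one way to build overlapping partitions; the function name and wrap-around strategy are mine, not necessarily what parallel-trainer.js does.

  // Hypothetical partitioner: each thread starts at its own offset, and a
  // partitionSize larger than data.length / threads makes neighbouring
  // partitions overlap, which softens per-partition local minima.
  function partitionWithOverlap(data, threads, partitionSize) {
    const stride = Math.floor(data.length / threads);
    const partitions = [];
    for (let t = 0; t < threads; t++) {
      const partition = [];
      for (let i = 0; i < partitionSize; i++) {
        // wrap around so every partition gets exactly partitionSize items
        partition.push(data[(t * stride + i) % data.length]);
      }
      partitions.push(partition);
    }
    return partitions;
  }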

That said, the xor data is a poor use case for multithreaded training because it's so small and the local minima are pretty deep. There are only 4 training data points, so the Repl.it example with 8 cpu threads doesn't even have enough data for 1 training point per thread. I think that the only value in this example is just to show that it does converge at all.

I'd like to run some benchmarks on a large data set to quantify the performance gains. Do you have a favorite large example data set that I can test it on? My personal use case is too messy to publish here.

mubaidr (Contributor) commented on Jul 15, 2019

> Thanks @mubaidr, this is based on parameter averaging and data parallelization. It's probably the most naive implementation possible, but that's a good start because it's easy to test, and it's running on a single machine anyway. The more sophisticated algorithms are mostly trying to deal with architectural challenges like I/O overhead and mismatching machines, so maybe we can still benefit from the naive implementation, on a single machine.
>
> Basically, it splits the training data into partitions, one per thread, and each thread has a clone of the neural net. Each thread trains on its own partition, and then the trained nets are averaged together (mean average of corresponding weights in the nets). Then each thread is re-seeded with clones of the averaged net, rinse and repeat.

Interesting. Thanks for the explanation. 👍

> That said, the xor data is a poor use case for multithreaded training because it's so small and the local minima are pretty deep. There are only 4 training data points, so the Repl.it example with 8 cpu threads doesn't even have enough data for 1 training point per thread. I think that the only value in this example is just to show that it does converge at all.
>
> I'd like to run some benchmarks on a large data set to quantify the performance gains. Do you have a favorite large example data set that I can test it on? My personal use case is too messy to publish here.

Well, in that case I believe some image-data-based training, or something like this, would be helpful for testing this behavior: https://jsfiddle.net/8Lvynxz5/38/ I will do some testing in my free time.

Keep up the great work!

massaroni (Contributor, Author) commented on Jul 15, 2019

Thanks @mubaidr, I found some image data sets on these sites, below. I'll pick a good one and run some benchmarks. Based on my schedule this week, I'll probably have an update about my findings in a few days.

deeplearning.net - Datasets
skymind - Open Datasets

Rocketblaster247 commented on Jul 15, 2019

Anything to make training faster!

massaroni (Contributor, Author) commented on Aug 21, 2019

My findings:

It looks like the multithreaded trainer can get better performance than a single thread. The results are encouraging so far: with a low learning rate and good tuning, I'm getting ~10,000x better performance, which is very surprising, and with normal learning rates, I'm getting a roughly sub-linear performance gain with respect to thread count, as expected. This is by no means an exhaustive analysis, so I'm presenting it as an alpha. I could go nuts charting results in all different scenarios, but I think this is a good proof of concept so far, and I want to get some feedback about it first.

Methods:
In this benchmark, performance is measured both in wall-clock time and in item-iterations per thread (data size * iterations). Compared to single-threaded training, I expected multithreaded training to get a lower wall-clock time, more total iterations, and fewer item-iterations per thread. I expected the performance boost to be roughly linear, proportional to the number of threads.
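
As a concrete reading of that metric (the numbers here are made up for illustration):

  // item-iterations per thread = items in the thread's partition * iterations it ran
  const partitionSize = 1500;        // items sent to each thread
  const iterationsPerThread = 20;    // training runs on that thread
  const itemIterationsPerThread = partitionSize * iterationsPerThread; // 30000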

This feature is designed for a large data set or a large net, or both. So at first, I started testing on the MNIST Hand Written Digit database, which is rather large, and requires a large net to process it. However, my cycle time was too slow, so I had to find a smaller data set to work with. I ended up doing a simple math function approximation, because it was easy to generate the data on the fly and quick to test that it'll converge on a single thread. Then I dialed down the learning rate, simply to necessitate more training iterations so that we can show a more robust comparison. Note that I did commit the code that reads in the MNIST database, so that's still included in this branch.

Conclusions:

  • With a low learning rate, I'm getting multiple orders of magnitude better performance. With more common learning rates, I'm getting a modest linear performance boost, as expected.

  • Multithreaded training requires new tuning considerations, and with bad tuning, it could be slower than a single thread or not converge at all.

  • New tuning considerations:

  1. When your data set is partitioned, each partition should have enough data for a net to train on it in isolation and still converge. If you don't have enough redundancy in your data set, then you can configure the partitions to overlap. This also suggests that there's a limit to the number of threads you can utilize, depending on the data set and partitioning.

  2. You want to limit the iterations for trainer threads to a very low number, so that they can merge their results frequently enough. I had good results with a per-thread iteration limit of 10 or less; see the option sketch after this list.
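
Here is a hedged sketch of what those two tuning points look like as trainAsync() options, reusing the option names from the documentation section above; the concrete numbers are illustrative, not recommendations.

  const net = new brain.NeuralNetwork();
  net
    .trainAsync(data, {
      parallel: {
        threads: 4,
        // larger than data.length / 4, so neighbouring partitions overlap
        partitionSize: Math.ceil(data.length / 3),
        epochs: 10, // keep per-thread iterations low so threads merge often
      },
      // ... and the usual training options
    })
    .then(res => {
      // do something with my trained network
    })
    .catch(handleError);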

You can run the benchmarks yourself, with this script:
benchmark/index.js (based on master)
benchmark/index.js (based on develop)

node benchmark/

and you should get some results like this:

////// Benchmark Results //////
Single-thread 0.0001 LR
     runtime =  754 seconds
     item iterations per thread =  154380000
     error =  0.004999092472072885
     test error =  0.005060350515187244
2 Threads 0.0001 LR Overlapping Partitions
     runtime =  11 seconds
     item iterations per thread =  17600
     error =  0.004849751772040714
     test error =  0.006833252998457533
4 Threads 0.0001 LR Overlapping Partitions
     runtime =  6 seconds
     item iterations per thread =  10000
     error =  0.004585786514230911
     test error =  0.007337232050197451
Single-thread 0.001 LR
     runtime =  6 seconds
     item iterations per thread =  15000
     error =  0.004750009322000309
     test error =  0.00805988373712574
2 Threads 0.001 LR Overlapping Partitions
     runtime =  5 seconds
     item iterations per thread =  8000
     error =  0.0030617683728257285
     test error =  0.00814376965453241
Single-thread 0.01 LR
     runtime =  3 seconds
     item iterations per thread =  5000
     error =  0.000601457041516575
     test error =  0.014998172604486229
2 Threads 0.01 LR Overlapping Partitions
     runtime =  1 seconds
     item iterations per thread =  3200
     error =  0.0011902072636906383
     test error =  0.00786050391168138

Thoughts?

robertleeplummerjr (Contributor) commented on Aug 22, 2019

I think this is simply fantastic! I'm trying to focus my efforts on getting GPU support and the new API for network composition finished and tested. I say continue, and when you get your end more polished, and mine as well, we'll converge?

Curious, in brain we use "iterations", would you be opposed to changing "epochs" to match? That is of course if they are synonymous.

robertleeplummerjr (Contributor) commented on Aug 22, 2019

The number of typos that I had to correct from my own typing had me believing my phone to be possessed...

Joinfield commented on Aug 22, 2019

Just amazing! Especially multithreaded GPU training!

7 remaining items

mubaidr (Contributor) commented on May 10, 2020

It's in beta and expected to be released very soon.

We do have plans to implement this too. This might help users with powerful CPUs but no GPU.

goferito commented on May 10, 2020

If I understood it correctly, it even allows using both the CPU and GPU, right? That would be really cool. Thanks a lot, guys, for doing this.

mubaidr (Contributor) commented on May 10, 2020

Yes, exactly. But how much it will actually affect performance when using the GPU (the GPU is already many times faster than the CPU) is yet to be seen.

bor8 commented on Jul 11, 2020

I would like to donate something if I could use four cores at the same time instead of one, in my case! Is this also intended for LSTM (not LSTMTimeStep)?

blackforest-t commented on Oct 2, 2020

> I would like to donate something if I could use four cores at the same time instead of one, in my case! Is this also intended for LSTM (not LSTMTimeStep)?

I'd like to know about it too.

unicorn-style commented on Nov 29, 2021

Problems... 1 thread is faster than more than 1.

VM with 4 vCPUs on ESXi
Node 16
I had to change the module type to CommonJS to get it working.
Trying my dataset and my script, 1 thread is faster than when you set the configuration with .... parallel...

Any thoughts?

////// Benchmark Results //////
LSTMTimeStep 4 Threads 0.00005 LR
     runtime =  2002 seconds
     item iterations per thread =  80675
     error =  0.00019999808332483684
     test error =  0.01930622798077797
LSTMTimeStep 3 Threads 0.00005 LR
     runtime =  2357 seconds
     item iterations per thread =  108120
     error =  0.0001999364159483876
     test error =  0.01775701457418095
LSTMTimeStep 2 Threads 0.00005 LR
     runtime =  716 seconds
     item iterations per thread =  53340
     error =  0.0001997283394060407
     test error =  0.01856114064901209
LSTMTimeStep Single Thread 0.00005 LR
     runtime =  3973 seconds
     item iterations per thread =  95746200
     error =  0.00019995401834603388
     test error =  0.014280172646737196
2 Threads 0.0001 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  19200
     error =  0.004623700066624818
     test error =  0.006383759797960875
4 Threads 0.0001 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  15500
     error =  0.003654747346571941
     test error =  0.007451617616469838
Single-thread 0.001 LR
     runtime =  0 seconds
     item iterations per thread =  0
     error =  0.003913228313435196
     test error =  0.007906918111712248
2 Threads 0.001 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  8000
     error =  0.004012907401541029
     test error =  0.006790354935259165
Single-thread 0.01 LR
     runtime =  0 seconds
     item iterations per thread =  0
     error =  0.0005127324836436953
     test error =  0.014539566666571174
2 Threads 0.01 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  3200
     error =  0.00024371662900630114
     test error =  0.008307541957539067

richiedevs commented on Jul 5, 2022

I'd love Multithreaded Training

imkane commented on Nov 27, 2022

Can't wait for this to be available for LSTMTimeStep 😁

shestakov-vladyslav commented on Feb 16, 2023

Desired

gokaybiz commented on Nov 7, 2023

> I would like to donate something if I could use four cores at the same time instead of one, in my case! Is this also intended for LSTM (not LSTMTimeStep)?

Same here in 2023
