[FLINK-27286] Add infra to support training high dimension models #251
+3,783
−0
What is the purpose of the change
This PR introduces communication infrastructure that distributes model parameters across multiple servers, so that Flink ML can train high-dimensional models. It also introduces a programming API that simplifies writing iterative machine learning training processes. The motivation cases include:
Brief change log
- `WorkerOperator` and `ServerOperator` to execute the training logic and maintain the model data, respectively.
- `MLSession` to store the information that stays alive during the training process on `WorkerOperator`.
- `Message` to represent the information that can be transferred among workers and servers.
- `IterationStageList` and `IterationStage` to describe the iterative machine learning process as a chain of computation/communication stages. A computation stage (`ProcessStage`) and commonly used communication stages (Push/Pull/AllReduce/ReduceScatter) are also added.
- `ModelUpdater` to let developers describe the model updating logic.
- `TrainingUtils` to ease the programming of iterative machine learning processes.
- `SharedLongArray` and `SharedDoubleArray` to allow developers to reuse memory across different iterations.
Does this pull request potentially affect one of the following parts:
- `@Public(Evolving)`: (yes / no)
Documentation
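As a rough illustration of the worker/server split described in the change log, the sketch below simulates in a single process how a model could be range-partitioned across servers and accessed through pull and push calls. It is not the code added by this PR: `ParameterServerSketch`, `Shard`, `pull`, `push`, and the plain gradient-descent update are all hypothetical names and choices made for the example, and the real operators exchange messages through Flink rather than direct method calls.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names come from the Flink ML API. */
public class ParameterServerSketch {

    /** One server's slice of the model: indices [start, start + values.length). */
    static final class Shard {
        final long start;
        final double[] values;

        Shard(long start, int length) {
            this.start = start;
            this.values = new double[length];
        }
    }

    final Shard[] shards;
    final long shardSize;

    ParameterServerSketch(long modelDim, int numServers) {
        // Ceiling division so that every model index maps to exactly one shard.
        this.shardSize = (modelDim + numServers - 1) / numServers;
        this.shards = new Shard[numServers];
        for (int s = 0; s < numServers; s++) {
            long start = s * shardSize;
            long length = Math.max(0, Math.min(shardSize, modelDim - start));
            shards[s] = new Shard(start, (int) length);
        }
    }

    /** Worker side: fetch the current values of the requested model indices. */
    double[] pull(long[] indices) {
        double[] result = new double[indices.length];
        for (int i = 0; i < indices.length; i++) {
            Shard shard = shards[(int) (indices[i] / shardSize)];
            result[i] = shard.values[(int) (indices[i] - shard.start)];
        }
        return result;
    }

    /** Worker side: send gradients; each shard applies a plain SGD step (an assumption). */
    void push(long[] indices, double[] gradients, double learningRate) {
        for (int i = 0; i < indices.length; i++) {
            Shard shard = shards[(int) (indices[i] / shardSize)];
            shard.values[(int) (indices[i] - shard.start)] -= learningRate * gradients[i];
        }
    }

    public static void main(String[] args) {
        // A 10-dimensional model spread over 4 simulated servers.
        ParameterServerSketch ps = new ParameterServerSketch(10, 4);
        long[] indices = {1, 9};
        ps.push(indices, new double[] {2.0, -4.0}, 0.5);
        System.out.println(Arrays.toString(ps.pull(indices))); // prints [-1.0, 2.0]
    }
}
```

Only the indices a worker actually touches are pulled or pushed, which is why this pattern suits sparse, high-dimensional models: no single node needs to hold the full parameter vector.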