
[FLINK-27286] Add infra to support training high dimension models #251

Closed
wants to merge 1 commit into from

Conversation


@zhipeng93 zhipeng93 commented Aug 15, 2023

What is the purpose of the change

This PR introduces a communication infra that can distribute model parameters across multiple servers, so as to support training high-dimensional models in Flink ML. It also introduces a programming API that simplifies writing iterative machine learning training processes. The motivating cases include:

  • Train a high-dimensional model whose model data cannot fit on a single machine.
  • Support the case where the model data of each key is a double or double[].
  • Support model data with discrete indices.
  • Use a set of indices to retrieve the model values.
  • Use a set of indices and a UDF to retrieve statistics computed from the model values.
  • Use a set of index/value pairs to update the model values.
  • Use a set of index/value pairs to update the model, with user-specified aggregation logic.
  • Support AllReduce/ReduceScatter for Java generics.
  • Support executing AllReduce/ReduceScatter every K iterations.
  • Support the case where the user can access the training data in each iteration.
  • Output model data to downstream tasks.
  • Output intermediate results to downstream tasks.
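To make the push/pull cases above concrete, here is a minimal, self-contained sketch of how sparse model data with discrete keys can be hash-partitioned across servers and accessed by index sets. All names here (ShardedModelSketch, serverOf, etc.) are illustrative assumptions, not the actual classes introduced by this PR:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: model parameters are hash-partitioned across servers,
// so the full model never needs to fit on a single machine. Workers push
// sparse updates and pull only the values they need in each iteration.
public class ShardedModelSketch {
    private final int numServers;
    private final List<Map<Long, Double>> shards = new ArrayList<>();

    public ShardedModelSketch(int numServers) {
        this.numServers = numServers;
        for (int i = 0; i < numServers; i++) {
            shards.add(new HashMap<>());
        }
    }

    // Each discrete key is routed to exactly one server.
    private int serverOf(long key) {
        return (int) Math.floorMod(key, (long) numServers);
    }

    // Push: update model values with a user-specified aggregation (here: sum).
    public void push(long[] keys, double[] values) {
        for (int i = 0; i < keys.length; i++) {
            shards.get(serverOf(keys[i])).merge(keys[i], values[i], Double::sum);
        }
    }

    // Pull: retrieve the values for a sparse set of indices.
    public double[] pull(long[] keys) {
        double[] result = new double[keys.length];
        for (int i = 0; i < keys.length; i++) {
            result[i] = shards.get(serverOf(keys[i])).getOrDefault(keys[i], 0.0);
        }
        return result;
    }
}
```

In the real infra the shards would live in separate ServerOperator instances and push/pull would be network messages; this sketch only shows the key-routing and sparse-access pattern.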

Brief change log

  • Introduced WorkerOperator and ServerOperator to execute the training logic and maintain the model data, respectively.
  • Added MLSession to store the information that stays alive throughout the training process on the WorkerOperator.
  • Added Message to represent the information that can be transferred between workers and servers.
  • Added IterationStageList and IterationStage to describe the iterative machine learning process as a chain of computation/communication stages, together with a computation stage (ProcessStage) and commonly used communication stages (Push/Pull/AllReduce/ReduceScatter).
  • Added ModelUpdater to let developers describe the model-updating logic.
  • Added TrainingUtils to ease the programming of the iterative machine learning process.
  • Added SharedLongArray and SharedDoubleArray to allow developers to reuse memory across different iterations.
  • Added unit tests to cover the above functionality.
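The stage-chain idea can be sketched as follows. This is a simplified, self-contained illustration of describing training as a list of stages executed repeatedly; the class and method names (StageChainSketch, addStage, etc.) are assumptions for illustration and do not match the PR's actual IterationStageList API, which also carries communication stages like Push/Pull/AllReduce:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: an iteration body is a chain of stages, each
// transforming the worker-local state; the chain is run for a fixed
// number of iterations.
public class StageChainSketch {
    // A stage maps the current state to the next state.
    interface Stage extends Function<double[], double[]> {}

    private final List<Stage> stages = new ArrayList<>();
    private int maxIterations;

    public StageChainSketch addStage(Stage s) {
        stages.add(s);
        return this;
    }

    public StageChainSketch setMaxIterations(int n) {
        this.maxIterations = n;
        return this;
    }

    // Run the whole stage chain maxIterations times over the state.
    public double[] run(double[] state) {
        for (int iter = 0; iter < maxIterations; iter++) {
            for (Stage s : stages) {
                state = s.apply(state);
            }
        }
        return state;
    }
}
```

For example, a single stage that halves every entry, run for three iterations, turns {8.0} into {1.0}; in the real infra such a computation stage would be interleaved with Push/Pull stages against the servers.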

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@zhipeng93 zhipeng93 changed the title [Draft][FLINK-27286] Add communication infra to support training high dimens… [FLINK-27286] Add communication infra to support training high dimension models Aug 15, 2023

zhipeng93 commented Aug 15, 2023

Hi @lindong28 @Fanoid @weibozhao @jiangxin369 , can you help to review this PR?

The previous PR [1] is a bit complicated since it involves too many changes (i.e., infra, vectors, and algorithms), so I extracted the infra into this separate PR. The comments left on [1] are addressed here.

[1] #237

@zhipeng93 zhipeng93 changed the title [FLINK-27286] Add communication infra to support training high dimension models [FLINK-27286] Add infra to support training high dimension models Aug 15, 2023
@zhipeng93 zhipeng93 closed this Nov 9, 2023