
[FLINK-27286] Add infra to support training high dimension models #251

Closed
wants to merge 1 commit into from

Conversation


@zhipeng93 zhipeng93 commented Aug 15, 2023

What is the purpose of the change

This PR introduces a communication infra that can distribute model parameters across multiple servers, so as to support training high-dimensional models in Flink ML. It also introduces a programming API that simplifies writing iterative machine learning training processes. The motivating cases include:

  • Train a high-dimensional model whose model data cannot fit on a single machine.
  • Support the case where the model data of each key is a double or double[].
  • Support model data with discrete indices.
  • Use a set of indices to retrieve the model values.
  • Use a set of indices and a UDF to retrieve statistics computed from the model values.
  • Use a set of index/value pairs to update the model values.
  • Use a set of index/value pairs to update the model, with user-specified aggregation logic.
  • Support AllReduce/ReduceScatter for Java generics.
  • Support executing AllReduce/ReduceScatter every K iterations.
  • Support the case where the user can access the training data in each iteration.
  • Output model data to downstream tasks.
  • Output intermediate results to downstream tasks.
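To make the push/pull cases above concrete, here is a minimal, self-contained sketch of how sparse model data with discrete keys can be hash-partitioned across servers and accessed by index sets. All names here (ShardedModelSketch, serverOf, etc.) are illustrative assumptions, not the actual classes introduced by this PR:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: model parameters are hash-partitioned across servers,
// so the full model never needs to fit on a single machine. Workers push
// sparse updates and pull only the values they need in each iteration.
public class ShardedModelSketch {
    private final int numServers;
    private final List<Map<Long, Double>> shards = new ArrayList<>();

    public ShardedModelSketch(int numServers) {
        this.numServers = numServers;
        for (int i = 0; i < numServers; i++) {
            shards.add(new HashMap<>());
        }
    }

    // Each discrete key is routed to exactly one server.
    private int serverOf(long key) {
        return (int) Math.floorMod(key, (long) numServers);
    }

    // Push: update model values with a user-specified aggregation (here: sum).
    public void push(long[] keys, double[] values) {
        for (int i = 0; i < keys.length; i++) {
            shards.get(serverOf(keys[i])).merge(keys[i], values[i], Double::sum);
        }
    }

    // Pull: retrieve the values for a sparse set of indices.
    public double[] pull(long[] keys) {
        double[] result = new double[keys.length];
        for (int i = 0; i < keys.length; i++) {
            result[i] = shards.get(serverOf(keys[i])).getOrDefault(keys[i], 0.0);
        }
        return result;
    }
}
```

In the real infra the shards would live in separate ServerOperator instances and push/pull would be network messages; this sketch only shows the key-routing and sparse-access pattern.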

Brief change log

  • Introduced WorkerOperator and ServerOperator to execute the training logic and maintain the model data, respectively.
  • Added MLSession to store the information that stays alive throughout the training process on the WorkerOperator.
  • Added Message to represent the information that can be transferred between workers and servers.
  • Added IterationStageList and IterationStage to describe the iterative machine learning process as a chain of computation/communication stages, together with a computation stage (ProcessStage) and commonly used communication stages (Push/Pull/AllReduce/ReduceScatter).
  • Added ModelUpdater to let developers describe the model-updating logic.
  • Added TrainingUtils to ease the programming of the iterative machine learning process.
  • Added SharedLongArray and SharedDoubleArray to allow developers to reuse memory across different iterations.
  • Added unit tests to cover the above functionality.
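The stage-chain idea can be sketched as follows. This is a simplified, self-contained illustration of describing training as a list of stages executed repeatedly; the class and method names (StageChainSketch, addStage, etc.) are assumptions for illustration and do not match the PR's actual IterationStageList API, which also carries communication stages like Push/Pull/AllReduce:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: an iteration body is a chain of stages, each
// transforming the worker-local state; the chain is run for a fixed
// number of iterations.
public class StageChainSketch {
    // A stage maps the current state to the next state.
    interface Stage extends Function<double[], double[]> {}

    private final List<Stage> stages = new ArrayList<>();
    private int maxIterations;

    public StageChainSketch addStage(Stage s) {
        stages.add(s);
        return this;
    }

    public StageChainSketch setMaxIterations(int n) {
        this.maxIterations = n;
        return this;
    }

    // Run the whole stage chain maxIterations times over the state.
    public double[] run(double[] state) {
        for (int iter = 0; iter < maxIterations; iter++) {
            for (Stage s : stages) {
                state = s.apply(state);
            }
        }
        return state;
    }
}
```

For example, a single stage that halves every entry, run for three iterations, turns {8.0} into {1.0}; in the real infra such a computation stage would be interleaved with Push/Pull stages against the servers.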

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@zhipeng93 zhipeng93 changed the title [Draft][FLINK-27286] Add communication infra to support training high dimens… [FLINK-27286] Add communication infra to support training high dimension models Aug 15, 2023

zhipeng93 commented Aug 15, 2023

Hi @lindong28 @Fanoid @weibozhao @jiangxin369 , can you help to review this PR?

The previous PR [1] is a bit complicated since it involves too many changes (i.e., infra, vectors, and algorithms), so I extracted the infra into this separate PR. The comments left on [1] are addressed here.

[1] #237

@zhipeng93 zhipeng93 changed the title [FLINK-27286] Add communication infra to support training high dimension models [FLINK-27286] Add infra to support training high dimension models Aug 15, 2023
@zhipeng93 zhipeng93 closed this Nov 9, 2023