Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-31753] Support DataStream CoGroup in stream mode with similar performance as DataSet CoGroup #230

Merged
merged 1 commit into from
Apr 12, 2023

Conversation

lindong28
Copy link
Member

@lindong28 lindong28 commented Apr 7, 2023

What is the purpose of the change

Add util methods that allow algorithm developers to co-group two DataStreams with the same semantics and similar performance as DataSet#coGroup(...)

Here are the results of running the benchmark specified in FLINK-31753's JIRA description:

  • DataSet#coGroup takes 27.6 seconds.
  • DataStreamUtils#coGroup takes 31.5 seconds.

The DataStream is roughly 12.3% slower than DataSet. The performance difference should be negligible for real-word applications whose co-group function is non-trivial.

Notes:

  • The benchmark is run with Flink 1.18 snapshot.
  • Most classes under the sort folder are copied from the corresponding classes in apache/flink. We can remove these classese from apache/flink-ml after making the corresponding classes in apache/flink public.

Brief change log

Added the static method DataStreamUtils#coGroup(...).

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@lindong28 lindong28 changed the title [FLINK-31753] Support DataStream CoGroup in stream Mode with similar performance as DataSet CoGroup [FLINK-31753] Support DataStream CoGroup in stream mode with similar performance as DataSet CoGroup Apr 7, 2023
@lindong28
Copy link
Member Author

@zhipeng93 Can you help review this PR?

@lindong28 lindong28 force-pushed the FLINK-31753 branch 4 times, most recently from 837fd3a to 514b8b3 Compare April 10, 2023 06:20
Copy link
Contributor

@zhipeng93 zhipeng93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR :) I Left some comments below. Please take a look.

@zhipeng93
Copy link
Contributor

Thanks for the update. LGTM.

@lindong28
Copy link
Member Author

@zhipeng93 Thanks for the review.

@lindong28 lindong28 merged commit d7c9c8b into apache:master Apr 12, 2023
zhipeng93 pushed a commit to zhipeng93/flink-ml that referenced this pull request Apr 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants