[FLINK-38568] [mysql-cdc] [cdc-base] Optimize binlog split lookup using binary search #4166
Conversation
Hi @ruanhang1993, could you find time to review this PR? Thank you very much~
@huyuanfeng2018 please also add this improvement to the base framework
    if (sortedSplits == null || sortedSplits.isEmpty()) {
        return null;
    }
The current PR is already in good shape; I only have two possible suggestions:
(1) Considering that most tables use auto-increment primary keys and that INSERT operations outnumber UPDATE operations in the binlog, the majority of binlog events will typically correspond to either the last split or the first split. Leveraging this data locality, we can further optimize by first checking whether the changelog matches the first or last split and then performing binary search on the remaining splits (see the sketch after this comment).
(2) We can also apply the same optimization to IncrementalSourceStreamFetcher, which is used by the other CDC connectors.
WDYT? @huyuanfeng2018
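A minimal sketch of the fast path described in (1), using simplified stand-in types rather than the PR's actual code (SplitRange and its numeric bounds are hypothetical; the real connector works with FinishedSnapshotSplitInfo and composite chunk keys):

```java
import java.util.List;

// Hypothetical stand-in for FinishedSnapshotSplitInfo: each split covers [start, end).
record SplitRange(long start, long end) {
    boolean contains(long key) {
        return key >= start && key < end;
    }
}

class SplitLookup {
    // With auto-increment keys, most binlog events land in the last split (new
    // inserts) or the first split, so check those before the binary search.
    static SplitRange findSplit(List<SplitRange> sortedSplits, long key) {
        if (sortedSplits == null || sortedSplits.isEmpty()) {
            return null;
        }
        SplitRange last = sortedSplits.get(sortedSplits.size() - 1);
        if (last.contains(key)) {
            return last;
        }
        SplitRange first = sortedSplits.get(0);
        if (first.contains(key)) {
            return first;
        }
        // Fall back to an O(log n) binary search over the splits in between.
        int lo = 1;
        int hi = sortedSplits.size() - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            SplitRange candidate = sortedSplits.get(mid);
            if (candidate.contains(key)) {
                return candidate;
            } else if (key < candidate.start()) {
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return null;
    }
}
```

The fast path costs at most two extra range checks in the worst case, while the common auto-increment INSERT case resolves with a single comparison against the last split.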
Thanks to @loserwang1024 and @leonardBang for the review. I totally agree with this optimization for the auto-increment scenario; in that case, almost only one comparison is needed, which is very cool. I have also added this optimization to the other sources, except for MongoDB and Oracle. Please take the time to review it again~
Force-pushed from 26fd827 to e5aeeb1
CI failed. The Apache download CDN has already removed the Flink 1.20.1 binaries package. Perhaps we need to upgrade the dependency version.
     * @param key The chunk key to search for
     * @return The split containing the key, or null if not found
     */
    public static FinishedSnapshotSplitInfo findSplitByKeyBinary(
Thanks @huyuanfeng2018 for the update, it looks good. One minor comment: this Utils class is a little long after this PR; could we extract a new utils class like SplitKeyUtils to make the code more readable?
Thanks for the quick review~
done.
Force-pushed from cbd87e4 to 13c7594
@huyuanfeng2018 Could you append a commit to bump the Flink version to 1.20.3 to fix this issue? I've checked that https://dlcdn.apache.org/flink/flink-1.20.3/ should be okay.
ok
Force-pushed from 13c7594 to 168b487
@huyuanfeng2018 I like your community cooperation style. Now we can rebase this PR onto the latest master and convert it to a normal one.
Already rebased onto master. Once CI finishes, I will change the PR status from Draft to Ready.
leonardBang left a comment
+1
What is the purpose of the change
Optimize CDC binlog split lookup from O(n) to O(log n) using binary search.
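A minimal sketch of the sort-once / binary-search-per-record idea with simplified types; the names SnapshotSplit and SplitIndex below are hypothetical, and only the method names sortFinishedSplitInfos()/findSplitByKeyBinary() in the change log come from the PR:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Simplified stand-in for FinishedSnapshotSplitInfo with a single numeric chunk key.
record SnapshotSplit(String splitId, long startKey, long endKey) {}

class SplitIndex {
    private final List<SnapshotSplit> sorted;

    // Sort once when the binlog reader starts, so each record lookup is O(log n).
    SplitIndex(List<SnapshotSplit> finishedSplits) {
        this.sorted = new ArrayList<>(finishedSplits);
        this.sorted.sort(Comparator.comparingLong(SnapshotSplit::startKey));
    }

    // Binary-search the split whose [startKey, endKey) range contains the key;
    // previously this required a linear scan over all finished splits.
    SnapshotSplit findSplitByKey(long key) {
        int idx = Collections.binarySearch(
                sorted,
                new SnapshotSplit("probe", key, key),
                Comparator.comparingLong(SnapshotSplit::startKey));
        if (idx < 0) {
            idx = -idx - 2; // greatest split whose startKey <= key
        }
        if (idx < 0) {
            return null;
        }
        SnapshotSplit candidate = sorted.get(idx);
        return key >= candidate.startKey() && key < candidate.endKey() ? candidate : null;
    }
}
```

Compared with scanning every finished split for each binlog record, this drops the per-record lookup cost from O(n) to O(log n), which matters when a large table has been chunked into many snapshot splits.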
Brief change log
Added sortFinishedSplitInfos() and findSplitByKeyBinary() methods
Updated BinlogSplitReader to use binary search instead of linear search
Performance Impact
Verifying this change
Added 6 new unit tests covering various scenarios including edge cases and consistency verification with linear search.
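A sketch of the kind of consistency check described above, comparing a binary-search lookup against a brute-force linear scan over randomly generated keys (hypothetical names and data; not the PR's actual test code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Random;

// Hypothetical consistency check: binary search must agree with a linear scan.
class SplitLookupConsistencyCheck {

    record Split(long startKey, long endKey) {
        boolean contains(long key) {
            return key >= startKey && key < endKey;
        }
    }

    static Split linearFind(List<Split> splits, long key) {
        for (Split s : splits) {
            if (s.contains(key)) {
                return s;
            }
        }
        return null;
    }

    static Split binaryFind(List<Split> sortedSplits, long key) {
        int lo = 0;
        int hi = sortedSplits.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Split s = sortedSplits.get(mid);
            if (s.contains(key)) {
                return s;
            } else if (key < s.startKey()) {
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < 1000; start += 10) {
            splits.add(new Split(start, start + 10)); // contiguous, already sorted
        }
        Random random = new Random(42);
        for (int i = 0; i < 10_000; i++) {
            long key = random.nextInt(1100) - 50; // include keys outside all splits
            Split expected = linearFind(splits, key);
            Split actual = binaryFind(splits, key);
            if (!Objects.equals(expected, actual)) {
                throw new AssertionError("Mismatch for key " + key);
            }
        }
        System.out.println("binary search matches linear scan");
    }
}
```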