@huyuanfeng2018 huyuanfeng2018 commented Oct 27, 2025

What is the purpose of the change

Optimize CDC binlog split lookup from O(n) to O(log n) using binary search.

Brief change log

  • Add sortFinishedSplitInfos() and findSplitByKeyBinary() methods
  • Update BinlogSplitReader to use binary search instead of linear search
  • Add comprehensive unit tests

Performance Impact

  • Time Complexity: O(n) → O(log n)
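
For readers who want to see the shape of the lookup, here is a minimal sketch of a sort step followed by a binary search over half-open chunk ranges. It is not the code merged in this PR: the `Split` class, its `Long` bounds, and the `sortSplits` / `findSplit` names are hypothetical stand-ins for `FinishedSnapshotSplitInfo`, `sortFinishedSplitInfos()` and `findSplitByKeyBinary()`, and it assumes the splits do not overlap and that a `null` bound means "unbounded".

```java
import java.util.ArrayList;
import java.util.List;

final class SplitLookupSketch {

    /** Hypothetical stand-in for a finished split covering [start, end). */
    static final class Split {
        final String id;
        final Long start; // inclusive lower bound, null = no lower bound (assumption)
        final Long end;   // exclusive upper bound, null = no upper bound (assumption)

        Split(String id, Long start, Long end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns a copy sorted by lower bound; the split without a lower bound goes first. */
    static List<Split> sortSplits(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> {
            if (a.start == null) {
                return b.start == null ? 0 : -1;
            }
            if (b.start == null) {
                return 1;
            }
            return Long.compare(a.start, b.start);
        });
        return sorted;
    }

    /** Binary search for the split whose range contains the key; null if none does. */
    static Split findSplit(List<Split> sortedSplits, long key) {
        if (sortedSplits == null || sortedSplits.isEmpty()) {
            return null;
        }
        int lo = 0;
        int hi = sortedSplits.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Split s = sortedSplits.get(mid);
            if (s.start != null && key < s.start) {
                hi = mid - 1; // key lies before this split
            } else if (s.end != null && key >= s.end) {
                lo = mid + 1; // key lies after this split
            } else {
                return s; // start <= key < end
            }
        }
        return null;
    }
}
```

Sorting once costs O(n log n), so the gain comes from reusing the sorted list across many lookups, each of which is then O(log n).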

Verifying this change

Added 6 new unit tests covering various scenarios including edge cases and consistency verification with linear search.
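
As an illustration of what a consistency check against linear search can look like, here is a small self-contained sketch. It is not one of the six tests in this PR: the class, the method names, and the use of plain `long` arrays for chunk bounds are all assumptions made for the example, and the ranges are finite and contiguous for simplicity.

```java
import java.util.Random;

final class LookupConsistencySketch {

    /** Reference implementation: scan every range [starts[i], ends[i]) in order. */
    static int linearFind(long[] starts, long[] ends, long key) {
        for (int i = 0; i < starts.length; i++) {
            if (key >= starts[i] && key < ends[i]) {
                return i;
            }
        }
        return -1;
    }

    /** Binary search over the same ranges, which must be sorted and non-overlapping. */
    static int binaryFind(long[] starts, long[] ends, long key) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (key < starts[mid]) {
                hi = mid - 1;
            } else if (key >= ends[mid]) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Returns the number of random keys on which the two lookups disagree. */
    static int countMismatches(int splitCount, long width, int probes, long seed) {
        long[] starts = new long[splitCount];
        long[] ends = new long[splitCount];
        for (int i = 0; i < splitCount; i++) {
            starts[i] = i * width;
            ends[i] = (i + 1) * width;
        }
        Random random = new Random(seed);
        int mismatches = 0;
        for (int p = 0; p < probes; p++) {
            // Probe slightly outside the covered range as well, to exercise the "not found" path.
            long key = (long) (random.nextDouble() * (splitCount + 2) * width) - width;
            if (linearFind(starts, ends, key) != binaryFind(starts, ends, key)) {
                mismatches++;
            }
        }
        return mismatches;
    }
}
```

A test built on this would assert that `countMismatches(...)` returns 0 for several split counts, including 0 and 1.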

@huyuanfeng2018
Contributor Author

Hi @ruanhang1993, could you find time to review this PR? Thank you very much~

@loserwang1024
Contributor

@huyuanfeng2018 Please also add this improvement to the base framework.

if (sortedSplits == null || sortedSplits.isEmpty()) {
    return null;
}

Contributor

Current PR is already in a good shape, I only have two possible suggestions:

(1) Considering that most tables use auto-increment primary keys and that INSERT operations outnumber UPDATE operations in the binlog, the majority of events in the binlog will typically correspond to either the last split or the first split. Leveraging this data locality, we can further optimize by first checking whether the changelog matches the first or last split and then performing binary search on the remaining splits.

(2) We can also apply the same optimization to IncrementalSourceStreamFetcher, which is used by the other CDC connectors.

WDYT? @huyuanfeng2018
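
Suggestion (1) can be sketched as follows: check the last and first splits before falling back to binary search on the rest. This is only an illustration under stated assumptions, not the code that was merged. The class and method names are hypothetical, chunk bounds are modelled as `long` arrays of sorted, non-overlapping ranges `[starts[i], ends[i])`, and `Long.MIN_VALUE` / `Long.MAX_VALUE` are used as "unbounded" sentinels.

```java
final class BoundaryFirstLookupSketch {

    /** True if key falls in [starts[i], ends[i]); Long.MAX_VALUE as an end means "unbounded". */
    static boolean contains(long[] starts, long[] ends, int i, long key) {
        return key >= starts[i] && (key < ends[i] || ends[i] == Long.MAX_VALUE);
    }

    /** Returns the index of the split containing key, or -1 if no split contains it. */
    static int findSplitIndex(long[] starts, long[] ends, long key) {
        int n = starts.length;
        if (n == 0) {
            return -1;
        }
        // Fast path: with auto-increment keys, new inserts land in the last split.
        if (contains(starts, ends, n - 1, key)) {
            return n - 1;
        }
        // Second fast path: the first split.
        if (contains(starts, ends, 0, key)) {
            return 0;
        }
        // Fallback: binary search over the interior splits only.
        int lo = 1;
        int hi = n - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (key < starts[mid]) {
                hi = mid - 1;
            } else if (key >= ends[mid]) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```

When the workload is dominated by inserts on an auto-increment key, most lookups return from the first check; other keys pay at most two extra comparisons before the binary search.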

@huyuanfeng2018
Contributor Author

huyuanfeng2018 commented Nov 12, 2025

Thanks to @loserwang1024 and @leonardBang for the review. I totally agree with this optimization for the auto-increment scenario. In that case a single comparison is usually enough, which is very cool. I have also added this optimization to the other sources, except for MongoDB and Oracle. Please take another look when you have time~

@huyuanfeng2018
Contributor Author

CI failed. The Apache download CDN has already removed the Flink 1.20.1 binary package. Perhaps we need to upgrade the dependency version.

* @param key The chunk key to search for
* @return The split containing the key, or null if not found
*/
public static FinishedSnapshotSplitInfo findSplitByKeyBinary(
Contributor

Thanks @huyuanfeng2018 for the update, it looks good. One minor comment: this Utils class is getting a little long after this PR. Could we extract a new Utils class, such as SplitKeyUtils, to make the code more readable?

Contributor Author

Thanks for the quick review~

Done.

@leonardBang
Contributor

> CI failed. The Apache download CDN has already removed the Flink 1.20.1 binary package. Perhaps we need to upgrade the dependency version.

@huyuanfeng2018 Could you append a commit that bumps the Flink version to 1.20.3 to fix this issue? I've checked that https://dlcdn.apache.org/flink/flink-1.20.3/ should be okay.

@huyuanfeng2018
Contributor Author

> > CI failed. The Apache download CDN has already removed the Flink 1.20.1 binary package. Perhaps we need to upgrade the dependency version.
>
> @huyuanfeng2018 Could you append a commit that bumps the Flink version to 1.20.3 to fix this issue? I've checked that https://dlcdn.apache.org/flink/flink-1.20.3/ should be okay.

OK.

@huyuanfeng2018 huyuanfeng2018 marked this pull request as draft November 12, 2025 07:02
@leonardBang
Contributor

@huyuanfeng2018 I like your community cooperation style. Now we can rebase this PR onto the latest master and convert it back to a regular PR.

@huyuanfeng2018
Contributor Author

> @huyuanfeng2018 I like your community cooperation style. Now we can rebase this PR onto the latest master and convert it back to a regular PR.

Already rebased onto master. Once CI finishes, I will change the PR status from Draft to Ready.

@huyuanfeng2018 huyuanfeng2018 changed the title from "[FLINK-38568][mysql-cdc] Optimize binlog split lookup using binary search" to "[FLINK-38568] [mysql-cdc] [cdc-base] Optimize binlog split lookup using binary search" on Nov 12, 2025
@huyuanfeng2018 huyuanfeng2018 marked this pull request as ready for review November 13, 2025 01:39
@leonardBang leonardBang left a comment

+1

@leonardBang leonardBang merged commit 7a6bfd8 into apache:master Nov 13, 2025
31 checks passed
3 participants