Feature: MERGE/Upsert Support #1534
Closed
Commits (54)
All commits are by mattmartin14.

- fccb74b: test
- 7298589: unit testing
- 25bc9cf: adding unit tests
- af6c868: adding unit tests
- 94be807: adding unit tests
- 269d9f5: adding unit tests
- f44c61a: adding unit tests
- a96fdf9: finished unit tests
- fa5ab35: removed unnecesary return
- cfa2277: updated poetry manifest list for datafusion package dependency
- 35f29be: added license headers, cleaned up dead code
- 6c68d0d: updated the merge function to use bools for matched and not matched rows
- 2d1e8ae: incorporated changes for boolExpression. It simplified the filters a lot
- f988f25: moved the filter build function to a separate function to accomodate …
- 43393b4: removed unneccessary comment
- 9a561b4: removed test files
- 9ef39a6: bug fixes and removed some more dependency on datafusion
- 2ba1ed6: updated various items including adding a dataclass return result
- a42eecd: updated merge_rows to remove dependency from datafusion! wahoo
- 1305f58: renamed merge_rows to upsert, removed unnecessary code. will put in f…
- b2be3db: adding params to unit testing for pytest; having some errors
- f5688ad: fixed bugs on unit testing; added context wrapper for txn; fixed vari…
- 7d55a4e: bug fixes
- 2e14767: updated some error throwing items
- 85c5848: moved datafusion to just a dev dependency in poetry toml
- 6472071: updated UpsertRow class to be recognized in the return statement
- 51c34da: removed some spaces and streamlined assert statements in unit testing
- 862a69a: updated test cases to use an InMemory catalog
- 3731b86: updated some formatting; added more commentary on the rows_to_update …
- bbb35d6: rebased poetry lock file and pyproject.toml file; removed sf repo info
- c8189c9: Merge branch 'main' into main
- 02af4d4: updated equality checks with not instead of == false
- cc75192: ran ruff check --fix
- 998d98b: manually added lint fixes and updated poetry toml and lock files. tha…
- 513c839: added formatting fices
- 0fd6446: remove the node_modules
- 5fc3478: updated code for another round of fixes
- 6cef789: removed npm uneeded files
- 40b69b8: fixed formatting on upsert function for docs build
- 804c526: Merge branch 'main' into main
- 09e0347: rebased for poetry lock files
- ca2d904: updated lock files. thanks kevin
- 77375fb: fixed other changes
- ba4db49: fixed gitignore file
- 622e66c: no whitespace
- 9e79dad: fixed vendor fb file from kevins changes
- 4cbf3e3: reverting vendor changes
- 5333a1e: removing node modules
- 11a25be: updating vendor files
- 03a8d10: Update vendor/fb303/FacebookService.py
- 8a2143c: updated vendor files
- e719cf8: updated vendor files
- 245b4a9: attempting to update poetry files
- e3e9611: Merge branch 'main' into main
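Taken together, the commit messages describe an upsert entry point that no longer depends on DataFusion and that returns a small result dataclass. As a rough, hypothetical sketch of how such an API might be called (the exact Table.upsert signature, the catalog and table names, and the result fields below are assumptions inferred from the commit messages, not shown on this page):

```python
# Hypothetical usage sketch; the upsert signature, table name, and result fields
# are assumptions inferred from the commit messages above.
import pyarrow as pa

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")           # any configured catalog
tbl = catalog.load_table("db.customers")    # hypothetical existing Iceberg table

df = pa.table({"id": [1, 2], "name": ["alice", "bob"]})

# Match on "id": rows whose id already exists get updated, the rest get inserted.
result = tbl.upsert(df, join_cols=["id"])           # assumed signature
print(result.rows_updated, result.rows_inserted)    # assumed result-dataclass fields
```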
New file added in this PR (+118 lines):

```python
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
import functools
import operator

import pyarrow as pa
from pyarrow import Table as pyarrow_table
from pyarrow import compute as pc

from pyiceberg.expressions import (
    And,
    BooleanExpression,
    EqualTo,
    In,
    Or,
)


def create_match_filter(df: pyarrow_table, join_cols: list[str]) -> BooleanExpression:
    unique_keys = df.select(join_cols).group_by(join_cols).aggregate([])

    if len(join_cols) == 1:
        return In(join_cols[0], unique_keys[0].to_pylist())
    else:
        return Or(*[And(*[EqualTo(col, row[col]) for col in join_cols]) for row in unique_keys.to_pylist()])


def has_duplicate_rows(df: pyarrow_table, join_cols: list[str]) -> bool:
    """Check for duplicate rows in a PyArrow table based on the join columns."""
    return len(df.select(join_cols).group_by(join_cols).aggregate([([], "count_all")]).filter(pc.field("count_all") > 1)) > 0


def get_rows_to_update(source_table: pa.Table, target_table: pa.Table, join_cols: list[str]) -> pa.Table:
    """
    Return a table with rows that need to be updated in the target table based on the join columns.

    When a row is matched, an additional scan is done to evaluate the non-key columns to detect if an actual change has occurred.
    Only matched rows that have an actual change to a non-key column value will be returned in the final output.
    """
    all_columns = set(source_table.column_names)
    join_cols_set = set(join_cols)

    non_key_cols = list(all_columns - join_cols_set)

    match_expr = functools.reduce(operator.and_, [pc.field(col).isin(target_table.column(col).to_pylist()) for col in join_cols])

    matching_source_rows = source_table.filter(match_expr)

    rows_to_update = []

    for index in range(matching_source_rows.num_rows):
        source_row = matching_source_rows.slice(index, 1)

        target_filter = functools.reduce(operator.and_, [pc.field(col) == source_row.column(col)[0].as_py() for col in join_cols])

        matching_target_row = target_table.filter(target_filter)

        if matching_target_row.num_rows > 0:
            needs_update = False

            for non_key_col in non_key_cols:
                source_value = source_row.column(non_key_col)[0].as_py()
                target_value = matching_target_row.column(non_key_col)[0].as_py()

                if source_value != target_value:
                    needs_update = True
                    break

            if needs_update:
                rows_to_update.append(source_row)

    if rows_to_update:
        rows_to_update_table = pa.concat_tables(rows_to_update)
    else:
        rows_to_update_table = pa.Table.from_arrays([], names=source_table.column_names)

    common_columns = set(source_table.column_names).intersection(set(target_table.column_names))
    rows_to_update_table = rows_to_update_table.select(list(common_columns))

    return rows_to_update_table


def get_rows_to_insert(source_table: pa.Table, target_table: pa.Table, join_cols: list[str]) -> pa.Table:
    source_filter_expr = pc.scalar(True)

    for col in join_cols:
        target_values = target_table.column(col).to_pylist()
        expr = pc.field(col).isin(target_values)

        if source_filter_expr is None:
            source_filter_expr = expr
        else:
            source_filter_expr = source_filter_expr & expr

    non_matching_expr = ~source_filter_expr

    source_columns = set(source_table.column_names)
    target_columns = set(target_table.column_names)

    common_columns = source_columns.intersection(target_columns)

    non_matching_rows = source_table.filter(non_matching_expr).select(common_columns)

    return non_matching_rows
```
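For orientation, here is a minimal sketch of how these helpers fit together on small in-memory tables. It assumes the four functions above are in scope (the new file's module path is not shown on this page); the expected results follow directly from the logic in the diff.

```python
import pyarrow as pa

# Assumes create_match_filter, has_duplicate_rows, get_rows_to_update, and
# get_rows_to_insert from the file above are importable; the module path is an
# assumption, so they are referenced here as if already in scope.

target = pa.table({"id": [1, 2], "name": ["alice", "bob"]})
source = pa.table({"id": [2, 3], "name": ["bobby", "carol"]})
join_cols = ["id"]

# Reject sources with duplicate keys up front; a key matching more than one
# source row would make the upsert ambiguous.
assert not has_duplicate_rows(source, join_cols)

# Keys present in the target whose non-key columns differ -> the update set.
updates = get_rows_to_update(source, target, join_cols)   # one row: id=2, name="bobby"

# Keys absent from the target -> the insert set.
inserts = get_rows_to_insert(source, target, join_cols)   # one row: id=3, name="carol"

# An Iceberg row filter covering the source keys, e.g. In("id", [2, 3]),
# usable to scope the rewrite of the affected target rows.
row_filter = create_match_filter(source, join_cols)
```

Note that get_rows_to_update compares each matched row column by column in Python, so its cost grows with the number of matched rows rather than being pushed down into Arrow compute.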