
Upsertion memory usage grows exponentially as table size grows #2138

@mwa28

Description


Apache Iceberg version

0.9.0

Please describe the bug 🐞

Hello,

I am trying out the new upsert method on a table created as follows in the AWS Glue catalog:

from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.table.sorting import SortField, SortOrder
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import NestedField, StringType

catalog = load_catalog("glue", **{"type": "glue"})
schema = Schema(
    NestedField(1, "dt_insert", StringType(), required=True),
    NestedField(2, "controller_id", StringType(), required=True),
    NestedField(3, "timestamp", StringType(), required=True),
    NestedField(4, "data_type", StringType(), required=True),
    NestedField(5, "parameter", StringType(), required=True),
    NestedField(6, "value", StringType(), required=True),
    NestedField(7, "data_source", StringType(), required=True),
    NestedField(8, "unique_id", StringType(), required=True),
    NestedField(9, "dt_import_utc", StringType(), required=True),
    identifier_field_ids=[8],  # 'unique_id' is the primary key
)
catalog.create_table(
    identifier="<database>.<table_name>",
    schema=schema,
    partition_spec=PartitionSpec(
        PartitionField(
            source_id=9,
            field_id=1000,
            transform=IdentityTransform(),
            name="dt_import_utc",
        ),
    ),
    sort_order=SortOrder(
        SortField(source_id=2),  # controller_id
        SortField(source_id=4),  # data_type
        SortField(source_id=3),  # timestamp
        SortField(source_id=1),  # dt_insert
    ),
    location="s3://bucket/prefix/catalog",
    properties={
        "write.format.default": "parquet",
        "write.target-file-size-bytes": "134217728",  # 128 MB
        "write.metadata.delete-after-commit.enabled": "true",
        "write.metadata.previous-versions-max": "5",
    },
)

unique_id is the concatenation of columns 1, 2, 3, 4, and 5, joined with underscores.
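
For reference, the key is built roughly like this (a minimal sketch; the sample values are made up):

import pandas as pd

# Hypothetical batch; column names follow the schema above.
df = pd.DataFrame({
    "dt_insert": ["2024-01-01"],
    "controller_id": ["ctrl-001"],
    "timestamp": ["2024-01-01T00:00:00"],
    "data_type": ["measurement"],
    "parameter": ["pressure"],
    "value": ["1.2"],
    "data_source": ["plc"],
    "dt_import_utc": ["2024-01-01"],
})

# unique_id is columns 1-5 joined with underscores.
key_cols = ["dt_insert", "controller_id", "timestamp", "data_type", "parameter"]
df["unique_id"] = df[key_cols].agg("_".join, axis=1)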

My upsert input is a simple pandas DataFrame dumped to a dict and converted to a PyArrow table for the upsert (I know PyArrow accepts a pandas DataFrame directly, but for internal reasons the DataFrame has to be written in one stage and read back in another, so I dump it as JSON for readability).
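
A minimal sketch of that intermediate hand-off, assuming it looks roughly like this (the file path is hypothetical):

import json

# Stage 1: dump the frame as JSON so the payload stays readable between stages.
with open("stage_payload.json", "w") as f:
    json.dump(df.to_dict(orient="list"), f)

# Stage 2: read it back; to_upsert is the dict consumed by the upsert code below.
with open("stage_payload.json") as f:
    to_upsert = json.load(f)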

import pyarrow as pa

iceberg_table = catalog.load_table(table_name)
table = pa.Table.from_pydict(to_upsert, schema=iceberg_table.schema().as_arrow())
iceberg_table.upsert(df=table, join_cols=["unique_id"])

With AWS Athena, an equivalent MERGE INTO statement completes in about 3 seconds and scans 1.11 MB of data:

MERGE INTO <database>.<table> target
USING <database>.<temp_table> source
ON (target."unique_id" = source."unique_id")
WHEN MATCHED THEN
    UPDATE SET "dt_insert" = source."dt_insert", "controller_id" = source."controller_id", "timestamp" = source."timestamp", "data_type" = source."data_type", "parameter" = source."parameter", "value" = source."value", "data_source" = source."data_source", "dt_import_utc" = source."dt_import_utc", "unique_id" = source."unique_id"
WHEN NOT MATCHED THEN
    INSERT ("dt_insert", "controller_id", "timestamp", "data_type", "parameter", "value", "data_source", "dt_import_utc", "unique_id")
    VALUES (source."dt_insert", source."controller_id", source."timestamp", source."data_type", source."parameter", source."value", source."data_source", source."dt_import_utc", source."unique_id")

(statement generated by awswrangler's to_iceberg)

Meanwhile, when I try the PyIceberg upsert, it uses more than 10,240 MB of memory. I am running it on AWS Lambda, and it hits an out-of-memory error.

I have no issue with the append() function, which completes fairly quickly, but it seems the upsert needs further optimization so that it retrieves only the relevant data.
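
To illustrate why I suspect the read side, here is a rough diagnostic sketch; it is only my assumption about what the upsert ends up reading, not a confirmed internal. It plans a scan matching the incoming keys and sums the sizes of the data files it would touch. Since the table is partitioned only by dt_import_utc, a unique_id filter cannot prune partitions and has to rely on column statistics alone.

from pyiceberg.expressions import In

keys = table.column("unique_id").to_pylist()
scan = iceberg_table.scan(row_filter=In("unique_id", keys))

# Sum the sizes of the data files the scan plans to read; if this is close to the
# full table size, a key-matching read is effectively pulling in the whole table.
planned_bytes = sum(task.file.file_size_in_bytes for task in scan.plan_files())
print(f"a key-matching scan would read ~{planned_bytes / 1024 ** 2:.0f} MB")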

The table is currently about 18.5 GB for both the Athena statement and the PyIceberg upsert call.
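
As a comparison point, here is a possible workaround sketch, not the built-in upsert path, and only valid under the assumption that the rows being replaced share the batch's dt_import_utc partition value: delete the matching keys inside that partition, then append the new rows.

from pyiceberg.expressions import And, EqualTo, In

batch_partition = "2024-01-01"  # hypothetical dt_import_utc of this batch
keys = table.column("unique_id").to_pylist()

# Delete existing rows for these keys, scoped to a single partition so the
# operation does not have to touch the whole 18.5 GB table.
iceberg_table.delete(
    delete_filter=And(EqualTo("dt_import_utc", batch_partition), In("unique_id", keys))
)

# Re-insert the batch; delete + append then behaves like the MERGE for this case.
iceberg_table.append(table)

Unlike a real upsert this takes two commits and rewrites rows even when nothing changed, but it keeps the work bounded to one partition.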

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
