
[Iceberg]Support setting warehouse data directory for Hadoop catalog #24397

Open · wants to merge 2 commits into master from support_seperate_write_data_location
Conversation

hantangwangd (Member) commented Jan 18, 2025

Description

This PR enables the Presto Iceberg Hadoop catalog to specify an independent warehouse data directory for storing table data files. This way, we can manage metadata files on HDFS and store data files on object stores in a formal production environment.

See issue: #24383

Motivation and Context

Enables Presto Iceberg to leverage the powerful capabilities of object stores.

Impact

The Hadoop catalog gains the capability of leveraging object stores.

Test Plan

  • Build an object storage environment based on MinIO in Docker, configure iceberg.catalog.warehouse to a local file path and iceberg.catalog.warehouse.datadir to an S3 path, then fully run IcebergDistributedTestBase, IcebergDistributedSmokeTestBase, and TestIcebergDistributedQueries in the CI tests (a sketch of such a configuration follows).
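
For illustration, a hedged sketch of the kind of connector configuration this test plan describes (the warehouse path, bucket name, and MinIO endpoint are placeholders, not values from the PR):

import com.google.common.collect.ImmutableMap;
import java.util.Map;

// Hypothetical setup: metadata on the local filesystem, data files on MinIO/S3.
Map<String, String> icebergProperties = ImmutableMap.<String, String>builder()
        .put("iceberg.catalog.type", "hadoop")
        .put("iceberg.catalog.warehouse", "/tmp/iceberg-warehouse")        // local metadata root
        .put("iceberg.catalog.warehouse.datadir", "s3://test-bucket/data") // data files written here
        .put("hive.s3.endpoint", "http://127.0.0.1:9000")                  // MinIO endpoint (placeholder)
        .put("hive.s3.path-style-access", "true")
        .build();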

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Add table property ``write_data_path`` to specify an independent data write path for Iceberg tables :pr:`24221`
* Add connector configuration property ``iceberg.catalog.warehouse.datadir`` for the Hadoop catalog to specify the root data write path for its newly created tables :pr:`24221`

hantangwangd force-pushed the support_seperate_write_data_location branch from 0279ea7 to 2390fb0 on January 18, 2025 at 15:14
else {
    throw new PrestoException(NOT_SUPPORTED, "Not support set write_data_path on catalog: " + catalogType);
}
}
Member (suggested change):
String writeDataLocation = IcebergTableProperties.getWriteDataLocation(tableMetadata.getProperties());
if (!Strings.isNullOrEmpty(writeDataLocation)) {
    if (catalogType.equals(CatalogType.HADOOP)) {
        tableProperties.put(WRITE_DATA_LOCATION, writeDataLocation);
    } 
    else {
        throw new PrestoException(NOT_SUPPORTED, "Not supported set write_data_path on catalog: " + catalogType);
    }
} 
else {
    Optional<String> dataLocation = getDataLocationBasedOnWarehouseDataDir(tableMetadata.getSchemaTableName());
    dataLocation.ifPresent(location -> tableProperties.put(WRITE_DATA_LOCATION, location));
}

hantangwangd (Member Author):

Before setting a location based on the warehouse data dir, we first need to check whether 'write_data_path' is set in the table creation statement, and ensure that non-Hadoop catalogs, including the Hive catalog, do not allow this property to be explicitly set. So I extracted this logic into IcebergUtil.populateTableProperties(...) and invoked it from both IcebergNativeMetadata and IcebergHiveMetadata. It seems we cannot do this only in IcebergNativeMetadata. Do you think this makes sense?
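
For readers following the thread, a minimal sketch of what the helper referenced in the suggestion above might look like; this is an assumption for illustration, not the PR's actual implementation, and the <warehouseDataDir>/<schema>/<table> layout is inferred:

private Optional<String> getDataLocationBasedOnWarehouseDataDir(SchemaTableName schemaTableName)
{
    // Hypothetical: only the Hadoop catalog honors iceberg.catalog.warehouse.datadir.
    if (catalogType != CatalogType.HADOOP || Strings.isNullOrEmpty(warehouseDataDir)) {
        return Optional.empty();
    }
    return Optional.of(format("%s/%s/%s",
            warehouseDataDir, schemaTableName.getSchemaName(), schemaTableName.getTableName()));
}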

hantangwangd (Member Author):

Also, tableProperties is an ImmutableMap, so it seems we cannot simply call tableProperties.put(WRITE_DATA_LOCATION, location) here.
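
To illustrate the point (a standalone snippet, not code from the PR), Guava's ImmutableMap rejects mutation at runtime, which is why the suggestion needs the builder pattern instead:

import com.google.common.collect.ImmutableMap;
import java.util.Map;

Map<String, String> tableProperties = ImmutableMap.of();
// tableProperties.put("write.data.path", "s3://bucket/data"); // would throw UnsupportedOperationException

// Collecting entries in a builder and building once avoids the problem:
ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builder();
propertiesBuilder.put("write.data.path", "s3://bucket/data"); // placeholder path
Map<String, String> built = propertiesBuilder.build();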

Member:

Ohh sorry, I missed that. I was suggesting this live in IcebergUtil.populateTableProperties(...), so it should be propertiesBuilder.put, not tableProperties.put (I edited the code suggestion above).

Also, in the code above, since we already throw an exception when it is not the HADOOP catalog, do you think we need to check anything more here? Or am I missing something?

hantangwangd (Member Author):

Good suggestion, I get your point now. I moved the method populateTableProperties(...) from IcebergUtil to IcebergAbstractMetadata and put the entire setting logic for WRITE_DATA_LOCATION that you suggested above into it, so there is no need to perform the checks and settings in IcebergNativeMetadata again. Please take a look when you get a chance, thanks!

hantangwangd force-pushed the support_seperate_write_data_location branch from 2390fb0 to c3c00e1 on January 26, 2025 at 18:05
hantangwangd marked this pull request as ready for review on January 26, 2025 at 19:07
hantangwangd changed the title from [WIP][Iceberg]Support setting warehouse data directory for Hadoop catalog to [Iceberg]Support setting warehouse data directory for Hadoop catalog on January 27, 2025
hantangwangd force-pushed the support_seperate_write_data_location branch from c3c00e1 to d18cd8b on January 27, 2025 at 14:05
steveburnett (Contributor) left a comment:

Thanks for the doc! Looks good, only a few nits of punctuation and capitalization.

presto-docs/src/main/sphinx/connector/iceberg.rst (3 review comments; outdated, resolved)
hantangwangd force-pushed the support_seperate_write_data_location branch from d18cd8b to 5fd6b9c on January 27, 2025 at 16:01
steveburnett (Contributor) left a comment:

LGTM! (docs)

Pulled updated branch, new local doc build, looks good. Thanks!

hantangwangd (Member Author):

Thanks @steveburnett for your suggestion, fixed!

agrawalreetika (Member) left a comment:

Thanks for the change. LGTM

@@ -263,6 +263,28 @@
</exclusions>
</dependency>

<dependency>
Member:

nit: Can we move this test dependency to follow the other test dependencies after the comment <!-- for testing -->?

https://github.com/prestodb/presto/blob/5fd6b9cdedcbb1b9a8b05951e14273882d586938/presto-iceberg/pom.xml#L517C9-L517C29

imjalpreet (Member) left a comment:

@hantangwangd Thanks for the PR! I took a first pass and had some minor comments.

@@ -1023,6 +1040,62 @@ public void setTableProperties(ConnectorSession session, ConnectorTableHandle ta
transaction.commitTransaction();
}

protected Map<String, String> populateTableProperties(ConnectorTableMetadata tableMetadata, com.facebook.presto.iceberg.FileFormat fileFormat, ConnectorSession session, CatalogType catalogType)
{
ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builderWithExpectedSize(5);
Member:

nit: should we update the builder size here? I think it was set to 5 when we initially had 5 properties, but I can see we now have more than 5.
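
For context on why this is only a nit (a standalone illustration, not PR code): the argument to Guava's builderWithExpectedSize is a sizing hint, so adding more entries than expected still succeeds, at worst costing an internal reallocation:

import com.google.common.collect.ImmutableMap;

ImmutableMap.Builder<String, String> builder = ImmutableMap.builderWithExpectedSize(5);
for (int i = 0; i < 8; i++) {
    builder.put("key" + i, "value" + i); // adding more than 5 entries is still fine
}
ImmutableMap<String, String> map = builder.build(); // contains all 8 entries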

propertiesBuilder.put(WRITE_DATA_LOCATION, writeDataLocation);
}
else {
throw new PrestoException(NOT_SUPPORTED, "Not support set write_data_path on catalog: " + catalogType);
Member:

nit:

Suggested change:
-    throw new PrestoException(NOT_SUPPORTED, "Not support set write_data_path on catalog: " + catalogType);
+    throw new PrestoException(NOT_SUPPORTED, "Table property write_data_path is not supported with catalog type: " + catalogType);

}

@Config("iceberg.catalog.warehouse.datadir")
@ConfigDescription("Iceberg catalog default root data writing directory")
Member:

nit: should we mention in the ConfigDescription that it is only supported for the Hadoop catalog?
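
One way the nit could be applied, as a hedged sketch (the setter name and enclosing config class are assumed from Presto's usual airlift config pattern, not taken from the PR):

@Config("iceberg.catalog.warehouse.datadir")
@ConfigDescription("Root data writing directory for new tables (only supported for the Hadoop catalog)")
public IcebergConfig setCatalogWarehouseDataDir(String catalogWarehouseDataDir)
{
    this.catalogWarehouseDataDir = catalogWarehouseDataDir;
    return this;
}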

Member:

Does it make sense to add a check that throws an error at server startup if this config is present in the config file when the catalog type is not Hadoop?
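
A hedged sketch of what such a startup check could look like (the placement, field names, and exception type are assumptions, e.g. in the catalog factory's constructor):

// Hypothetical guard: reject the config early when the catalog type cannot honor it.
if (!Strings.isNullOrEmpty(catalogWarehouseDataDir) && catalogType != CatalogType.HADOOP) {
    throw new IllegalArgumentException(format(
            "iceberg.catalog.warehouse.datadir is only supported when iceberg.catalog.type is HADOOP, but it is %s",
            catalogType));
}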

@@ -52,6 +52,7 @@ public class IcebergNativeCatalogFactory
private final String catalogName;
protected final CatalogType catalogType;
private final String catalogWarehouse;
private final String catalogWarehouseDataDir;
Member:

Should we move this to IcebergNativeMetadata since it's only being used in that class and not in this class?

@@ -150,6 +152,34 @@ public void testShowCreateTable()
")", schemaName, getLocation(schemaName, "orders")));
}

@Test
public void testTableWithSpecifiedWriteDataLocation()
Member:

It looks like testTableWithSpecifiedWriteDataLocation and testShowCreateTableWithSpecifiedWriteDataLocation are the same. Can you please check?

throws IOException
{
String dataWriteLocation = Files.createTempDirectory("test_table_with_specified_write_data_location3").toAbsolutePath().toString();
assertQueryFails(String.format("create table test_table_with_specified_write_data_location3(a int, b varchar) with (write_data_path = '%s')", dataWriteLocation),
Member:

Should we attempt to create a partitioned table for this test?
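
If this suggestion is taken, the test might be extended along these lines (a hedged sketch; the table name and expected-error regex are placeholders, not the connector's actual message):

// Hypothetical variant of the failing-creation test using a partitioned table.
String partitionedWriteLocation = Files.createTempDirectory("test_partitioned_write_data_location").toAbsolutePath().toString();
assertQueryFails(format(
        "CREATE TABLE test_partitioned_write_data_location(a int, b varchar) " +
        "WITH (partitioning = ARRAY['b'], write_data_path = '%s')", partitionedWriteLocation),
        ".*write_data_path.*");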

tdcmeehan self-assigned this on January 29, 2025
ZacBlanco (Contributor) left a comment:

I just have a few high-level comments. At the core, I understand what we're trying to solve, but I'm not sure yet whether this is the right solution. What happens when a table already exists in the warehouse directory but has a data directory that doesn't align with the new datadir property? Should we respect the property or the table in the case of an insert? This could be confusing for users.

And if the table already exists and doesn't have a metadata folder on a "safe" filesystem, should we error, or warn the user? Do we even have a way of knowing that a filesystem is "safe" (supports atomic renames)?

Correct me if I'm wrong, but in theory we could have already supported this use case within the Iceberg connector by using the SetTableProperty procedure and just setting "write.data.path" on the individual tables, right? From my understanding, all this change does is provide a default for new tables and make it a viewable table property. I'm wondering if it might be better to provide this as a schema-level property that users can set, similar to how the Hive connector has a schema-level "location" property. Then we could set defaults for the data path on schema creation, but override them at the table level if we prefer.

However, I don't believe the metadata path could be set that way, since the HadoopCatalog relies on listing the warehouse metadata directories to get the schemas and tables.

Just some thoughts for discussion. I think I just want to refine our approach and make sure there isn't any ambiguous behavior from the user's perspective.

icebergProperties.put("iceberg.file-format", format.name());
icebergProperties.putAll(getConnectorProperties(CatalogType.valueOf(catalogType), icebergDataDirectory));
icebergProperties.putAll(extraConnectorProperties);
queryRunner.createCatalog(ICEBERG_CATALOG, "iceberg", ImmutableMap.copyOf(icebergProperties));
Contributor:

Is there any reason to change this?

``iceberg.catalog.cached-catalog-num`` The number of Iceberg catalogs to cache. This property is ``10``
required if the ``iceberg.catalog.type`` is ``hadoop``.
Otherwise, it will be ignored.
======================================================= ============================================================= ============

Configure the `Amazon S3 <https://prestodb.io/docs/current/connector/hive.html#amazon-s3-configuration>`_
properties to specify a S3 location as the warehouse data directory for the Hadoop catalog. This way,
the data and delete files of Iceberg tables are stored in S3. An example configuration includes:
Contributor:

I think we should explain somewhere in this section when users should specify the datadir and that it needs to be on a filesystem which supports atomic renames.

@@ -370,6 +393,9 @@ Property Name Description
``location`` Optionally specifies the file system location URI for
the table.

``write_data_path`` Optionally specifies the file system location URI for
Contributor:

I would actually prefer that we use the same property name as Iceberg, to make it less confusing for users. I know we haven't followed this convention before, but I feel it makes more sense than what we have now. It allows for more continuity and makes it easier for users to look up the Iceberg property reference as well.
