[Iceberg]Support setting warehouse data directory for Hadoop catalog #24397
Conversation
Force-pushed from 0279ea7 to 2390fb0.
        else {
            throw new PrestoException(NOT_SUPPORTED, "Not support set write_data_path on catalog: " + catalogType);
        }
    }
Would it be better to have the logic for setting WRITE_DATA_LOCATION here itself? https://github.com/prestodb/presto/pull/24397/files#diff-1bedde049adaf86f49d050ee5b7dfd4584eb9669321946ece016873970d4e9f1R324
String writeDataLocation = IcebergTableProperties.getWriteDataLocation(tableMetadata.getProperties());
if (!Strings.isNullOrEmpty(writeDataLocation)) {
    if (catalogType.equals(CatalogType.HADOOP)) {
        tableProperties.put(WRITE_DATA_LOCATION, writeDataLocation);
    }
    else {
        throw new PrestoException(NOT_SUPPORTED, "Not supported set write_data_path on catalog: " + catalogType);
    }
}
else {
    Optional<String> dataLocation = getDataLocationBasedOnWarehouseDataDir(tableMetadata.getSchemaTableName());
    dataLocation.ifPresent(location -> tableProperties.put(WRITE_DATA_LOCATION, location));
}
Before setting a warehouse-data-dir based location, we first need to check write_data_path in the table creation statement and ensure that non-Hadoop catalogs, including the Hive catalog, do not allow this property to be explicitly set. So I extracted this part of the logic into IcebergUtil.populateTableProperties(...) and invoked it from both IcebergNativeMetadata and IcebergHiveMetadata. It seems we cannot do it only in IcebergNativeMetadata. Do you think this makes sense?
Also, tableProperties is an ImmutableMap, so it seems that we cannot simply execute tableProperties.put(WRITE_DATA_LOCATION, location) here.
Ohh sorry, I missed that. I was suggesting it be in IcebergUtil.populateTableProperties(...), so it should be propertiesBuilder.put, not tableProperties.put (edited the code suggestion above).
Also, in the code above, since we already throw an exception when it is not the HADOOP catalog, do you think we need to check anything more here? Or am I missing something?
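For context, a minimal standalone illustration of the difference between the two (the class name, variable names, and the sample key and value are placeholders, not code from this PR):

import com.google.common.collect.ImmutableMap;
import java.util.Map;

public class ImmutableMapPutSketch
{
    public static void main(String[] args)
    {
        String writeDataLocationKey = "write.data.path";   // stand-in for WRITE_DATA_LOCATION
        String location = "s3://bucket/warehouse-data/";   // stand-in value

        // An ImmutableMap rejects mutation at runtime:
        Map<String, String> tableProperties = ImmutableMap.of();
        try {
            tableProperties.put(writeDataLocationKey, location);
        }
        catch (UnsupportedOperationException e) {
            System.out.println("ImmutableMap rejects put: " + e);
        }

        // The builder accepts puts until build() is called:
        ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builder();
        propertiesBuilder.put(writeDataLocationKey, location);
        System.out.println(propertiesBuilder.build());
    }
}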
Good suggestion, I got your point now. I moved the method populateTableProperties(...) from IcebergUtil to IcebergAbstractMetadata and put the entire setting logic for WRITE_DATA_LOCATION you suggested above into it, so there is no need to perform the checks and settings in IcebergNativeMetadata again. Please take a look when you are available, thanks!
Force-pushed from 2390fb0 to c3c00e1.
Force-pushed from c3c00e1 to d18cd8b.
Thanks for the doc! Looks good, only a few nits of punctuation and capitalization.
Force-pushed from d18cd8b to 5fd6b9c.
LGTM! (docs)
Pulled updated branch, new local doc build, looks good. Thanks!
Thanks @steveburnett for your suggestion, fixed!
Thanks for the change. LGTM
@@ -263,6 +263,28 @@
        </exclusions>
    </dependency>

    <dependency>
nit: Can we move this test dependency to follow the other test dependencies after the comment <!-- for testing -->?
@hantangwangd Thanks for the PR! I took a first pass and had some minor comments.
@@ -1023,6 +1040,62 @@ public void setTableProperties(ConnectorSession session, ConnectorTableHandle ta
        transaction.commitTransaction();
    }

    protected Map<String, String> populateTableProperties(ConnectorTableMetadata tableMetadata, com.facebook.presto.iceberg.FileFormat fileFormat, ConnectorSession session, CatalogType catalogType)
    {
        ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builderWithExpectedSize(5);
nit: should we update the builder size here? I think it was set to 5 initially when we had 5 properties, but I can see that we now have more than 5.
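For reference, builderWithExpectedSize is only a sizing hint, so exceeding it still works and simply triggers a resize; keeping the hint close to the real property count avoids that minor cost. A one-line sketch with an updated hint (the exact count is an assumption):

// Sizing hint only; the real number of properties added in this PR may differ.
ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builderWithExpectedSize(8);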
                propertiesBuilder.put(WRITE_DATA_LOCATION, writeDataLocation);
            }
            else {
                throw new PrestoException(NOT_SUPPORTED, "Not support set write_data_path on catalog: " + catalogType);
nit:
- throw new PrestoException(NOT_SUPPORTED, "Not support set write_data_path on catalog: " + catalogType);
+ throw new PrestoException(NOT_SUPPORTED, "Table property write_data_path is not supported with catalog type: " + catalogType);
    }

    @Config("iceberg.catalog.warehouse.datadir")
    @ConfigDescription("Iceberg catalog default root data writing directory")
nit: should we mention in the ConfigDescription that it is only supported for the Hadoop catalog?
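One possible wording, sketched below; the enclosing config class and setter name are assumptions and not copied from the PR:

// Sketch only: class and setter name are assumed, not taken from this change.
@Config("iceberg.catalog.warehouse.datadir")
@ConfigDescription("Iceberg catalog default root data writing directory (only supported for the Hadoop catalog)")
public IcebergConfig setCatalogWarehouseDataDir(String catalogWarehouseDataDir)
{
    this.catalogWarehouseDataDir = catalogWarehouseDataDir;
    return this;
}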
Does it make sense to add a check that throws an error at server startup if this config is present in the config file when the catalog type is not Hadoop?
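If that direction is taken, a minimal sketch of such a fail-fast check might look like this (the method and parameter names are assumptions, not code from the PR); it would run wherever the catalog factory reads its config, so a non-Hadoop catalog with a configured data directory is rejected at startup rather than silently ignored:

import static com.google.common.base.Strings.isNullOrEmpty;

// Hypothetical validation sketch: reject the data-dir config for unsupported catalog types.
static void validateWarehouseDataDir(CatalogType catalogType, String catalogWarehouseDataDir)
{
    if (!isNullOrEmpty(catalogWarehouseDataDir) && catalogType != CatalogType.HADOOP) {
        throw new IllegalArgumentException(
                "iceberg.catalog.warehouse.datadir is only supported when iceberg.catalog.type is HADOOP, found: " + catalogType);
    }
}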
@@ -52,6 +52,7 @@ public class IcebergNativeCatalogFactory
    private final String catalogName;
    protected final CatalogType catalogType;
    private final String catalogWarehouse;
    private final String catalogWarehouseDataDir;
Should we move this to IcebergNativeMetadata since it's only being used in that class and not in this class?
@@ -150,6 +152,34 @@ public void testShowCreateTable()
                ")", schemaName, getLocation(schemaName, "orders")));
    }

    @Test
    public void testTableWithSpecifiedWriteDataLocation()
It looks like testTableWithSpecifiedWriteDataLocation and testShowCreateTableWithSpecifiedWriteDataLocation are the same. Can you please check?
            throws IOException
    {
        String dataWriteLocation = Files.createTempDirectory("test_table_with_specified_write_data_location3").toAbsolutePath().toString();
        assertQueryFails(String.format("create table test_table_with_specified_write_data_location3(a int, b varchar) with (write_data_path = '%s')", dataWriteLocation),
Should we attempt to create a partitioned table for this test?
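If that's wanted, a rough sketch of a partitioned variant for the Hadoop-catalog case (where write_data_path is accepted) could look like the following; the test name, partitioning column, and surrounding test framework helpers are assumptions, not code from the PR:

// Hypothetical test sketch: a partitioned table with an explicit write_data_path,
// verifying that writes and reads still work against the specified directory.
@Test
public void testPartitionedTableWithSpecifiedWriteDataLocation()
        throws IOException
{
    String dataWriteLocation = Files.createTempDirectory("test_partitioned_table_write_data_location").toAbsolutePath().toString();
    assertUpdate(String.format(
            "CREATE TABLE test_partitioned_table_write_data_location(a int, b varchar) " +
                    "WITH (partitioning = ARRAY['b'], write_data_path = '%s')", dataWriteLocation));
    assertUpdate("INSERT INTO test_partitioned_table_write_data_location VALUES (1, 'x'), (2, 'y')", 2);
    assertQuery("SELECT count(*) FROM test_partitioned_table_write_data_location", "VALUES 2");
    assertUpdate("DROP TABLE test_partitioned_table_write_data_location");
}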
I just have a few high level comments. At the core, I understand what we're trying to solve, but I'm not sure yet if this is the right solution. What happens when we have a table which already exists at the warehouse directory but has a data directory which doesn't align with the new datadir property? Should we respect the property or the table in the case of an insert? This could be confusing for users.
And if the table already exists and doesn't have a metadata folder on a "safe" filesystem, should we error, or warn the user? Do we even have a way of knowing that a filesystem is "safe" (supports atomic renames)?
Correct me if I'm wrong, but in theory we could have already supported this use case within the iceberg connector by using the SetTableProperty procedure and just setting "write.data.path" on the individual tables, right? From my understanding, all this change does is provide a default for new tables and make it a viewable table property. I'm wondering if it might be better to provide this as a schema-level property that users can set, similar to how the hive connector has a schema-level "location" property. Then we can set defaults for the data path on schema creation, but override it at the table level if we prefer.
However, I don't believe the metadata path could be set that way since the HadoopCatalog relies on listing the warehouse metadata directories to get the schemas and tables.
Just some thoughts for discussion. I think I just want to refine our approach and make sure there isn't any ambiguous behavior from the user perspective.
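To make the schema-level idea concrete, here is a purely illustrative sketch (none of this exists in the PR; the property name, class name, and SPI usage are assumptions) of registering a Hive-style schema property that could supply the default data path at CREATE SCHEMA time:

import com.facebook.presto.spi.session.PropertyMetadata;

import static com.facebook.presto.spi.session.PropertyMetadata.stringProperty;

public class IcebergSchemaPropertiesSketch
{
    // Hypothetical schema-level property mirroring the table-level one in this PR.
    public static final String SCHEMA_WRITE_DATA_PATH = "write_data_path";

    public static PropertyMetadata<String> writeDataPathProperty()
    {
        return stringProperty(
                SCHEMA_WRITE_DATA_PATH,
                "Default root data writing directory for tables created in this schema",
                null,
                false);
    }
}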
        icebergProperties.put("iceberg.file-format", format.name());
        icebergProperties.putAll(getConnectorProperties(CatalogType.valueOf(catalogType), icebergDataDirectory));
        icebergProperties.putAll(extraConnectorProperties);
        queryRunner.createCatalog(ICEBERG_CATALOG, "iceberg", ImmutableMap.copyOf(icebergProperties));
Is there any reason to change this?
``iceberg.catalog.cached-catalog-num``                    The number of Iceberg catalogs to cache. This property is    ``10``
                                                          required if the ``iceberg.catalog.type`` is ``hadoop``.
                                                          Otherwise, it will be ignored.
======================================================= ============================================================= ============

Configure the `Amazon S3 <https://prestodb.io/docs/current/connector/hive.html#amazon-s3-configuration>`_
properties to specify a S3 location as the warehouse data directory for the Hadoop catalog. This way,
the data and delete files of Iceberg tables are stored in S3. An example configuration includes:
I think we should explain somewhere in this section when users should specify the datadir and that it needs to be on a filesystem which supports atomic renames.
@@ -370,6 +393,9 @@ Property Name Description
``location``                        Optionally specifies the file system location URI for
                                    the table.

``write_data_path``                 Optionally specifies the file system location URI for
I would actually prefer if we used the same property name as in Iceberg to make it less confusing to users. I know we haven't followed this before, but I feel like it makes more sense than what we have now. It allows for more continuity and makes it easier for users to look up the Iceberg property reference as well.
Description
This PR enables the Presto Iceberg Hadoop catalog to specify an independent warehouse data directory in which to store table data files. This way, we can manage metadata files on HDFS and store data files on object stores in a production environment.
See issue: #24383
Motivation and Context
Enabling Presto Iceberg to leverage the powerful capabilities of object stores.
Impact
The Hadoop catalog gains the capability of leveraging object stores.
Test Plan
Set iceberg.catalog.warehouse to a local file path and iceberg.catalog.warehouse.datadir to an S3 path, then fully run IcebergDistributedTestBase, IcebergDistributedSmokeTestBase, and TestIcebergDistributedQueries in CI tests (a rough properties sketch follows).
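A rough, hypothetical sketch of the connector properties that test setup describes (the bucket name and local path are placeholders, not values taken from the PR):

// Illustrative only: connector properties as a test query runner might pass them.
Map<String, String> icebergProperties = ImmutableMap.<String, String>builder()
        .put("iceberg.catalog.type", "hadoop")
        .put("iceberg.catalog.warehouse", "/tmp/iceberg-warehouse")
        .put("iceberg.catalog.warehouse.datadir", "s3://test-bucket/iceberg-data/")
        .build();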
Contributor checklist
Release Notes