Add documentation for Iceberg support in PrestoCPP #24741

Merged
@steveburnett merged 1 commit into prestodb:master from iceberg-native-doc on Apr 8, 2025

Conversation

agrawalreetika
Member

Description

Add documentation for Iceberg support in PrestoCPP

Motivation and Context

Add documentation for Iceberg support in PrestoCPP

Impact

Iceberg documentation improvement

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@agrawalreetika agrawalreetika requested a review from yingsu00 March 17, 2025 18:46
@agrawalreetika agrawalreetika self-assigned this Mar 17, 2025
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Mar 17, 2025
@prestodb-ci prestodb-ci requested review from a team, nmahadevuni and auden-woolfson and removed request for a team March 17, 2025 18:46
@github-actions github-actions bot added the docs label Mar 17, 2025
@prestodb-ci prestodb-ci requested a review from a team March 17, 2025 18:49
@github-project-automation github-project-automation bot moved this to 🆕 Unprioritized in Presto Documentation Mar 17, 2025
auden-woolfson previously approved these changes Mar 17, 2025
Contributor

@auden-woolfson auden-woolfson left a comment

lgtm

Contributor

@steveburnett steveburnett left a comment

Thanks for the doc! Some nits and suggestions, nothing major.

@github-project-automation github-project-automation bot moved this from 🆕 Unprioritized to 🏗 In progress in Presto Documentation Mar 17, 2025
Contributor

@yingsu00 yingsu00 left a comment

Thanks @agrawalreetika! I see that PrestoCPP and Presto C++ are used interchangeably. Let's standardize on Presto C++.

are ``HIVE``, ``HADOOP``, and ``NESSIE`` and ``REST``.

``iceberg.hadoop.config.resources`` The path(s) for Hadoop configuration resources.
``iceberg.hadoop.config.resources`` The path(s) for Hadoop configuration resources. Yes
Contributor

The Presto C++ Support value is missing

as a join with the data of the equality delete files.

``iceberg.enable-parquet-dereference-pushdown`` Enable parquet dereference pushdown. ``true``
``iceberg.enable-parquet-dereference-pushdown`` Enable parquet dereference pushdown. ``true`` Yes
Contributor

The Presto C++ Support value is missing

statistics file cache.
======================================================= ============================================================= ============
======================================================= ============================================================= ================================== =================== =========================================

Table Properties
Contributor

Can you please also add a section on Presto C++ support for Table Properties? It can just say that it's not supported because writes are not implemented yet.

@@ -1511,13 +1587,23 @@ schema evolution, such as adding, dropping, and renaming columns. With schema
evolution, users can evolve a table schema with SQL after enabling the Presto
Iceberg connector.

Presto C++ Support
Contributor

This is for Partition Column Transform. Could you please add Presto C++ support as well? Thanks!


* Supports reading and writing of DWRF and PARQUET file formats, supports reading ORC file format.
Hive Connector - Iceberg Table Support
Contributor

I think the following structure would be clearer
Supported Connectors

  • Hive connector
    • Hive table support
    • Iceberg table support
  • TPCH connector
    • ....
  • Fuzzer connector

@agrawalreetika
Member Author

@steveburnett @yingsu00 Thanks for your review. I have made the changes based on your comments. Please have a look at your convenience.

Member

@hantangwangd hantangwangd left a comment

Thanks for the doc, it's very helpful. Just a couple of nits and minor things.


``SELECT`` Yes Yes Read is supported in Presto C++ including those with positional delete files.

``INSERT INTO`` Yes No
Member

nit: duplicate with line 1155.


* Only read operations are supported for Iceberg tables.

* The Iceberg connector supports both V1 and V2 tables, including those with positional delete files, but does not support equality delete files.
Member

From a reader's perspective, I'm a little confused by the term Iceberg connector here. Do you think it would be more appropriate to use something like Hive connector for Iceberg, or something else?

Member Author

@agrawalreetika agrawalreetika Mar 18, 2025

So we need to create an Iceberg catalog in Prestissimo as well to access Iceberg tables, but the underlying implementation in Velox goes through the Hive connector itself, which is why I thought of keeping Hive Connector as the main header with the different table support sections under it.

I am open to suggestions here; I'm fine with whatever we all think is clearest.
@yingsu00 What's your opinion on this?

Member

Sorry, I didn't describe it clearly. What I mean is that the content structure Hive connector/Iceberg Table Support is no problem; it's just that the words Iceberg connector here may cause a little confusion. If I were a newcomer, I might wonder whether there is an Iceberg connector for Presto C++ that I haven't noticed.

Contributor

@agrawalreetika: If we use the Java-side Iceberg connector or catalog code, then it's better to call it Iceberg Connector. The reuse of HiveConnector is an internal detail imo.

Contributor

@yingsu00 yingsu00 Mar 18, 2025

I think we could call it Iceberg connector because this is from the Presto C++ point of view, not the Velox point of view. But it would be helpful to explain that the Presto C++ Iceberg connector and catalog are backed by the Velox HiveConnector.

Member Author

@hantangwangd Thanks for the clarification. As @aditi-pandit and @yingsu00 mentioned, from the Presto C++ point of view it's going to be a separate Iceberg connector, configured with something like this:

connector.name=iceberg

That is why I thought calling it iceberg connector would be better in the Presto C++ document.

@hantangwangd @aditi-pandit @yingsu00 Please take a look at the document; it's already updated based on the points we discussed here.
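
To make the `connector.name=iceberg` point above concrete, a catalog file for such a setup might look roughly like the sketch below. Only `connector.name=iceberg` comes from this thread; the file path, the choice of catalog type, and the metastore URI are illustrative assumptions, not values taken from the PR.

```properties
# etc/catalog/iceberg.properties (hypothetical path and catalog name)

# From the thread: the catalog is declared as an Iceberg connector, even though
# Velox serves it through its HiveConnector internally.
connector.name=iceberg

# Assumed for illustration: a Hive Metastore-backed Iceberg catalog.
# The documented catalog types are HIVE, HADOOP, NESSIE, and REST.
iceberg.catalog.type=hive
hive.metastore.uri=thrift://metastore.example.com:9083
```

With a file like this, the catalog name that users see in queries is the file name (`iceberg` here), which is one reason to describe it as an Iceberg connector in user-facing docs.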

Member

Got the point, thank you all for your perspective @aditi-pandit @yingsu00 @agrawalreetika.

Contributor

@steveburnett steveburnett left a comment

Thanks for the update! I pulled the updated branch, ran a new local doc build, and reviewed again.

I found a few nits that aren't directly related to your new content that I hope we can fix here. I don't think that fixing them will add a lot of work or delay the PR.

hantangwangd previously approved these changes Mar 19, 2025
steveburnett previously approved these changes Mar 19, 2025
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thank you!

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Presto Documentation Mar 19, 2025
as a join with the data of the equality delete files.

``iceberg.enable-parquet-dereference-pushdown`` Enable parquet dereference pushdown. ``true``
``iceberg.enable-parquet-dereference-pushdown`` Enable parquet dereference pushdown. ``true`` Yes NA
Contributor

Why is it NA? If it fails now, just say No.


Example: ``/etc/hadoop/conf/core-site.xml.`` This property
is required if the iceberg.catalog.type is ``hadoop``.
Otherwise, it will be ignored.

``iceberg.file-format`` The storage file format for Iceberg tables. The available ``PARQUET``
``iceberg.file-format`` The storage file format for Iceberg tables. The available ``PARQUET`` Yes NA
Contributor

Can we add a little explanation here why it's NA? We can say "NA, write is not supported yet"

Contributor

Same for other NA items. Thanks!

updates.

``iceberg.delete-as-join-rewrite-enabled`` When enabled, equality delete row filtering is applied ``true``
``iceberg.delete-as-join-rewrite-enabled`` When enabled, equality delete row filtering is applied ``true`` Yes Yes
Contributor

I don't think Presto C++ supports it. Presto C++ does not interpret equality delete files as joins, but compiles them into domain filters or filter functions. Also, the PR for reading equality delete files is not merged yet. Just say No for now.

Member Author

Sure, I missed equality delete here.


``iceberg.rows-for-metadata-optimization-threshold`` The maximum number of partitions in an Iceberg table to ``1000``
``iceberg.rows-for-metadata-optimization-threshold`` The maximum number of partitions in an Iceberg table to ``1000`` Yes Yes
Contributor

Is this one on coordinator only?

Member Author

As far as I can see in the code, it is used in IcebergMetadataOptimizer, so I am assuming this is coordinator only. Do you think I should check anything more on this?

========================================== ================================================ =================== ==================
Property Name                              Description                                      Presto Java Support Presto C++ Support
========================================== ================================================ =================== ==================
``iceberg.delete_as_join_rewrite_enabled`` Overrides the behavior of the connector property Yes                 Yes
Contributor

Should be No, or NA

Presto C++ Support
^^^^^^^^^^^^^^^^^^

All above extra hidden metadata tables are supported in Presto C++.
Contributor

Just want to double confirm that you have verified they are working, right?

Member Author

@agrawalreetika agrawalreetika Mar 27, 2025

Apart from $changelog, everything works. $changelog throws an error in Presto C++, which I missed adding here. I opened an issue for it: #24816

@agrawalreetika
Member Author

@yingsu00 I have replaced all the NA values with No for consistency, and I have also made the changes based on your last few comments. Please review whenever you get a chance. Thanks!

Presto C++ Support
^^^^^^^^^^^^^^^^^^

Table properties are not supported in Presto C++ because write operations have not been implemented.
Contributor

I think we should consider re-wording this. It's not quite clear what the statement "Table properties are not supported ..." means. Does that mean reading them? writing them? considering them during writes? I think we should be a bit more specific

Contributor

I think we should consider re-wording this. It's not quite clear what the statement "Table properties are not supported ..." means. Does that mean reading them? writing them? considering them during writes? I think we should be a bit more specific

I think what @ZacBlanco said makes sense. @agrawalreetika Will you be able to try them and update the doc, separating reads and writes?

Presto C++ Support
^^^^^^^^^^^^^^^^^^

All above extra hidden metadata columns are supported in Presto C++.
Contributor

Suggested change
All above extra hidden metadata columns are supported in Presto C++.
All above metadata columns are supported in Presto C++.

above extra hidden

Three adjectives in a row feels like a lot. I think we can reduce them.

Member Author

@agrawalreetika agrawalreetika Apr 1, 2025

Here the existing phrase "extra hidden metadata columns" is taken from the main heading. But if we think we should just rephrase it as "All above metadata columns are supported in Presto C++", that's fine with me. LMK.

Contributor

@ZacBlanco ZacBlanco Apr 1, 2025

It feels overly descriptive to have three adjectives in the sentence; I think this gets the point across. Let's remove them.

Presto C++ Support
^^^^^^^^^^^^^^^^^^

All above extra hidden metadata tables, except `$changelog`, are supported in Presto C++.
Contributor

Suggested change
All above extra hidden metadata tables, except `$changelog`, are supported in Presto C++.
All above metadata tables, except `$changelog`, are supported in Presto C++.

Member Author

Here the existing phrase "extra hidden metadata tables" is taken from the main heading. But if we think we should just rephrase it as "All above metadata tables, except $changelog, are supported in Presto C++.", that's fine with me. LMK.

Presto C++ Support
~~~~~~~~~~~~~~~~~~

Read from the tables with Partition column transform is supported in Presto C++.
Contributor

Suggested change
Read from the tables with Partition column transform is supported in Presto C++.
Reads from tables with partition column transforms is supported in Presto C++.

Presto C++ Support
^^^^^^^^^^^^^^^^^^

Schema Evolution is supported in Presto C++.
Contributor

Suggested change
Schema Evolution is supported in Presto C++.
Schema evolution is supported in Presto C++.

Presto C++ Support
^^^^^^^^^^^^^^^^^^

Presto C++ supports Parquet writer versions V1.
Contributor

This is a little confusing because earlier the properties section stated that writes are not supported?

Member Author

Thanks for pointing it out.
It's actually about reading, so I think I can rephrase it to something like: Presto C++ supports reading Parquet data written with Parquet writer version V1?

Presto C++ Support
^^^^^^^^^^^^^^^^^^

Time Travel is supported in Presto C++.
Contributor

Suggested change
Time Travel is supported in Presto C++.
Time travel queries are supported in Presto C++.

@github-project-automation github-project-automation bot moved this from ✅ Done to 🏗 In progress in Presto Documentation Apr 1, 2025
@agrawalreetika agrawalreetika force-pushed the iceberg-native-doc branch 3 times, most recently from 95463f3 to e2dfc98 Compare April 7, 2025 19:52
Contributor

@ZacBlanco ZacBlanco left a comment

One comment on consistency in our grammar, otherwise I think this looks great. Thanks!

Comment on lines 1676 to 1689
Schema evolution is supported in Presto C++.

Parquet Writer Version
----------------------

Presto now supports Parquet writer versions V1 and V2 for the Iceberg catalog.
It can be toggled using the session property ``parquet_writer_version`` and the config property ``hive.parquet.writer.version``.
Valid values for these properties are ``PARQUET_1_0`` and ``PARQUET_2_0``. Default is ``PARQUET_1_0``.

Presto C++ Support
^^^^^^^^^^^^^^^^^^

Presto C++ supports reading Parquet data written with Parquet writer version V1.
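
As a sketch of the config-property route described in the excerpt above, the writer version could be pinned in the Iceberg catalog file. The property name and its valid values come from the excerpt; the file path is an assumption for illustration.

```properties
# etc/catalog/iceberg.properties (hypothetical path)

# Valid values per the excerpt: PARQUET_1_0 (the default) and PARQUET_2_0.
# PARQUET_1_0 is shown because the excerpt documents Presto C++ as reading
# Parquet data written with writer version V1.
hive.parquet.writer.version=PARQUET_1_0
```

The session property `parquet_writer_version` named in the excerpt is the per-session alternative; assuming the catalog is named `iceberg`, it would be set with `SET SESSION iceberg.parquet_writer_version = 'PARQUET_1_0'`.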

Contributor

I see that across these Presto C++ Support sections we have varying forms

... is supported in Presto C++

and

Presto C++ supports ....

Two comments on this

  1. We should be consistent
  2. I prefer that we standardize on the 2nd as it's in active voice, which is generally the form we want to write for our documentation (@steveburnett can you confirm the 2nd is the form that we should prefer?)

Contributor

Yes, we should prefer the use of active voice, but exceptions where passive voice is better can happen. Sometimes active voice can be more awkward depending on the emphasis you want to communicate.

See

  • the GitLab Documentation Style Guide entry for Active Voice (shorter)
  • the Google Developer Documentation Style Guide entry for Active Voice (longer and more examples)

tl;dr: use active voice, but if active voice is awkward then passive voice is okay.

Contributor

In this context of the "Presto C++ Support" subtopics, I would argue that passive voice is better.

  • The statements immediately follow the heading "Presto C++ Support", so "Presto C++ supports" repeats the information without adding to it, and

  • Passive voice here follows the guideline in the Google Developer Documentation Style Guide entry for Active Voice "To emphasize an object over an action."

Using passive voice here emphasizes the object "HIVE, NESSIE, REST, and HADOOP Iceberg catalogs" and makes it easier for the user to find the information they need.

See the screenshots I made in a local doc build to show the difference. I think the second one is the better one.

[Two screenshots from a local doc build, captured 2025-04-08 at 10:20 AM; images not preserved]

Contributor

@steveburnett steveburnett left a comment

Thanks for the change and the new consistency! One nit.

Contributor

@ZacBlanco ZacBlanco left a comment

Thanks for the changes!

Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Presto Documentation Apr 8, 2025
@steveburnett steveburnett merged commit 7196add into prestodb:master Apr 8, 2025
94 checks passed
@agrawalreetika agrawalreetika deleted the iceberg-native-doc branch April 8, 2025 18:40
Labels
docs from:IBM PR from IBM
Projects
Status: Done
8 participants