Skip to content

feat: Add basic operations for UpdateSchema #1172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jonathanc-n
Copy link
Contributor

Which issue does this PR close?

What changes are included in this PR?

Added basic functionality to UpdateSchema. Wanted to split it up in two parts.

Are these changes tested?

@jonathanc-n
Copy link
Contributor Author

@Fokko @Xuanwo @liurenjie1024 This should be ready for review

Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jonathanc-n , thanks for the work! I've left some comments. Also there are some implementations missing:

  • functions like move and union_schema
  • apply logic to commit changes to the schema
    Do you plan to address them in this PR?

@@ -21,6 +21,7 @@ use std::collections::{HashMap, HashSet};
use std::fmt::{Display, Formatter};
use std::sync::Arc;

mod update;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably name it update_schema.rs avoid confusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is best practice to infer this from the folder name. For example codebases such as Datafusion or iceberg-rust, files named metadata are just called metadata.rs under different folders (ex. manifest, puffin, etc.) Unless it becomes ambiguous with other names in the same folder

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very new to rust naming convention, thanks for the context! In this case, probably schema.rs is a better name?

I'm adding a new file update_statistics.rs under the same folder in this PR: #1359 I'll probably rename it to statistics.rs, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No the folder gives context, so schema/update.rs -> updating schema. with transaction/update_statistics, there is no context given to what the file is doing if it is called statistics.rs, so update_statistics is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, thanks for the explanation!

/// This method returns a reference to `Self` to allow for method chaining.
fn add_column(
&mut self,
column_name: Vec<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any considerations that we do not want to take the column name string and then find the parent via schema like iceberg-java?

I did notice that iceberg-python followed this pattern, and would love to understand the context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe to avoid having dot in names being a problem, not sure if there is another reason though.

column_name: Vec<String>,
field_type: Type,
doc: Option<String>,
required: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a recent PR to support default values in UpdateSchema, it would be good to port that to iceberg-rs as well: apache/iceberg@602c35a

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would also be nice in a follow up pull request. I created an issue for that here.

/// # Returns
///
/// An empty Ok(()) on success.
pub fn set_column_requirement(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to make this private and add APIs like make_column_optional to call it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just changing the function name would be fine? what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of something like

pub fn require_column(col_name) { set_column_requirement(col_name, true) }

pub fn make_column_optional(col_name) {set_column_requirement(col_name, false)} 

fn set_column_requirement { // this function }


#[allow(dead_code)]
#[derive(Debug)]
pub struct Move {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like move related functions are not implemented yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this will be implemented in a follow up pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants