-
Notifications
You must be signed in to change notification settings - Fork 260
feat: Add basic operations for UpdateSchema
#1172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@Fokko @Xuanwo @liurenjie1024 This should be ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jonathanc-n , thanks for the work! I've left some comments. Also there are some implementations missing:
- functions like
move
andunion_schema
apply
logic to commit changes to the schema
Do you plan to address them in this PR?
@@ -21,6 +21,7 @@ use std::collections::{HashMap, HashSet}; | |||
use std::fmt::{Display, Formatter}; | |||
use std::sync::Arc; | |||
|
|||
mod update; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably name it update_schema.rs
avoid confusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is best practice to infer this from the folder name. For example codebases such as Datafusion or iceberg-rust, files named metadata are just called metadata.rs
under different folders (ex. manifest, puffin, etc.) Unless it becomes ambiguous with other names in the same folder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very new to rust naming convention, thanks for the context! In this case, probably schema.rs
is a better name?
I'm adding a new file update_statistics.rs
under the same folder in this PR: #1359 I'll probably rename it to statistics.rs
, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No the folder gives context, so schema/update.rs -> updating schema. with transaction/update_statistics, there is no context given to what the file is doing if it is called statistics.rs, so update_statistics is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, thanks for the explanation!
/// This method returns a reference to `Self` to allow for method chaining. | ||
fn add_column( | ||
&mut self, | ||
column_name: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any considerations that we do not want to take the column name string and then find the parent via schema like iceberg-java?
I did notice that iceberg-python followed this pattern, and would love to understand the context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe to avoid having dot in names being a problem, not sure if there is another reason though.
column_name: Vec<String>, | ||
field_type: Type, | ||
doc: Option<String>, | ||
required: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a recent PR to support default values in UpdateSchema, it would be good to port that to iceberg-rs as well: apache/iceberg@602c35a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would also be nice in a follow up pull request. I created an issue for that here.
/// # Returns | ||
/// | ||
/// An empty Ok(()) on success. | ||
pub fn set_column_requirement( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to make this private and add APIs like make_column_optional
to call it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just changing the function name would be fine? what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of something like
pub fn require_column(col_name) { set_column_requirement(col_name, true) }
pub fn make_column_optional(col_name) {set_column_requirement(col_name, false)}
fn set_column_requirement { // this function }
|
||
#[allow(dead_code)] | ||
#[derive(Debug)] | ||
pub struct Move { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like move
related functions are not implemented yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this will be implemented in a follow up pull request.
Which issue does this PR close?
SchemaUpdate
logic to Iceberg-Rust #697What changes are included in this PR?
Added basic functionality to UpdateSchema. Wanted to split it up in two parts.
Are these changes tested?