-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: Implement Spark function space
#19610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if args.args.len() != 1 { | ||
| return plan_err!("space expects exactly 1 argument"); | ||
| } | ||
| make_scalar_function(spark_space, vec![])(&args.args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that space is quite often invoked with a literal argument, e.g., space(2). The current implementation converts the scalar value into an array for each batch and then calls " ".repeat(m as usize) for each row, resulting in the same string being allocated each time. This could be optimized in the scalar case to build the string once rather than per row and then append the same string to the builder repeatedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, the return type in this case could be a scalar not an array
| if m < 0 { | ||
| String::new() | ||
| } else { | ||
| " ".repeat(m as usize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allocates a new string per row, which is discarded after being appended to the array. It would be more efficient for this function to build the string buffer and offsets directly.
| # under the License. | ||
|
|
||
| query T | ||
| SELECT space(1::INT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests that use array inputs. The tests currently only cover scalar inputs.
Fix slt
fix slt
fix slt
fix slt
|
@andygrove thanks for the review, the benchmark test result in Comet is still not ideal.
|
|
|
||
| fn spark_space_array_inner(array: &Int32Array) -> StringArray { | ||
| let values = array.values(); | ||
| let data_capacity = values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kazantsev-maksim just thinking aloud if we can iterate values only once?
fn spark_space_array_inner(array: &Int32Array) -> StringArray {
let mut builder = StringBuilder::new(array.len());
let mut space_buf = String::new();
for v in array.iter() {
match v {
None => builder.append_null(),
Some(l) if *l > 0 => {
let l = *l as usize;
if space_buf.len() < l {
space_buf = " ".repeat(l);
}
builder.append_value(&space_buf[..l]);
}
Some(_) => builder.append_value(""),
}
}
builder.finish()
}
something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@comphead thanks, performance has improved!
|
Thanks @kazantsev-maksim and @andygrove for the review |

Which issue does this PR close?
Rationale for this change
Add new function: https://spark.apache.org/docs/latest/api/sql/index.html#space
What changes are included in this PR?
Are these changes tested?
Yes, tests added as part of this PR.
Are there any user-facing changes?
No, these are new function.