Skip to content

Implement PoC block allocation for count accumulator #15642

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

Dandandan
Copy link
Contributor

Which issue does this PR close?

PoC to show a simple method for allocating in blocks

  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions functions Changes to functions implementation labels Apr 8, 2025
@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 8, 2025

@Rachelint PoC for accumulators (I also have some PoC for block-allocated GroupValues)

I also made some separate changes for outputting the output from the state in batches, but left it out to keep the changes small.

@Dandandan Dandandan closed this Apr 8, 2025
@Dandandan Dandandan reopened this Apr 8, 2025
@@ -18,6 +18,7 @@
use ahash::RandomState;
use datafusion_common::stats::Precision;
use datafusion_expr::expr::WindowFunction;
use datafusion_expr::groups_accumulator::BLOCK_SIZE;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be based on batch size as well.


// Count is always non null (null inputs just don't contribute to the overall values)
let nulls = None;
let array = PrimitiveArray::<Int64Type>::new(counts.into(), nulls);
// TODO: support emitting batches
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

evaluate and state could be supported to return Result<Vec<ArrayRef>> and Result<Vec<Vec<ArrayRef>>> although this is making a quite large breaking change.

@Rachelint
Copy link
Contributor

Rachelint commented Apr 8, 2025

@Rachelint PoC for accumulators (I also have some PoC for block-allocated GroupValues)

I also made some separate changes for outputting the output from the state in batches, but left it out to keep the changes small.

It seems similar as what was done in #11943 ?

The problem I found after attempt in POC #11943 is we need introduce a row-level operation to get block index and group index from the packed index like https://github.com/apache/datafusion/pull/11943/files#diff-c7c3260048c2f8889d1bde1155bbc252718f07497b25b91f7d2be26b8e2ebf58R464-R465

It still work well when we enable block approach but it will lead to regression when disabling the optimization... So I think if we should avoid such row-level operation in #15591 ?

@Rachelint
Copy link
Contributor

Rachelint commented Apr 8, 2025

Really thanks.

I check the old sketch again, and found it is easy to avoid regression for disabling the optimization cases.

Maybe it is actually too early to consider the cost of packed index because it still can get non-trivial performance improvement according the benchmark result.

I plan to:

  • Just make the old sketch can work again. And improve it to avoid regression for disabling cases.
  • Try to avoid the cost of packed index in follow up prs.

@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 9, 2025

Just make the old sketch can work again. And improve it to avoid regression for disabling cases

The plan sounds good!

I think this PoC mainly shows that accumulators / Group state might be changed individually (without changing other parts).

If there is a part of your code that shows improvement time and memory wise, we should take it and create some follow-up tickets!

@Dandandan
Copy link
Contributor Author

Let's close it for now - I'll be looking if I can contribute part of this later.

@Dandandan Dandandan closed this Apr 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants