
Feat: support bit_count function #1602


Open: wants to merge 14 commits into base: main

Conversation


@kazantsev-maksim kazantsev-maksim commented Apr 2, 2025

Which issue does this PR close?

Related to Epic: #240
bit_count: SELECT bit_count(0) => 0
DataFusion Comet's bit_count has the same behavior as Spark's bit_count function.
Spark: https://spark.apache.org/docs/latest/api/sql/index.html#bit_count

Closes #.
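As a semantics note (a sketch, not code from this PR): Spark's bit_count returns the number of set bits in the two's-complement representation of an integral value, which matches Rust's built-in `count_ones` on signed integers:

```rust
// Sketch of Spark bit_count semantics using Rust's popcount intrinsic.
// Negative inputs count the set bits of the two's-complement representation.
fn bit_count(v: i64) -> u32 {
    v.count_ones()
}

fn main() {
    assert_eq!(bit_count(0), 0);   // SELECT bit_count(0) => 0
    assert_eq!(bit_count(7), 3);   // 0b111
    assert_eq!(bit_count(-1), 64); // all 64 bits set in two's complement
    println!("ok");
}
```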

Rationale for this change

Defined under Epic: #240

What changes are included in this PR?

bitwise_count.rs: implementation of the bit_count function
planner.rs: maps Spark's bit_count expression to the DataFusion Comet bit_count physical expression
expr.proto: bit_count has been added
QueryPlanSerde.scala: a bit_count pattern-matching case has been added
CometExpressionSuite.scala: a new unit test has been added for the bit_count function

How are these changes tested?

A new UT has been added.

val table = "bitwise_count_test"
withTable(table) {
  sql(s"create table $table(col1 long, col2 int, col3 short, col4 byte) using parquet")
  sql(s"insert into $table values(1111, 2222, 17, 7)")
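For reference, the expected bit_count values for the row inserted above can be checked with Rust's popcount (a sanity check, not Comet code):

```rust
fn main() {
    // Row inserted above: (1111, 2222, 17, 7)
    assert_eq!((1111i64).count_ones(), 6); // col1: 1024+64+16+4+2+1
    assert_eq!((2222i32).count_ones(), 6); // col2: 2222 = 1111 << 1, same popcount
    assert_eq!((17i16).count_ones(), 2);   // col3: 0b10001
    assert_eq!((7i8).count_ones(), 3);     // col4: 0b111
    println!("expected bit counts: 6, 6, 2, 3");
}
```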
Contributor:

Do you mind adding random number cases?

Contributor Author:

Yep. Added tests with random data.

Contributor:

Can we also add a test with a Parquet file not created by Spark (see makeParquetFileAllTypes) and see if this reports correct results with unsigned int columns?


codecov-commenter commented Apr 2, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 58.21%. Comparing base (f09f8af) to head (6c09a10).
Report is 191 commits behind head on main.

Files with missing lines                              | Patch % | Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala | 80.00%  | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1602      +/-   ##
============================================
+ Coverage     56.12%   58.21%   +2.08%     
- Complexity      976     1108     +132     
============================================
  Files           119      130      +11     
  Lines         11743    12703     +960     
  Branches       2251     2377     +126     
============================================
+ Hits           6591     7395     +804     
- Misses         4012     4122     +110     
- Partials       1140     1186      +46     

@kazuyukitanimura kazuyukitanimura left a comment


Sorry that this one has dragged on so long; I hope we can merge soon, but one more question:

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
    let arg = self.arg.evaluate(batch)?;
    match arg {
        ColumnarValue::Array(array) => {
Contributor:

Any possibility that we get a dictionary?

Contributor Author:

It seems impossible to me, but if you have a suggestion for how this can be tested, I'm ready to do it.

Contributor:

You will need many repeated values in order to create dictionary values. If you use something like https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L940, you should be able to test it.
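To illustrate the reviewer's point (a generic sketch, not arrow-rs code): a dictionary-encoded column stores each distinct value once plus a per-row key into that dictionary, so an expression like bit_count only needs to evaluate each dictionary entry once:

```rust
fn main() {
    // A column with many repeated values, like the helper linked above
    // produces, would typically end up dictionary-encoded.
    let col = vec![7i64, 7, 17, 7, 17, 7];

    // Dictionary encoding: distinct values plus per-row keys into them.
    let mut dict: Vec<i64> = Vec::new();
    let keys: Vec<usize> = col
        .iter()
        .map(|v| {
            dict.iter().position(|d| d == v).unwrap_or_else(|| {
                dict.push(*v);
                dict.len() - 1
            })
        })
        .collect();

    // bit_count runs once per dictionary entry, then is gathered by key.
    let dict_counts: Vec<u32> = dict.iter().map(|v| v.count_ones()).collect();
    let results: Vec<u32> = keys.iter().map(|&k| dict_counts[k]).collect();

    assert_eq!(dict, vec![7, 17]);
    assert_eq!(results, vec![3, 3, 2, 3, 2, 3]);
    println!("dictionary-encoded bit_count ok");
}
```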


mbutrovich commented Apr 30, 2025

> DataFusion Comet's bit_count has the same behavior as Spark's bit_count function

If this is the case, can we delegate to a ScalarFunc expression instead of creating a new one? Similar to:
https://github.com/apache/datafusion-comet/pull/1700/files#diff-602f1658f4020a9dc8a47f49ac1411735d56d6612aa971751435104e301a9035

Kazantsev Maksim added 3 commits May 1, 2025 18:09
@kazantsev-maksim

@mbutrovich I couldn't find a built-in implementation of bit_count in the DataFusion project, but I rewrote it using ScalarFunc without adding a new proto expr.

let operand = $OPERAND
    .as_any()
    .downcast_ref::<$DT>()
    .expect("compute_op failed to downcast array");
Contributor:

Perhaps report $DT as well in case of failure?
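One way to act on this suggestion (a hedged sketch with illustrative names, not the PR's actual macro) is to fold the target type into the panic message with stringify!:

```rust
use std::any::Any;

// Illustrative macro: report the target type when the downcast fails,
// instead of a generic "failed to downcast array" message.
macro_rules! downcast_or_panic {
    ($operand:expr, $dt:ty) => {
        $operand.downcast_ref::<$dt>().unwrap_or_else(|| {
            panic!(
                "compute_op failed to downcast array to {}",
                stringify!($dt)
            )
        })
    };
}

fn main() {
    let operand: Box<dyn Any> = Box::new(1111i64);
    let value = downcast_or_panic!(operand, i64);
    assert_eq!(value.count_ones(), 6);
    println!("downcast to i64 ok");
}
```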

Kazantsev Maksim added 2 commits May 14, 2025 09:47
5 participants