sumLong bug in ColumnStats.scala and TestTableStatsSinglePathMain.scala #2

Open
BrentDorsey opened this issue Mar 5, 2016 · 0 comments

Thanks for sharing, this performs significantly better than what I was using! While validating the getFirstPassStat statistics on our data, I discovered a sumLong bug in ColumnStats.scala, Part B.1.1.

ColumnStats.scala - Because the sumLong calculation happens after the reduce, it returns the sum of the unique values in the column instead of the sum of all the values in the column. The fix is to multiply each unique column value by the number of times it appears in the partition.

Bug: sumLong += colLongValue
Fix: sumLong += (colLongValue * colCount)
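
To illustrate the difference, here is a minimal sketch that is not taken from ColumnStats.scala. It assumes the data after the reduce is a sequence of (distinct value, count) pairs; the object and function names are hypothetical.

```scala
// Hypothetical stand-in for the post-reduce data:
// each distinct column value paired with its occurrence count.
object SumLongSketch {
  // Buggy behaviour: each distinct value is added once and its count is ignored.
  def buggySum(valueCounts: Seq[(Long, Long)]): Long =
    valueCounts.map { case (value, _) => value }.sum

  // Fixed behaviour: each distinct value is weighted by its count.
  def fixedSum(valueCounts: Seq[(Long, Long)]): Long =
    valueCounts.map { case (value, count) => value * count }.sum
}
```

With the pairs (5, 3) and (7, 1), buggySum gives 5 + 7 = 12, while fixedSum gives 5 * 3 + 7 * 1 = 22.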

The following else if adds support for Double:

else if (colValue.isInstanceOf[Double]) {
  val colDoubleValue = colValue.asInstanceOf[Double]
  if (maxDouble < colDoubleValue) maxDouble = colDoubleValue
  if (minDouble > colDoubleValue) minDouble = colDoubleValue
  sumDouble += (colDoubleValue * colCount)
}

TestTableStatsSinglePathMain.scala - Because all the id values are unique, the existing sumLong assertion doesn't catch the bug. Adding the following sumLong assertion for age:

assertResult(98L)(firstPassStats.columnStatsMap(2).sumLong)

makes the test fail with:

  • run table stats on sample data *** FAILED *** Expected 98, but got 38

98 = 20 + 20 + 20 + 20 + 10 + 8
38 = 20 + 10 + 8
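
The arithmetic above, and the reason the id column hides the bug, can be checked with a small standalone sketch. The age values come from the sums above; the id values and all names here are made up for illustration and are not from the project's test data.

```scala
object AgeSumCheck {
  // Age values from the arithmetic above.
  val ages: Seq[Long] = Seq(20L, 20L, 20L, 20L, 10L, 8L)
  // Hypothetical ids: any all-unique column behaves the same way.
  val ids: Seq[Long] = Seq(1L, 2L, 3L, 4L, 5L, 6L)

  // Collapse a column into (distinct value -> count), mimicking the reduce step.
  def valueCounts(column: Seq[Long]): Map[Long, Long] =
    column.groupBy(identity).map { case (v, occurrences) => v -> occurrences.size.toLong }

  // Sum of distinct values only (the buggy result).
  def sumDistinct(column: Seq[Long]): Long = valueCounts(column).keys.sum

  // Sum of distinct values weighted by their counts (the intended result).
  def sumWeighted(column: Seq[Long]): Long =
    valueCounts(column).map { case (v, n) => v * n }.sum
}
```

For ages, sumDistinct is 38 and sumWeighted is 98. For an all-unique column such as the ids, both are equal, so an assertion on that column alone cannot tell the two apart.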
