sumLong bug in ColumnStats.scala and TestTableStatsSinglePathMain.scala  #2

Open
@BrentDorsey

Thanks for sharing, this performs significantly better than what I was using! While validating the getFirstPassStat statistics on our data, I discovered a sumLong bug in ColumnStats.scala, Part B.1.1.

ColumnStats.scala - Because sumLong is accumulated after the reduce, it adds each unique value in the column once instead of summing every value in the column. The fix is to multiply each unique value by the number of times it appears in the partition.

Bug: sumLong += colLongValue
Fix: sumLong += (colLongValue * colCount)
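
To make the difference concrete, here is a minimal standalone sketch, not the actual ColumnStats.scala code. The object name SumLongSketch, the valueCounts map, and the two helper functions are hypothetical; the map stands in for the (value, count) pairs that remain after the reduce, and the age values are the ones from the sample data discussed below.

```scala
// Hypothetical stand-in for the post-reduce state; not the project's code.
object SumLongSketch {
  // age values from the sample data: 20 appears four times, 10 and 8 once each
  val ages: Seq[Long] = Seq(20L, 20L, 20L, 20L, 10L, 8L)

  // After the reduce, each distinct value is paired with its occurrence count.
  val valueCounts: Map[Long, Long] =
    ages.groupBy(identity).map { case (v, vs) => v -> vs.size.toLong }

  // Buggy accumulation: each distinct value is added once.
  def buggySum(counts: Map[Long, Long]): Long =
    counts.foldLeft(0L) { case (sumLong, (colLongValue, _)) =>
      sumLong + colLongValue
    }

  // Fixed accumulation: each distinct value is weighted by its count.
  def fixedSum(counts: Map[Long, Long]): Long =
    counts.foldLeft(0L) { case (sumLong, (colLongValue, colCount)) =>
      sumLong + colLongValue * colCount
    }
}
```

With these inputs, SumLongSketch.buggySum(SumLongSketch.valueCounts) gives 38 and SumLongSketch.fixedSum(SumLongSketch.valueCounts) gives 98.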

The following else if adds support for Double:

else if (colValue.isInstanceOf[Double]) {
val colDoubleValue = colValue.asInstanceOf[Double]
if (maxDouble < colDoubleValue) maxDouble = colDoubleValue
if (minDouble > colDoubleValue) minDouble = colDoubleValue
sumDouble += (colDoubleValue * colCount)
}
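
For anyone who wants to try the Double branch in isolation, the following is a rough self-contained sketch. The DoubleStatsSketch object, the DoubleStats case class, the accumulate function, and the initial values of maxDouble and minDouble are all assumptions made for illustration; only the variable names and the count-weighted sum come from the snippet above.

```scala
// Hypothetical harness around the Double branch; not the project's code.
object DoubleStatsSketch {
  final case class DoubleStats(max: Double, min: Double, sum: Double)

  // Takes (value, count) pairs, as they would look after the reduce.
  def accumulate(valueCounts: Seq[(Any, Long)]): DoubleStats = {
    // Assumed starting points so the first Double seen replaces both.
    var maxDouble = Double.MinValue
    var minDouble = Double.MaxValue
    var sumDouble = 0.0

    for ((colValue, colCount) <- valueCounts) colValue match {
      case colDoubleValue: Double =>
        if (maxDouble < colDoubleValue) maxDouble = colDoubleValue
        if (minDouble > colDoubleValue) minDouble = colDoubleValue
        // Weight by the count, as in the sumLong fix.
        sumDouble += colDoubleValue * colCount
      case _ => () // non-Double values are ignored in this sketch
    }
    DoubleStats(maxDouble, minDouble, sumDouble)
  }
}
```

For example, DoubleStatsSketch.accumulate(Seq((1.5, 2L), (4.0, 1L), ("x", 3L))) returns DoubleStats(4.0, 1.5, 7.0).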

TestTableStatsSinglePathMain.scala - Because all the id values are unique, the existing sumLong assertion doesn't catch the bug. Adding the following sumLong assertion for the age column exposes it:

assertResult(98L)(firstPassStats.columnStatsMap(2).sumLong)

The test then fails with:

  • run table stats on sample data *** FAILED *** Expected 98, but got 38

98 = 20 + 20 + 20 + 20 + 10 + 8
38 = 20 + 10 + 8
