Lazy statistics for ValueColumn #1636
base: master
Changes from all commits
9e986ce
276c9be
bacc395
4d5d714
bedea0e
c0adc08
38b26c3
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,14 +8,26 @@ import org.jetbrains.kotlinx.dataframe.columns.ValueColumn | |
| import kotlin.reflect.KType | ||
| import kotlin.reflect.full.withNullability | ||
|
|
||
| public class WrappedStatistic( | ||
|
Collaborator
this class should not be public, should it? Also, I think, if you make the other a |
||
| public var wasComputedSkippingNaN: Boolean = false, | ||
| public var wasComputedNotSkippingNaN: Boolean = false, | ||
| public var statisticComputedSkippingNaN: Any? = null, | ||
| public var statisticComputedNotSkippingNaN: Any? = null, | ||
| ) | ||
|
Collaborator
what about |
||
|
|
||
| internal interface ValueColumnInternal<T> : ValueColumn<T> { | ||
| val max: WrappedStatistic | ||
|
Collaborator
I would make this a |
||
| } | ||
|
|
||
| internal open class ValueColumnImpl<T>( | ||
| values: List<T>, | ||
| name: String, | ||
| type: KType, | ||
| val defaultValue: T? = null, | ||
| distinct: Lazy<Set<T>>? = null, | ||
| ) : DataColumnImpl<T>(values, name, type, distinct), | ||
| ValueColumn<T> { | ||
| ValueColumn<T>, | ||
| ValueColumnInternal<T> { | ||
|
|
||
| override fun distinct() = ValueColumnImpl(toSet().toList(), name, type, defaultValue, distinct) | ||
|
|
||
|
|
@@ -48,10 +60,13 @@ internal open class ValueColumnImpl<T>( | |
| override fun defaultValue() = defaultValue | ||
|
|
||
| override fun forceResolve() = ResolvingValueColumn(this) | ||
|
|
||
| override val max = WrappedStatistic() | ||
| } | ||
|
|
||
| internal class ResolvingValueColumn<T>(override val source: ValueColumn<T>) : | ||
| ValueColumn<T> by source, | ||
| ValueColumnInternal<T>, | ||
| ForceResolvedColumn<T> { | ||
|
|
||
| override fun resolve(context: ColumnResolutionContext) = super<ValueColumn>.resolve(context) | ||
|
|
@@ -70,4 +85,6 @@ internal class ResolvingValueColumn<T>(override val source: ValueColumn<T>) : | |
| override fun equals(other: Any?) = source.checkEquals(other) | ||
|
|
||
| override fun hashCode(): Int = source.hashCode() | ||
|
|
||
| override val max = WrappedStatistic() | ||
| } | ||
|
Collaborator
please remove this from the commit |
I'm not so fond of this solution, as it requires a lot of refactoring in other functions, plus it does not work when you write `df.max { myCol }`, as I mentioned in #1492 (comment).
Instead, I'd do this check inside the original `aggregateSingleColumn()`. Each `Aggregator` has a `name` which you could use to query the `ValueColumnInternal` for the right `WrappedStatistic` if they are stored in a `Map<String, WrappedStatistic>` in `ValueColumnImpl`. Though I suppose each `Aggregator` will also need to store any other provided arguments like `skipNaN: Boolean` and `percentile: Double` when needed... In a `Map<String, Any?>` maybe?

That way we could store our "Statistics Cache" in `ValueColumnImpl` as a … so the result cache could look like: …
The challenge may lie in doing this neatly ;P
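
A minimal sketch of the cache idea described above, under loud assumptions: `StatisticsCache` and `getOrCompute` are hypothetical names that do not exist in the library, and the nested map type is one possible reading of the suggestion. The outer key is the aggregator name, the inner key is the map of extra arguments such as `skipNaN`.

```kotlin
// Hypothetical sketch: StatisticsCache and getOrCompute are illustrative names,
// not part of the DataFrame library.
class StatisticsCache {
    // aggregator name -> (aggregator arguments -> cached result)
    private val cache = mutableMapOf<String, MutableMap<Map<String, Any?>, Any?>>()

    fun getOrCompute(
        aggregatorName: String,
        parameters: Map<String, Any?>,
        compute: () -> Any?,
    ): Any? {
        val perAggregator = cache.getOrPut(aggregatorName) { mutableMapOf() }
        // Plain null check: a cached null result looks like a miss and is recomputed.
        return perAggregator[parameters] ?: compute().also { perAggregator[parameters] = it }
    }
}

fun main() {
    val values = listOf(1.0, Double.NaN, 3.0)
    val cache = StatisticsCache()
    var computations = 0
    val maxSkippingNaN = {
        computations++
        values.filterNot { it.isNaN() }.maxOrNull()
    }

    val first = cache.getOrCompute("max", mapOf("skipNaN" to true), maxSkippingNaN)
    val second = cache.getOrCompute("max", mapOf("skipNaN" to true), maxSkippingNaN)
    println("$first $second $computations") // prints "3.0 3.0 1"
}
```

Keying the inner map by the argument map keeps results for `skipNaN = true` and `skipNaN = false` apart without a dedicated field per combination.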
Thank you for reviewing!
IMO, using `Map<String, Map<Map<String, Any?>, Any?>>` introduces a problem: querying this structure does not tell you whether the statistic was computed in the past.
Computing the stat using `aggregateSequence` implies that the stat can be null, so making a query and getting null does not tell me whether the stat was computed yet.

Maybe it could be `Map<String, Map<Map<String, Any?>, WrappedStatistic>>`, where `WrappedStatistic` has two fields: `wasComputed: Boolean` and `actualStatistic: Any?`?
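
A rough sketch of the wrapper proposed in this comment, to show how it separates "not computed yet" from "computed, and the result is null". The field names `wasComputed` and `actualStatistic` come from the comment; `StatisticsCache`, `lookup`, and `store` are hypothetical names used only for illustration, and this `WrappedStatistic` is not the four-field class from the diff.

```kotlin
// Sketch only: field names follow the comment above; everything else is hypothetical.
data class WrappedStatistic(val wasComputed: Boolean, val actualStatistic: Any?)

class StatisticsCache {
    // aggregator name -> (aggregator arguments -> wrapped result)
    private val cache = mutableMapOf<String, MutableMap<Map<String, Any?>, WrappedStatistic>>()

    // Returns a wrapper with wasComputed = false when nothing has been stored yet.
    fun lookup(name: String, parameters: Map<String, Any?>): WrappedStatistic =
        cache[name]?.get(parameters) ?: WrappedStatistic(wasComputed = false, actualStatistic = null)

    fun store(name: String, parameters: Map<String, Any?>, value: Any?) {
        val perAggregator = cache.getOrPut(name) { mutableMapOf() }
        perAggregator[parameters] = WrappedStatistic(wasComputed = true, actualStatistic = value)
    }
}

fun main() {
    val cache = StatisticsCache()
    val params = mapOf<String, Any?>("skipNaN" to true)

    println(cache.lookup("max", params).wasComputed) // prints "false"

    // The max of an empty column is null, but it still counts as computed.
    cache.store("max", params, emptyList<Double>().maxOrNull())
    val hit = cache.lookup("max", params)
    println("${hit.wasComputed} ${hit.actualStatistic}") // prints "true null"
}
```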