-
Notifications
You must be signed in to change notification settings - Fork 77
Lazy Statistics #1656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Lazy Statistics #1656
Conversation
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/ValueColumnImpl.kt
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/ValueColumnImpl.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/aggregation/aggregators/Aggregator.kt
Outdated
Show resolved
Hide resolved
.../src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/aggregation/aggregators/Aggregators.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/ValueColumnImpl.kt
Outdated
Show resolved
Hide resolved
| internal value class StatisticResult(val value: Any?) | ||
|
|
||
| internal interface ValueColumnInternal<T> : ValueColumn<T> { | ||
| val statistics: MutableMap<String, MutableMap<Map<String, Any>, StatisticResult>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit dangerous to expose a mutable map, especially one as complicated as this one, for other parts of the library to modify.
I would move the logic of getting/storing statistics here and only call those functions in Aggregators.kt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this way ValueColumnInternal could only expose functions like putStatisticCache(name, arguments, value), and getStatisticCacheOrNull(name, arguments) and make the MutableMap private inside ValueColumnImpl. It's less bug prone that way :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it allows to you avoid names like desiredStatisticNotConsideringParameters which I have difficulties with comprehending ;P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit dangerous to expose a mutable map, especially one as complicated as this one, for other parts of the library to modify.
I would move the logic of getting/storing statistics here and only call those functions in Aggregators.kt
I'm not sure I understood. At the moment getting/storing statistics is done in AggregatorAggregationHandler.kt
(aggregateSingleColumn) and in that same function there is also the logic of getting/storing.
The logic should be moved in ValueColumnInternal but why should the functions be called in Aggregators.kt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this way
ValueColumnInternalcould only expose functions likeputStatisticCache(name, arguments, value), andgetStatisticCacheOrNull(name, arguments)and make the MutableMap private insideValueColumnImpl. It's less bug prone that way :)
If the mutable map is private inside ValueColumnImpl, how could I access that field inside ValueColumnInternal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe statisticsCache could be a Map instead of MutableMap and getting/storing functions in ValueColumnInternal could eventually make a cast to MutableMap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic should be moved in ValueColumnInternal
No, into ValueColumnImpl. Interfaces describe the structure of a class, an API without implementation if you will. They should not implement something or dictate how it should be implemented.
So just something like:
internal interface ValueColumnInternal<T> : ValueColumn<T> {
fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult)
fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult?
}Then if ValueColumnImpl wants to implement ValueColumnInternal it will need to override those functions and write the logic for it. ValueColumnImpl can then hold the actual Map with the statistic cache but keep it private.
We also have a convention in DataFrame if you add internal functions to a public concept, like ValueColumn, to add an extension function like: internal fun <T> ValueColumn<T>.internal(): ValueColumnInternal<T> = this as ValueColumn<T>. That way you can simply call someValueCol.internal().putStatisticCache(...) from elsewhere in the internal parts of the library, like in the aggregators :)
Is that a bit clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explenation, now this is very clear.
I'm still a bit confused about the following statement :
I would move the logic of getting/storing statistics here and only call those functions in Aggregators.kt
Why should these function be called in Aggregators.kt? Should not they be called in aggregateSingleColumn which is the only function that need that logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I'm sorry for the confusion. I indeed meant aggregateSingleColumn :) so AggregatorAggregationHandler.kt
| fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult? | ||
| } | ||
|
|
||
| internal fun <T> ValueColumn<T>.internal() = this as ValueColumnInternal<T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean, "it breaks for ValueColumnWithParent"? Ah, probably because ValueColumnWithParent only implements ValueColumn... Instead of removing this overload (which doesn't fix the issue), you should better make sure all implementations of ValueColumn also implement ValueColumnInternal :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise, only ValueColumns which specifically use the ValueColumnImpl implementation will work with this statistics cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean, "it breaks for
ValueColumnWithParent"?
Compilation problem in ValueColumnWithParent.kt due to the following lines;
override fun changeType(type: KType) =
ValueColumnWithParent(parent, source.internal().changeType(type).asValueColumn())
Fixed version of #1636