Contributor

@CarloMariaProietti CarloMariaProietti commented Dec 19, 2025

Fixed version of #1636

Fixes #1492
The idea is the following:
ValueColumnInternal is an internal interface that holds the statistic values, so they are not exposed in the public API.
Implementations of ValueColumnInternal contain the actual cache.

Each stat (for the moment only max) needs two caches, because computing the stat may give different results depending on the skipNaN boolean parameter.

I implemented the solution by adding an overload of aggregateSingleColumn. The overload wraps the original aggregateSingleColumn so that the caches can be used.
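As a rough illustration of the wrapping idea, here is a minimal sketch. It does not use the real DataFrame types: StatCache, maxUncached, and maxCached are hypothetical names invented for this example, and the cache key is simply the statistic name plus its parameters.

```kotlin
// Hypothetical stand-in for a per-column statistics cache (not the library's API).
class StatCache {
    // statistic name -> (parameters -> cached result)
    private val entries = mutableMapOf<String, MutableMap<Map<String, Any>, Any?>>()

    fun getOrCompute(name: String, args: Map<String, Any>, compute: () -> Any?): Any? {
        val perStat = entries.getOrPut(name) { mutableMapOf() }
        // containsKey distinguishes "cached null" from "not cached yet".
        if (!perStat.containsKey(args)) perStat[args] = compute()
        return perStat[args]
    }
}

// Counts how often the uncached computation actually runs.
var computations = 0

// Uncached aggregation: max of doubles, optionally ignoring NaN.
fun maxUncached(values: List<Double>, skipNaN: Boolean): Double? {
    computations++
    val considered = if (skipNaN) values.filterNot { it.isNaN() } else values
    if (considered.any { it.isNaN() }) return Double.NaN
    return considered.maxOrNull()
}

// Cached wrapper around the uncached function; skipNaN is part of the cache key.
fun maxCached(values: List<Double>, skipNaN: Boolean, cache: StatCache): Double? =
    cache.getOrCompute("max", mapOf("skipNaN" to skipNaN)) { maxUncached(values, skipNaN) } as Double?

fun main() {
    val values = listOf(1.0, Double.NaN, 3.0)
    val cache = StatCache()
    println(maxCached(values, skipNaN = true, cache))  // 3.0
    println(maxCached(values, skipNaN = false, cache)) // NaN
    println(maxCached(values, skipNaN = true, cache))  // 3.0, served from the cache
    println(computations)                              // 2
}
```

Because skipNaN is part of the key, the two parameter values get separate entries, which is the reason two caches per stat are needed.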

For the moment only max is covered, but it would be easy to do the same for min, sum, mean and median.
Something similar could be done for percentile and std.

internal value class StatisticResult(val value: Any?)

internal interface ValueColumnInternal<T> : ValueColumn<T> {
    val statistics: MutableMap<String, MutableMap<Map<String, Any>, StatisticResult>>

Collaborator

It's a bit dangerous to expose a mutable map, especially one as complicated as this one, for other parts of the library to modify.

I would move the logic of getting/storing statistics here and only call those functions in Aggregators.kt

Collaborator

This way ValueColumnInternal could expose only functions like putStatisticCache(name, arguments, value) and getStatisticCacheOrNull(name, arguments), with the MutableMap kept private inside ValueColumnImpl. It's less bug-prone that way :)

Collaborator

And it allows you to avoid names like desiredStatisticNotConsideringParameters, which I have difficulty comprehending ;P

Contributor Author

@CarloMariaProietti CarloMariaProietti Jan 6, 2026

> It's a bit dangerous to expose a mutable map, especially one as complicated as this one, for other parts of the library to modify.
>
> I would move the logic of getting/storing statistics here and only call those functions in Aggregators.kt

I'm not sure I understood. At the moment, statistics are read and stored in aggregateSingleColumn (AggregatorAggregationHandler.kt), and that same function also contains the getting/storing logic.
The logic should be moved into ValueColumnInternal, but why should the functions be called in Aggregators.kt?

Contributor Author

> This way ValueColumnInternal could expose only functions like putStatisticCache(name, arguments, value) and getStatisticCacheOrNull(name, arguments), with the MutableMap kept private inside ValueColumnImpl. It's less bug-prone that way :)

If the mutable map is private inside ValueColumnImpl, how could I access that field inside ValueColumnInternal?

Contributor Author

Maybe statisticsCache could be a Map instead of a MutableMap, and the getting/storing functions in ValueColumnInternal could cast it to MutableMap if needed?

Collaborator

@Jolanrensen Jolanrensen Jan 7, 2026

> The logic should be moved into ValueColumnInternal

No, into ValueColumnImpl. Interfaces describe the structure of a class, an API without implementation if you will. They should not implement something or dictate how it should be implemented.

So just something like:

internal interface ValueColumnInternal<T> : ValueColumn<T> {
    fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult)
    fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult?
}

Then if ValueColumnImpl wants to implement ValueColumnInternal it will need to override those functions and write the logic for it. ValueColumnImpl can then hold the actual Map with the statistic cache but keep it private.
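A minimal sketch of what such an implementation might look like is given below. ValueColumn and StatisticResult are simplified stand-ins defined only for this example (the real ValueColumn has many more members, and StatisticResult is a value class in the PR), and the field name statisticsCache is an assumption.

```kotlin
// Simplified stand-ins for the library's types (assumptions for this sketch).
interface ValueColumn<T> {
    val values: List<T>
}

data class StatisticResult(val value: Any?)

interface ValueColumnInternal<T> : ValueColumn<T> {
    fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult)
    fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult?
}

class ValueColumnImpl<T>(override val values: List<T>) : ValueColumnInternal<T> {
    // Private: other code can only reach the cache through the two functions below.
    private val statisticsCache =
        mutableMapOf<String, MutableMap<Map<String, Any>, StatisticResult>>()

    override fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult) {
        val perStat = statisticsCache.getOrPut(statName) { mutableMapOf() }
        perStat[arguments] = value
    }

    override fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult? =
        statisticsCache[statName]?.get(arguments)
}

fun main() {
    val column = ValueColumnImpl(listOf(1, 5, 3))
    val args = mapOf<String, Any>("skipNaN" to false)
    println(column.getStatisticCacheOrNull("max", args))        // null
    column.putStatisticCache("max", args, StatisticResult(column.values.maxOrNull()))
    println(column.getStatisticCacheOrNull("max", args)?.value) // 5
}
```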

We also have a convention in DataFrame: if you add internal functions to a public concept like ValueColumn, you also add an extension function like internal fun <T> ValueColumn<T>.internal(): ValueColumnInternal<T> = this as ValueColumnInternal<T>. That way you can simply call someValueCol.internal().putStatisticCache(...) from elsewhere in the internal parts of the library, like in the aggregators :)
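To make the convention concrete, here is a small sketch with stand-in types. The factory columnOf and the pair-keyed cache are hypothetical simplifications for this example. The only point is that callers hold the public ValueColumn type and reach the internal functions through internal().

```kotlin
// Simplified stand-ins for the library's types (assumptions for this sketch).
interface ValueColumn<T> {
    val values: List<T>
}

data class StatisticResult(val value: Any?)

internal interface ValueColumnInternal<T> : ValueColumn<T> {
    fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult)
    fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult?
}

// The convention: one extension that casts the public type to its internal counterpart.
internal fun <T> ValueColumn<T>.internal(): ValueColumnInternal<T> = this as ValueColumnInternal<T>

internal class ValueColumnImpl<T>(override val values: List<T>) : ValueColumnInternal<T> {
    // Keyed by (statistic name, arguments) to keep the sketch short.
    private val cache = mutableMapOf<Pair<String, Map<String, Any>>, StatisticResult>()

    override fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult) {
        cache[statName to arguments] = value
    }

    override fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult? =
        cache[statName to arguments]
}

// Hypothetical public factory: callers only ever see ValueColumn<T>.
fun <T> columnOf(vararg values: T): ValueColumn<T> = ValueColumnImpl(values.toList())

fun main() {
    val column = columnOf(2, 7, 4)
    val args = mapOf<String, Any>("skipNaN" to false)
    column.internal().putStatisticCache("max", args, StatisticResult(7))
    println(column.internal().getStatisticCacheOrNull("max", args)?.value) // 7
}
```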

Is that a bit clearer?

Contributor Author

@CarloMariaProietti CarloMariaProietti Jan 7, 2026

Thank you for the explanation, now this is very clear.
I'm still a bit confused about the following statement:

> I would move the logic of getting/storing statistics here and only call those functions in Aggregators.kt

Why should these functions be called in Aggregators.kt? Shouldn't they be called in aggregateSingleColumn, which is the only function that needs that logic?

Collaborator

Ah, I'm sorry for the confusion. I did indeed mean aggregateSingleColumn :) so AggregatorAggregationHandler.kt

    fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult?
}

internal fun <T> ValueColumn<T>.internal() = this as ValueColumnInternal<T>

Collaborator

What do you mean, "it breaks for ValueColumnWithParent"? Ah, probably because ValueColumnWithParent only implements ValueColumn... Instead of removing this overload (which doesn't fix the issue), it would be better to make sure all implementations of ValueColumn also implement ValueColumnInternal :)

Collaborator

Otherwise, only ValueColumns which specifically use the ValueColumnImpl implementation will work with this statistics cache.
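The sketch below illustrates the point with stand-in types. PublicOnlyWrapper and CachingWrapper are hypothetical classes invented for this example and are not the real ValueColumnWithParent. Interface delegation is just one possible way for a wrapper to implement ValueColumnInternal, not necessarily what the library should do.

```kotlin
// Simplified stand-ins for the library's types (assumptions for this sketch).
interface ValueColumn<T> {
    val values: List<T>
}

data class StatisticResult(val value: Any?)

interface ValueColumnInternal<T> : ValueColumn<T> {
    fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult)
    fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult?
}

fun <T> ValueColumn<T>.internal(): ValueColumnInternal<T> = this as ValueColumnInternal<T>

class ValueColumnImpl<T>(override val values: List<T>) : ValueColumnInternal<T> {
    private val cache = mutableMapOf<Pair<String, Map<String, Any>>, StatisticResult>()

    override fun putStatisticCache(statName: String, arguments: Map<String, Any>, value: StatisticResult) {
        cache[statName to arguments] = value
    }

    override fun getStatisticCacheOrNull(statName: String, arguments: Map<String, Any>): StatisticResult? =
        cache[statName to arguments]
}

// Implements only the public interface, so internal() fails at runtime.
class PublicOnlyWrapper<T>(val source: ValueColumn<T>) : ValueColumn<T> by source

// Implements the internal interface by delegating to the wrapped column.
class CachingWrapper<T>(val source: ValueColumnInternal<T>) : ValueColumnInternal<T> by source

fun main() {
    val impl = ValueColumnImpl(listOf(1, 2, 3))
    val args = emptyMap<String, Any>()

    val broken: ValueColumn<Int> = PublicOnlyWrapper(impl)
    println(runCatching { broken.internal() }.isFailure) // true (ClassCastException)

    val ok: ValueColumn<Int> = CachingWrapper(impl)
    ok.internal().putStatisticCache("max", args, StatisticResult(3))
    // The wrapper shares the cache of the column it wraps.
    println(impl.getStatisticCacheOrNull("max", args)?.value) // 3
}
```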

Contributor Author

@CarloMariaProietti CarloMariaProietti Jan 8, 2026

> What do you mean, "it breaks for ValueColumnWithParent"?

There is a compilation problem in ValueColumnWithParent.kt due to the following lines:

override fun changeType(type: KType) =
        ValueColumnWithParent(parent, source.internal().changeType(type).asValueColumn())

Development

Successfully merging this pull request may close these issues.

Lazy statistics for columns
