Skip to content
This repository was archived by the owner on May 4, 2019. It is now read-only.

Change unique() to return values in the same ordering as levels for PDAs #237

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nalimilan
Copy link
Member

While the generic unique() method says it preserves the order of appearance,
the ordering of levels is more likely to be useful. In particular, it will
allow StatsModels to use unique() to get levels present in the data in the
user-defined order, with the first level as reference by default.

The new code (inspired by CategoricalArrays) is also more efficient in the
common case where all values are encountered well before the end of the array,
by doing a periodic short-circuiting check.


See JuliaStats/StatsModels.jl#13 (comment). The current behavior was chosen after discussion at #92, but it looks like the issues where mainly about DataArray, not PooledDataArray. It's slightly annoying to deviate from the standard behavior of unique, but I can't think of cases where the order of appearance would be more useful than the ordering of levels, which should be carefully chosen (else using a PDA doesn't make much sense).

The other solution is to provide a separate function for this, but that sounds overkill.

While the generic unique() method says it preserves the order of appearance,
the ordering of levels is more likely to be useful. In particular, it will
allow StatsModels to use unique() to get levels present in the data in the
user-defined order, with the first level as reference by default.

The new code is based on CategoricalArrays.
@nalimilan
Copy link
Member Author

Another approach which I find possibly better would be to keep the current behavior, but define unique(x::AbstractArray, sort::Bool=false). When sort=true, the order of levels would be preserved for PooledDataArray; for other AbstractArray, we would simply call sort!(unique(x)). That would exactly suit our needs (see JuliaStats/StatsModels.jl#14 (comment)), and could be a useful behavior in general. The default method could even be defined in Base.

@ararslan
Copy link
Member

The default method could even be defined in Base.

If we go that route I think it needs to be in Base, otherwise we're committing the grave sin of type piracy (i.e. overloading Base methods with Base types in an external package).

@nalimilan
Copy link
Member Author

Yet another approach is not to change anything, and obtain the ordered set of unique values via intersect(levels(x), unique(x)). The advantage is that it's simple and does not require a new API, the drawback is that unique does not return the most useful result by default (i.e. functions which call unique generically and return results based on that will use a meaningless order while a useful order is available).

@nalimilan
Copy link
Member Author

I think we should merge this and discuss the larger issue of what unique means for AbstractArray later, as we need it to get StatsModels to work with DataArrays.

@ararslan
Copy link
Member

I'm hesitant to deviate from the behavior in Base when overloading a Base function, even if a more useful behavior is possible. intersect(levels(x), unique(x)) seems okay to me, if a bit verbose.

@nalimilan
Copy link
Member Author

Yeah, but that requires defining levels somewhere. Not so urgent I guess if we go with JuliaData/DataFrames.jl#1170 for now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants