Sorting of strata in training data from initial_split #484

@dicook

I have just finished running a Kaggle challenge for my machine learning class. I was surprised to find that the training data is sorted by the variable used for strata:

# Here is code to reproduce
library(rsample)  # initial_split(), training(), testing()
library(tibble)   # tibble()

set.seed(921)
d <- tibble(x = runif(100), y = sample(c("y", "n"), 100, replace = TRUE))
d_split <- initial_split(d, strata = y)
d_tr <- training(d_split)
d_tr$y
d_ts <- testing(d_split)
d_ts$y

and we find

> d_tr$y
 [1] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[13] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[25] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[37] "n" "n" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[49] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[61] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[73] "y" "y"

although the test set is not (luckily for me!)

> d_ts$y
 [1] "n" "n" "n" "n" "y" "y" "n" "y" "y" "y" "y" "y"
[13] "y" "y" "y" "n" "n" "n" "n" "n" "y" "y" "n" "y"
[25] "n" "n"

I'm not sure if this is intentional, but it would be better for the default to return the rows in random order, with sorting available as an optional argument.
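
In the meantime, one possible workaround is to shuffle the training rows after the split. The following is a minimal sketch in base R, not anything provided by rsample: `shuffle_rows` is a hypothetical helper, and `d_sorted` is a made-up stand-in for the output of `training(d_split)`.

```r
# Hypothetical helper (not part of rsample): return the rows of a data frame
# in a random order.
shuffle_rows <- function(df) {
  df[sample(nrow(df)), , drop = FALSE]
}

set.seed(921)

# Stand-in for training(d_split): a data frame sorted by the strata column.
d_sorted <- data.frame(
  x = runif(74),
  y = rep(c("n", "y"), times = c(38, 36))
)

d_shuffled <- shuffle_rows(d_sorted)

# Same rows and the same class counts; only the row order differs.
table(d_shuffled$y)
```

With the reproduction code above, the equivalent call would presumably be `d_tr <- shuffle_rows(training(d_split))`.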

Labels: feature (a feature request or enhancement)
