I have just finished running a Kaggle challenge for my machine learning class. I was surprised to find that the training data is sorted by the variable used for strata:
# Here is code to reproduce
library(rsample)  # initial_split(), training(), testing()
library(tibble)   # tibble()
set.seed(921)
d <- tibble(x = runif(100), y = sample(c("y", "n"), 100, replace = TRUE))
d_split <- initial_split(d, strata = y)
d_tr <- training(d_split)
d_tr$y
d_ts <- testing(d_split)
d_ts$y
and we find
> d_tr$y
[1] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[13] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[25] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[37] "n" "n" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[49] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[61] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[73] "y" "y"
although the test set is not (luckily for me!):
> d_ts$y
[1] "n" "n" "n" "n" "y" "y" "n" "y" "y" "y" "y" "y"
[13] "y" "y" "y" "n" "n" "n" "n" "n" "y" "y" "n" "y"
[25] "n" "n"
I'm not sure whether this is intentional, but it would be better for the default to return the rows in random order, with sorting available as an optional input parameter.
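In the meantime, a possible workaround is to shuffle the training rows after the split. This is only a minimal sketch using base R's `sample()`; the name `d_tr_shuffled` is introduced here for illustration, and nothing in it changes how `initial_split()` itself behaves.

```r
library(rsample)
library(tibble)

set.seed(921)
d <- tibble(x = runif(100), y = sample(c("y", "n"), 100, replace = TRUE))
d_split <- initial_split(d, strata = y)
d_tr <- training(d_split)

# Workaround (not an rsample feature): permute the row indices so the
# training rows are no longer grouped by the strata variable.
d_tr_shuffled <- d_tr[sample(nrow(d_tr)), ]
d_tr_shuffled$y
```

The shuffle only reorders rows, so the class proportions produced by the stratified split are unchanged.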