Sorting of strata in training data from initial_split #484

I have just finished running a Kaggle challenge for my machine learning class. I was surprised to find that the training data returned by initial_split is sorted by the variable used for strata:

```
# Code to reproduce
library(rsample)  # initial_split(), training(), testing()
library(tibble)   # tibble()

set.seed(921)
d <- tibble(x = runif(100), y = sample(c("y", "n"), 100, replace = TRUE))
d_split <- initial_split(d, strata = y)
d_tr <- training(d_split)
d_tr$y
d_ts <- testing(d_split)
d_ts$y
```
The training set comes back with all the "n" rows first, followed by all the "y" rows:

```
> d_tr$y
 [1] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[13] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[25] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[37] "n" "n" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[49] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[61] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[73] "y" "y"
```

although the test set is not (luckily for me!):

```
> d_ts$y
 [1] "n" "n" "n" "n" "y" "y" "n" "y" "y" "y" "y" "y"
[13] "y" "y" "y" "n" "n" "n" "n" "n" "y" "y" "n" "y"
[25] "n" "n"
```

I'm not sure whether this is intentional, but it would be better if the rows were returned in a random order by default, with sorting by stratum available through an optional argument.
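In the meantime, one possible workaround is to shuffle the rows of the training set myself after splitting. Below is a minimal sketch in base R. `shuffle_rows` is a hypothetical helper I made up for illustration, not an rsample function, and `d_sorted` is only a small stand-in for a training set that has come back grouped by stratum.

```r
# Hypothetical helper (not part of rsample): return the rows of a data
# frame in a random order.
shuffle_rows <- function(df) {
  # sample.int(n) gives a random permutation of 1..n
  df[sample.int(nrow(df)), , drop = FALSE]
}

# Stand-in for a training set that came back grouped by stratum
set.seed(921)
d_sorted <- data.frame(x = runif(10), y = rep(c("n", "y"), each = 5))

# Same rows, same class counts, different order
d_shuffled <- shuffle_rows(d_sorted)
d_shuffled$y
```

With the reproduction code above, this would be applied as `d_tr <- shuffle_rows(training(d_split))`. Shuffling only changes the row order, so the class proportions produced by the stratified split are unaffected.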

