-
Notifications
You must be signed in to change notification settings - Fork 67
Description
Feature
I was reminded about #438 by the GitHub lock bot, an issue where a user was surprised that vfold_cv() (and eventually make_strata()) "didn't stratify" (or rather, treated the data as only having one stratum) when the stratification variable only had one class above the pooling threshold.
I think rsample is doing the right thing here, and behaving as documented, but this behavior is still a bit surprising. Would it be possible for make_strata() to warn when it only returns a single stratum? I imagine this is almost always unintentional, as users wouldn't specify a stratification variable if they thought it would go unused.
Another consideration here is that, even if only one stratum is created, the rset objects still contain a strata attribute. As a result, when printed these objects claim that they were created "using stratification":
data.frame(
x = rnorm(100),
y = c(rep("a", 99), "b")
) |>
rsample::vfold_cv(strata = y)
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [90/10]> Fold01
#> 2 <split [90/10]> Fold02
#> 3 <split [90/10]> Fold03
#> 4 <split [90/10]> Fold04
#> 5 <split [90/10]> Fold05
#> 6 <split [90/10]> Fold06
#> 7 <split [90/10]> Fold07
#> 8 <split [90/10]> Fold08
#> 9 <split [90/10]> Fold09
#> 10 <split [90/10]> Fold10Created on 2023-07-27 with reprex v2.0.2
This might be a bit misleading, as the sampling here didn't depend on the y value at all. Would it make sense to drop the strata attribute if only one stratum is created?
Finally, would it make sense for the categorical branch of make_strata to provide a message listing the categories that get "pooled" together, and which stratum they were pooled into? This might help users catch processing mistakes, if they weren't expecting to have any rare classes that would get automatically pooled. This might be too noisy though, and not as useful as warning about "single stratum" cases.