Could make_strata() warn (or remove the strata attribute) when only returning a single stratum? Or message when pooling at all? #441

@mikemahoney218

Feature

The GitHub lock bot reminded me of #438, an issue where a user was surprised that vfold_cv() (and, underneath it, make_strata()) "didn't stratify" (or rather, treated the data as having only one stratum) when the stratification variable had only one class above the pooling threshold.

I think rsample is doing the right thing here, and behaving as documented, but this behavior is still a bit surprising. Would it be possible for make_strata() to warn when it only returns a single stratum? I imagine this is almost always unintentional, as users wouldn't specify a stratification variable if they thought it would go unused.
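To make the request concrete, here is a minimal sketch of the kind of check I have in mind. The helper `warn_if_single_stratum()` is hypothetical and not part of rsample; it only assumes that the stratum assignments are available as a vector or factor of labels, and the warning text is a placeholder.

```r
# Hypothetical helper, not part of rsample: warn when a vector of stratum
# labels collapses to a single stratum.
warn_if_single_stratum <- function(strata) {
  n_strata <- length(unique(stats::na.omit(strata)))
  if (n_strata <= 1) {
    warning(
      "Only one stratum was created; resampling will be unstratified.",
      call. = FALSE
    )
  }
  # Return the input unchanged so the check can sit inside a pipeline
  invisible(strata)
}

# Stand-in for stratum labels where every observation ended up together
strata <- factor(rep("a", 100))
warn_if_single_stratum(strata)
```

Whether this should be a warning or a message, and where exactly it would live inside rsample, is open for discussion.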

Another consideration here is that, even if only one stratum is created, the rset objects still contain a strata attribute. As a result, when printed, these objects claim that they were created "using stratification":

```r
data.frame(
  x = rnorm(100),
  y = c(rep("a", 99), "b")
) |>
  rsample::vfold_cv(strata = y)
#> #  10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#>    splits          id
#>    <list>          <chr>
#>  1 <split [90/10]> Fold01
#>  2 <split [90/10]> Fold02
#>  3 <split [90/10]> Fold03
#>  4 <split [90/10]> Fold04
#>  5 <split [90/10]> Fold05
#>  6 <split [90/10]> Fold06
#>  7 <split [90/10]> Fold07
#>  8 <split [90/10]> Fold08
#>  9 <split [90/10]> Fold09
#> 10 <split [90/10]> Fold10
```

Created on 2023-07-27 with reprex v2.0.2

This might be a bit misleading, as the sampling here didn't depend on the y value at all. Would it make sense to drop the strata attribute if only one stratum is created?
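As a rough illustration of the idea, the sketch below uses a plain data frame with a `strata` attribute as a stand-in for an rset. The helper `drop_strata_if_single()` is hypothetical, and I am assuming that clearing the attribute is all the print method would need, which I have not verified against rsample's internals.

```r
# Hypothetical helper, not part of rsample: clear the "strata" attribute
# when the stratum labels contain at most one distinct value.
drop_strata_if_single <- function(rset, strata) {
  if (length(unique(strata)) <= 1) {
    attr(rset, "strata") <- NULL
  }
  rset
}

# Stand-in object: a data frame carrying a "strata" attribute
folds <- structure(
  data.frame(id = c("Fold01", "Fold02")),
  strata = "y"
)

# All observations share one stratum, so the attribute is removed
folds <- drop_strata_if_single(folds, factor(rep("a", 100)))
is.null(attr(folds, "strata"))
```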

Finally, would it make sense for the categorical branch of make_strata() to emit a message listing the categories that get "pooled" together, and which stratum they were pooled into? This might help users catch processing mistakes if they weren't expecting any rare classes to be pooled automatically. It might be too noisy, though, and less useful than warning about "single stratum" cases.
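A minimal sketch of what such a message could look like is below. The helper `report_pooled_levels()` is hypothetical and not part of rsample; it assumes pooling is decided by comparing each level's relative frequency against a threshold (0.1 here, purely for illustration), and it only lists the rare levels rather than the stratum each one ends up in.

```r
# Hypothetical helper, not part of rsample: report which levels of a
# categorical stratification variable fall below a pooling threshold.
report_pooled_levels <- function(x, pool = 0.1) {
  props <- prop.table(table(x))
  rare <- names(props)[as.vector(props) < pool]
  if (length(rare) > 0) {
    message(
      "Levels below the pooling threshold (", pool, "): ",
      paste(rare, collapse = ", ")
    )
  }
  # Return the rare levels invisibly so callers can inspect them
  invisible(rare)
}

# Same stratification variable as in the reprex above
y <- c(rep("a", 99), "b")
report_pooled_levels(y)
```

Using `message()` rather than `warning()` would let users silence it with `suppressMessages()` if it turns out to be too chatty.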
