Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disaggregating counts #279

Closed
andrewpbray opened this issue Mar 7, 2017 · 9 comments
Closed

Disaggregating counts #279

andrewpbray opened this issue Mar 7, 2017 · 9 comments
Labels
feature a feature request or enhancement

Comments

@andrewpbray
Copy link

andrewpbray commented Mar 7, 2017

What is the best way to expand out a contingency table (Case C, below) or a data frame with a count variable (Case B) into the taller data frame where each row is one of those cases that is aggregated into the count (Case A)? It seems to me you can go from C to B via functions in tidyr, but I'm wondering if there is a cleaner way to go from B to A than dipping into rep() and all that. Perhaps a group_by() %>% disaggregate() or an untally()?

While this operation seems like a bad idea from an efficiency standpoint, I can think of two use cases: transforming the data frame in preparation for subsequent visualization or modeling functions that expect that format, and in situations where you want to join with additional individual-level variables.

library(tidyverse)

# Case A: tidy data where the case is a single plant
CO2 %>%
  select(Plant:Treatment) %>%
  glimpse()
# Observations: 84
# Variables: 3
# $ Plant     <ord> Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn2, Qn2, Q...
# $ Type      <fctr> Quebec, Quebec, Quebec, Quebec, Quebec, Quebe...
# $ Treatment <fctr> nonchilled, nonchilled, nonchilled, nonchille...

# Case B: tidy data where the case is the plantXtypeXtreatment combo
CO2 %>%
  group_by(Plant, Type, Treatment) %>%
  summarize(count = n())
# Source: local data frame [12 x 4]
# Groups: Plant, Type [?]
# 
# Plant        Type  Treatment count
# <ord>      <fctr>     <fctr> <int>
# 1    Qn1      Quebec nonchilled     7
# 2    Qn2      Quebec nonchilled     7
# 3    Qn3      Quebec nonchilled     7
# 4    Qc1      Quebec    chilled     7
# 5    Qc3      Quebec    chilled     7

# Case C: non-tidy contingency table
CO2 %>%
  select(Type, Treatment) %>%
  table()
#                     Treatment
# Type          nonchilled chilled
# Quebec              21      21
# Mississippi         21      21

@ismayc

@jennybc
Copy link
Member

jennybc commented Mar 7, 2017

You can get some of what you want directly with dplyr::count().

suppressPackageStartupMessages(library(tidyverse))

CO2 %>%
  count(Type, Treatment)
#> Source: local data frame [4 x 3]
#> Groups: Type [?]
#> 
#>          Type  Treatment     n
#>        <fctr>     <fctr> <int>
#> 1      Quebec nonchilled    21
#> 2      Quebec    chilled    21
#> 3 Mississippi nonchilled    21
#> 4 Mississippi    chilled    21

CO2 %>% 
  count(Plant, Type, Treatment)
#> Source: local data frame [12 x 4]
#> Groups: Plant, Type [?]
#> 
#>    Plant        Type  Treatment     n
#>    <ord>      <fctr>     <fctr> <int>
#> 1    Qn1      Quebec nonchilled     7
#> 2    Qn2      Quebec nonchilled     7
#> 3    Qn3      Quebec nonchilled     7
#> 4    Qc1      Quebec    chilled     7
#> 5    Qc3      Quebec    chilled     7
#> 6    Qc2      Quebec    chilled     7
#> 7    Mn3 Mississippi nonchilled     7
#> 8    Mn2 Mississippi nonchilled     7
#> 9    Mn1 Mississippi nonchilled     7
#> 10   Mc2 Mississippi    chilled     7
#> 11   Mc3 Mississippi    chilled     7
#> 12   Mc1 Mississippi    chilled     7

@andrewpbray
Copy link
Author

andrewpbray commented Mar 7, 2017

Ah, that's a nice way to form these aggregated data frames. So I guess what I'm looking for is an uncount() to reverse that operation.

@jennybc
Copy link
Member

jennybc commented Mar 7, 2017

I misunderstood you. How about this then?

suppressPackageStartupMessages(library(tidyverse))

(df1 <- CO2 %>%
    count(Type, Treatment))
#> # A tibble: 4 × 3
#>          Type  Treatment     n
#>        <fctr>     <fctr> <int>
#> 1      Quebec nonchilled    21
#> 2      Quebec    chilled    21
#> 3 Mississippi nonchilled    21
#> 4 Mississippi    chilled    21

df1 %>% 
  mutate(ids = map(n, seq_len)) %>% 
  unnest()
#> # A tibble: 84 × 4
#>      Type  Treatment     n   ids
#>    <fctr>     <fctr> <int> <int>
#> 1  Quebec nonchilled    21     1
#> 2  Quebec nonchilled    21     2
#> 3  Quebec nonchilled    21     3
#> 4  Quebec nonchilled    21     4
#> 5  Quebec nonchilled    21     5
#> 6  Quebec nonchilled    21     6
#> 7  Quebec nonchilled    21     7
#> 8  Quebec nonchilled    21     8
#> 9  Quebec nonchilled    21     9
#> 10 Quebec nonchilled    21    10
#> # ... with 74 more rows

situations where you want to join with additional individual-level variables

I'm not sure how this would work because the pseudo-observations manufactured above have no distinguishing values within each group that has been disaggregated. What would you join on?

@andrewpbray
Copy link
Author

Yeah, that does the trick.

That's a good point about the join. The identifier would be lost in the aggregation and there's really no way to get it back. I guess the better use case is thinking about subsequent modeling and visualization. There are some work arounds (e.g. stat = identity) but for, say, ANOVA, seems like you'd need that taller format.

@hadley hadley added the feature a feature request or enhancement label Jun 23, 2017
@hadley
Copy link
Member

hadley commented Nov 16, 2017

Here's a simple implementation:

explode <- function(x, weight_var) {
  w <- rlang::eval_tidy(rlang::enquo(weight_var), x)
  x[rep(seq_len(nrow(x)), w), , drop = FALSE]
}

df <- tibble::tibble(x = c("a", "b"), y = c(1, 3))
explode(df, y)

Needs to check that w is a numeric vector the same length as x. For grouped tables, probably need to evaluate expression in grouped context, i.e. just use mutate(). Make optional to add .id variable?

@hadley
Copy link
Member

hadley commented Nov 16, 2017

More fuller featured implementation. Any ideas for a better name than explode?

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-16

library(dplyr, warn.conflicts = FALSE)
library(rlang)
explode <- function(df, w, .id = NULL) {
  w <- pull(mutate(df, `_weight` = !!enquo(w)))
  
  if (!is.numeric(w)) {
    stop("`w` must evalute to a numeric vector", call. = FALSE)
  }
  
  df <- df[rep(seq_len(nrow(df)), w), , drop = FALSE]
  
  if (!is.null(.id)) {
    df[[.id]] <- sequence(w)
  }
  
  df
}
df <- tibble::tibble(x = c("a", "b"), y = c(1, 2))
explode(df, y)
#> # A tibble: 3 x 2
#>       x     y
#>   <chr> <dbl>
#> 1     a     1
#> 2     b     2
#> 3     b     2
explode(df, 2/y)
#> # A tibble: 3 x 2
#>       x     y
#>   <chr> <dbl>
#> 1     a     1
#> 2     a     1
#> 3     b     2
explode(df, 2)
#> # A tibble: 4 x 2
#>       x     y
#>   <chr> <dbl>
#> 1     a     1
#> 2     a     1
#> 3     b     2
#> 4     b     2
explode(df, 2, .id = "id")
#> # A tibble: 4 x 3
#>       x     y    id
#>   <chr> <dbl> <int>
#> 1     a     1     1
#> 2     a     1     2
#> 3     b     2     1
#> 4     b     2     2
df <- tibble::tibble(x = c("a", "b", "b"), y = c(1, 2, 1))
df %>% group_by(x) %>% explode(y, .id = "id")
#> # A tibble: 4 x 3
#> # Groups:   x [2]
#>       x     y    id
#>   <chr> <dbl> <int>
#> 1     a     1     1
#> 2     b     2     1
#> 3     b     2     2
#> 4     b     1     1
df %>% group_by(x) %>% explode(max(y), .id = "id")
#> # A tibble: 5 x 3
#> # Groups:   x [2]
#>       x     y    id
#>   <chr> <dbl> <int>
#> 1     a     1     1
#> 2     b     2     1
#> 3     b     2     2
#> 4     b     1     1
#> 5     b     1     2

@jennybc
Copy link
Member

jennybc commented Nov 16, 2017

Thesaurus session yields some good leads: expand, amplify

And some near misses: bloat, exaggerate, disintegrate

@hadley hadley closed this as completed in 5b81db1 Nov 20, 2017
@hadley
Copy link
Member

hadley commented Nov 20, 2017

I went with uncount() since it's simple and expressive. Please kick the tires and let me know if you discover any issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants
@hadley @jennybc @andrewpbray and others