-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disaggregating counts #279
Comments
You can get some of what you want directly with suppressPackageStartupMessages(library(tidyverse))
CO2 %>%
count(Type, Treatment)
#> Source: local data frame [4 x 3]
#> Groups: Type [?]
#>
#> Type Treatment n
#> <fctr> <fctr> <int>
#> 1 Quebec nonchilled 21
#> 2 Quebec chilled 21
#> 3 Mississippi nonchilled 21
#> 4 Mississippi chilled 21
CO2 %>%
count(Plant, Type, Treatment)
#> Source: local data frame [12 x 4]
#> Groups: Plant, Type [?]
#>
#> Plant Type Treatment n
#> <ord> <fctr> <fctr> <int>
#> 1 Qn1 Quebec nonchilled 7
#> 2 Qn2 Quebec nonchilled 7
#> 3 Qn3 Quebec nonchilled 7
#> 4 Qc1 Quebec chilled 7
#> 5 Qc3 Quebec chilled 7
#> 6 Qc2 Quebec chilled 7
#> 7 Mn3 Mississippi nonchilled 7
#> 8 Mn2 Mississippi nonchilled 7
#> 9 Mn1 Mississippi nonchilled 7
#> 10 Mc2 Mississippi chilled 7
#> 11 Mc3 Mississippi chilled 7
#> 12 Mc1 Mississippi chilled 7 |
Ah, that's a nice way to form these aggregated data frames. So I guess what I'm looking for is an |
I misunderstood you. How about this then? suppressPackageStartupMessages(library(tidyverse))
(df1 <- CO2 %>%
count(Type, Treatment))
#> # A tibble: 4 × 3
#> Type Treatment n
#> <fctr> <fctr> <int>
#> 1 Quebec nonchilled 21
#> 2 Quebec chilled 21
#> 3 Mississippi nonchilled 21
#> 4 Mississippi chilled 21
df1 %>%
mutate(ids = map(n, seq_len)) %>%
unnest()
#> # A tibble: 84 × 4
#> Type Treatment n ids
#> <fctr> <fctr> <int> <int>
#> 1 Quebec nonchilled 21 1
#> 2 Quebec nonchilled 21 2
#> 3 Quebec nonchilled 21 3
#> 4 Quebec nonchilled 21 4
#> 5 Quebec nonchilled 21 5
#> 6 Quebec nonchilled 21 6
#> 7 Quebec nonchilled 21 7
#> 8 Quebec nonchilled 21 8
#> 9 Quebec nonchilled 21 9
#> 10 Quebec nonchilled 21 10
#> # ... with 74 more rows
I'm not sure how this would work because the pseudo-observations manufactured above have no distinguishing values within each group that has been disaggregated. What would you join on? |
Yeah, that does the trick. That's a good point about the join. The identifier would be lost in the aggregation and there's really no way to get it back. I guess the better use case is thinking about subsequent modeling and visualization. There are some work arounds (e.g. |
Here's a simple implementation: explode <- function(x, weight_var) {
w <- rlang::eval_tidy(rlang::enquo(weight_var), x)
x[rep(seq_len(nrow(x)), w), , drop = FALSE]
}
df <- tibble::tibble(x = c("a", "b"), y = c(1, 3))
explode(df, y) Needs to check that |
More fuller featured implementation. Any ideas for a better name than explode? reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-16
library(dplyr, warn.conflicts = FALSE)
library(rlang)
explode <- function(df, w, .id = NULL) {
w <- pull(mutate(df, `_weight` = !!enquo(w)))
if (!is.numeric(w)) {
stop("`w` must evalute to a numeric vector", call. = FALSE)
}
df <- df[rep(seq_len(nrow(df)), w), , drop = FALSE]
if (!is.null(.id)) {
df[[.id]] <- sequence(w)
}
df
}
df <- tibble::tibble(x = c("a", "b"), y = c(1, 2))
explode(df, y)
#> # A tibble: 3 x 2
#> x y
#> <chr> <dbl>
#> 1 a 1
#> 2 b 2
#> 3 b 2
explode(df, 2/y)
#> # A tibble: 3 x 2
#> x y
#> <chr> <dbl>
#> 1 a 1
#> 2 a 1
#> 3 b 2
explode(df, 2)
#> # A tibble: 4 x 2
#> x y
#> <chr> <dbl>
#> 1 a 1
#> 2 a 1
#> 3 b 2
#> 4 b 2
explode(df, 2, .id = "id")
#> # A tibble: 4 x 3
#> x y id
#> <chr> <dbl> <int>
#> 1 a 1 1
#> 2 a 1 2
#> 3 b 2 1
#> 4 b 2 2
df <- tibble::tibble(x = c("a", "b", "b"), y = c(1, 2, 1))
df %>% group_by(x) %>% explode(y, .id = "id")
#> # A tibble: 4 x 3
#> # Groups: x [2]
#> x y id
#> <chr> <dbl> <int>
#> 1 a 1 1
#> 2 b 2 1
#> 3 b 2 2
#> 4 b 1 1
df %>% group_by(x) %>% explode(max(y), .id = "id")
#> # A tibble: 5 x 3
#> # Groups: x [2]
#> x y id
#> <chr> <dbl> <int>
#> 1 a 1 1
#> 2 b 2 1
#> 3 b 2 2
#> 4 b 1 1
#> 5 b 1 2 |
Thesaurus session yields some good leads: expand, amplify And some near misses: bloat, exaggerate, disintegrate |
I went with |
What is the best way to expand out a contingency table (Case C, below) or a data frame with a count variable (Case B) into the taller data frame where each row is one of those cases that is aggregated into the count (Case A)? It seems to me you can go from C to B via functions in
tidyr
, but I'm wondering if there is a cleaner way to go from B to A than dipping intorep()
and all that. Perhaps agroup_by() %>% disaggregate()
or anuntally()
?While this operation seems like a bad idea from an efficiency standpoint, I can think of two use cases: transforming the data frame in preparation for subsequent visualization or modeling functions that expect that format, and in situations where you want to join with additional individual-level variables.
@ismayc
The text was updated successfully, but these errors were encountered: