11--- 
2- title : Correlation utilities 
2+ title : 2. Computing signal correlations 
3+ description : Calculate correlations over space and time between multiple signals. 
34output : rmarkdown::html_vignette 
45vignette : > 
5-   %\VignetteIndexEntry{2. Correlation utilities } 
6+   %\VignetteIndexEntry{2. Computing signal correlations } 
67  %\VignetteEngine{knitr::rmarkdown} 
78  %\VignetteEncoding{UTF-8} 
89--- 
@@ -11,9 +12,8 @@ The covidcast package provides some simple utilities for exploring the
1112correlations between two signals, over space or time, which may be helpful for
1213simple analyses and explorations of data.
1314
14- For these examples, we'll load confirmed cases and deaths to compare against,
15- and restrict our analysis to counties with at least 500 total cases by August
16- 15th.
15+ For these examples, we'll load confirmed case and death rates. and restrict our
16+ analysis to counties with at least 500 total cases by August 15th.
1717
1818``` {r,  message = FALSE}
1919library(covidcast) 
@@ -22,27 +22,30 @@ library(dplyr)
2222start_day <- "2020-03-01" 
2323end_day <- "2020-08-15" 
2424
25- inum  <- suppressMessages(
25+ iprop  <- suppressMessages(
2626  covidcast_signal(data_source = "jhu-csse", 
27-                    signal = "confirmed_7dav_incidence_num ",  
27+                    signal = "confirmed_7dav_incidence_prop ",  
2828                   start_day = start_day, end_day = end_day) 
2929) 
30- summary(inum ) 
30+ summary(iprop ) 
3131
32- dnum  <- suppressMessages(
32+ dprop  <- suppressMessages(
3333  covidcast_signal(data_source = "jhu-csse", 
34-                    signal = "deaths_7dav_incidence_num ",  
34+                    signal = "deaths_7dav_incidence_prop ",  
3535                   start_day = start_day, end_day = end_day) 
3636) 
37- summary(dnum ) 
37+ summary(dprop ) 
3838
3939# Restrict attention to "active" counties with at least 500 total cases  
4040case_num <- 500 
41- geo_values <- inum %>% group_by(geo_value) %>%  
42-   summarize(total = sum(value)) %>%  
43-   filter(total >= case_num) %>% pull(geo_value) 
44- inum_act <- inum %>% filter(geo_value %in% geo_values) 
45- dnum_act <- dnum %>% filter(geo_value %in% geo_values) 
41+ geo_values <- suppressMessages( 
42+   covidcast_signal(data_source = "jhu-csse", 
43+                    signal = "confirmed_cumulative_num",  
44+                    start_day = end_day, end_day = end_day) %>% 
45+   filter(value >= case_num) %>% pull(geo_value) 
46+ ) 
47+ iprop_act <- iprop %>% filter(geo_value %in% geo_values) 
48+ dprop_act <- dprop %>% filter(geo_value %in% geo_values) 
4649``` 
4750
4851## Correlations sliced by time  
@@ -60,91 +63,72 @@ by setting `by = "time_value"`:
6063library(ggplot2) 
6164
6265# Compute correlation per time, over all counties 
63- df_cor1  <- covidcast_cor(inum_act, dnum_act , by = "time_value")
66+ df_cor  <- covidcast_cor(iprop_act, dprop_act , by = "time_value")
6467
6568# Plot the correlation time series 
66- ggplot(df_cor1 , aes(x = time_value, y = value)) + geom_line() +  
67-   labs(title = "Correlation between cases  and deaths ", 
69+ ggplot(df_cor , aes(x = time_value, y = value)) + geom_line() +  
70+   labs(title = "Correlation between case  and death rates ", 
6871       subtitle = sprintf("Per day, over counties with at least %i cases",  
6972                          case_num), 
7073       x = "Date", y = "Correlation")  
7174``` 
72- 
73- (The sudden drop on July 25th is due to a [ sudden change in how New Jersey
74- reported deaths] ( https://github.com/CSSEGISandData/COVID-19/issues/2763 )  being
75- reflected in our data source as large outliers; since the signal is a 7-day
76- average, these outliers last until the beginning of July and affect the reported
77- correlation.)
78- 
79- We might also be interested in how cases now correlate with deaths in the
80- * future* . Using the ` dt_x `  parameter, we can lag cases back 10 days in time, 
81- before calculating correlations:
82- 
83- ``` {r,  warning = FALSE}
84- # Same, but now lag incidence case numbers back 10 days in time 
85- df_cor2 <- covidcast_cor(inum_act, dnum_act, by = "time_value", dt_x = -10) 
86- 
87- # Stack rowwise into one data frame, then plot time series 
88- df_cor <- rbind(df_cor1, df_cor2) 
89- df_cor$dt <- as.factor(c(rep(0, nrow(df_cor1)), rep(-10, nrow(df_cor2)))) 
90- ggplot(df_cor, aes(x = time_value, y = value)) +  
91-   geom_line(aes(color = dt)) +  
92-   labs(title = "Correlation between cases and deaths", 
93-        subtitle = sprintf("Per day, over counties with at least %i cases",  
94-                           case_num), 
95-        x = "Date", y = "Correlation") +  
96-   theme(legend.position = "bottom") 
97- ``` 
98- 
99- We can see that, for the most part, lagging the cases time series back by 10
100- days improves correlations, showing that cases are better correlated with deaths
101- 10 days from now.
102- 
103- We can also look at Spearman (rank) correlation, which is a more robust measure 
104- of correlation: it's invariant to monotone transformations, and doesn't rely on
105- any particular functional form for the dependence between two variables.
75+  
76+ The above plot addresses the question: "on any given day, are case and death
77+ rates linearly associated, over US counties?". We might be interested in 
78+ broadening this question, instead asking: "on any given day, do higher case 
79+ rates tend to associate with higher death rates?", removing the dependence on a 
80+ linear relationship. The latter can be addressed using Spearman correlation, 
81+ accomplished by setting ` method = "spearman" `  in the call to ` covidcast_cor() ` .
82+ Spearman correlation is highly robust and invariant to monotone transformations
83+ (it doesn't rely on any particular functional form for the dependence between 
84+ two variables). 
85+ 
86+ We might also interested in interested in how case rates associate with death
87+ rates in the * future* . Using the ` dt_x `  parameter in ` covidcast_cor() ` , we can 
88+ lag case rates back any number of days we want, before calculating correlations.
10689
10790``` {r,  warning = FALSE}
108- # Repeat this comparison, but now using Spearman (rank) correlation  
109- df_cor1 <- covidcast_cor(inum_act, dnum_act , by = "time_value",  
91+ # Use Spearman correlation, with case rates and 10-day lagged case rates  
92+ df_cor1 <- covidcast_cor(iprop_act, dprop_act , by = "time_value",  
11093                        method = "spearman") 
111- df_cor2 <- covidcast_cor(inum_act, dnum_act , by = "time_value", dt_x = -10, 
94+ df_cor2 <- covidcast_cor(iprop_act, dprop_act , by = "time_value", dt_x = -10, 
11295                        method = "spearman") 
11396
11497# Stack rowwise into one data frame, then plot time series 
11598df_cor <- rbind(df_cor1, df_cor2) 
11699df_cor$dt <- as.factor(c(rep(0, nrow(df_cor1)), rep(-10, nrow(df_cor2)))) 
117100ggplot(df_cor, aes(x = time_value, y = value)) +  
118101  geom_line(aes(color = dt)) +  
119-   labs(title = "Correlation between cases  and deaths ", 
102+   labs(title = "Correlation between case  and death rates ", 
120103       subtitle = sprintf("Per day, over counties with at least %i cases",  
121104                          case_num),  
122105       x = "Date", y = "Correlation") + 
123106  theme(legend.position = "bottom") 
124107``` 
125108
126- The "big dip" is gone (since the Spearman correlation uses ranks and not the
127- actual values, and hence is less sensitive to outliers), and we can again see
128- that lagging the cases time series helps correlations.
109+ We can see that, for the most part, the Spearman measure has bolstered the 
110+ correlations; and generally, lagging the case rates time series back by 10 days 
111+ improves correlations, confirming case rates are better correlated with death 
112+ rates 10 days from now.
129113
130114## Correlations sliced by county  
131115
132116The second option we have is to "slice by location": this calculates, for each 
133117geographic location, correlation between the time series of two signals. This
134118is obtained by setting ` by = "geo_value" ` . We'll again look at correlations 
135- both for observations at the same time and for 10-day lagged cases :
119+ both for observations at the same time and for 10-day lagged case rates :
136120
137121``` {r,  warning = FALSE}
138122# Compute correlation per county, over all times 
139- df_cor1 <- covidcast_cor(inum_act, dnum_act , by = "geo_value") 
140- df_cor2 <- covidcast_cor(inum_act, dnum_act , by = "geo_value", dt_x = -10) 
123+ df_cor1 <- covidcast_cor(iprop_act, dprop_act , by = "geo_value") 
124+ df_cor2 <- covidcast_cor(iprop_act, dprop_act , by = "geo_value", dt_x = -10) 
141125
142126# Stack rowwise into one data frame, then plot densities 
143127df_cor <- rbind(df_cor1, df_cor2) 
144128df_cor$dt <- as.factor(c(rep(0, nrow(df_cor1)), rep(-10, nrow(df_cor2)))) 
145129ggplot(df_cor, aes(value)) +  
146130  geom_density(aes(color = dt, fill = dt), alpha = 0.5) +  
147-   labs(title = "Correlation between cases  and deaths ", 
131+   labs(title = "Correlation between case  and death rates ", 
148132       subtitle = "Computed separately for each county, over all times", 
149133       x = "Date", y = "Density") + 
150134  theme(legend.position = "bottom") 
@@ -162,8 +146,8 @@ attributes(df_cor2)$metadata$geo_type <- "county"
162146class(df_cor2) <- c("covidcast_signal", "data.frame") 
163147
164148# Plot choropleth maps, using the covidcast plotting functionality 
165- plot(df_cor2, title = "Correlations between 10-day lagged cases  and deaths ", 
166-      range = c(-1, 1), choro_col = c("orange","lightblue", "purple")) 
149+ plot(df_cor2, title = "Correlations between 10-day lagged case  and death rates ", 
150+      range = c(-1, 1), choro_col = c("orange",  "lightblue", "purple")) 
167151``` 
168152
169153## More systematic lag analysis  
@@ -177,7 +161,7 @@ this:
177161dt_vec <- -(0:15) 
178162df_list <- vector("list", length(dt_vec)) 
179163for (i in 1:length(dt_vec)) { 
180-   df_list[[i]] <- covidcast_cor(inum_act, dnum_act , dt_x = dt_vec[i], 
164+   df_list[[i]] <- covidcast_cor(iprop_act, dprop_act , dt_x = dt_vec[i], 
181165                               by = "geo_value") 
182166  df_list[[i]]$dt <- dt_vec[i] 
183167} 
@@ -188,11 +172,11 @@ df %>%
188172  group_by(dt) %>% 
189173  summarize(median = median(value, na.rm = TRUE), .groups = "drop_last") %>% 
190174  ggplot(aes(x = dt, y = median)) + geom_line() + geom_point() + 
191-   labs(title = "Median correlation between cases  and deaths ", 
175+   labs(title = "Median correlation between case  and death rates ", 
192176       x = "dt", y = "Correlation") + 
193177  theme(legend.position = "bottom", legend.title = element_blank()) 
194178``` 
195179
196- We can see that the median correlation between cases  and deaths  (where the
180+ We can see that the median correlation between case  and death rates  (where the
197181correlations come from slicing by location) is maximized when we lag the case
198- incidence numbers  back 8 days in time.
182+ incidence rates  back 8 days in time.
0 commit comments