
Commit b0ec1cd

cipi >> paper, recs; me-gen >> {jlme}; py-gen >> req.txt to uv lock
1 parent 0bfb13e commit b0ec1cd

4 files changed (+21, -240 lines)

qmd/confidence-and-prediction-intervals.qmd

Lines changed: 6 additions & 0 deletions
@@ -187,6 +187,12 @@
 - [article](https://eranraviv.com/bootstrap-standard-error-estimates-good-news/)
 - bootstrap is "based on a weak convergence of moments"
 - if you use a bootstrap-based estimate of the standard deviation, you are being overly conservative (i.e. you overestimate the sd)
+- Recommendations ([Paper](https://journals.sagepub.com/doi/full/10.1177/2515245920911881)) (see the {boot} sketch after this hunk)
+- The percentile bootstrap works well when making inferences about trimmed means, quantiles, or correlation coefficients.
+- "However, percentile-bootstrap confidence intervals tend to be inaccurate in some situations because the bootstrap sampling distribution is skewed (asymmetric) and biased (consistently shifted away from the population value in one direction)"
+- "To address these problems, two major alternatives to the percentile bootstrap have been suggested: the bootstrap-t and the bias-corrected and accelerated (BCa) bootstrap"
+- Although "bootstrap-t can lead to more accurate confidence intervals for the mean and some trimmed means than the percentile bootstrap does", a percentile bootstrap is recommended for inferences about the 20% trimmed mean
+- The BCa approach can be unsatisfactory for relatively small sample sizes
 - Packages
 - [{]{style="color: #990000"}[ebtools::get_boot_ci](https://ercbk.github.io/ebtools/reference/get_boot_ci.html){style="color: #990000"}[}]{style="color: #990000"}
 - [{]{style="color: #990000"}[workboots](https://markjrieke.github.io/workboots/){style="color: #990000"}[}]{style="color: #990000"} - Bootstrap prediction intervals for arbitrary model types from a tidymodels workflow.
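To make the recommendations concrete, here is a minimal sketch (not part of the commit) that computes a percentile bootstrap CI for a 20% trimmed mean with {boot} and requests the BCa interval alongside it for comparison:

``` r
# Hedged sketch: percentile and BCa bootstrap CIs for a 20% trimmed mean
library(boot)

set.seed(123)
x <- rlnorm(50) # a skewed sample

# the statistic must accept the data and a vector of resampled indices
trim20 <- function(data, idx) mean(data[idx], trim = 0.2)

boot_out <- boot(data = x, statistic = trim20, R = 2000)

# "perc" = percentile interval (recommended for the 20% trimmed mean);
# "bca"  = bias-corrected and accelerated interval
boot.ci(boot_out, conf = 0.95, type = c("perc", "bca"))
```

The bootstrap-t interval (`type = "stud"`) would additionally require the statistic to return a variance estimate, so it is left out of this sketch.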

qmd/mixed-effects-general.qmd

Lines changed: 12 additions & 5 deletions
@@ -43,6 +43,8 @@ fig-cap-location: top
 - When you have **variable cluster sizes**, inverse cluster size weights can be specified to ensure that all clusters contribute equally regardless of cluster size, which mitigates a loss of power.
 - [{]{style="color: #990000"}[gpboost](https://github.com/fabsig/GPBoost){style="color: #990000"}[}]{style="color: #990000"} - Models fixed effects with a **boosted tree** and combines random effects with **Gaussian Processes** somehow. Multiple likelihoods available
 - [{]{style="color: #990000"}[skewlmm](https://github.com/fernandalschumacher/skewlmm){style="color: #990000"}[}]{style="color: #990000"} ([Paper](https://arxiv.org/abs/2002.01040)) - Fits **skew robust** linear mixed models, using scale mixture of skew-normal linear mixed models with possible within-subject dependence structure, using an EM-type algorithm.
+- [{]{style="color: #990000"}[jlme](https://github.com/yjunechoe/jlme/){style="color: #990000"}[}]{style="color: #990000"} - Fits mixed models in Julia from R using `lmer` and `glmer` syntax (usage sketch after this hunk)
+- Supports all kinds of diagnostics and CIs
 :::

 - Mixed Effects Model = Random Effects model = Multilevel model = Hierarchical model
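A minimal usage sketch for {jlme} (not part of the commit; `jlme_setup()` and `jlmer()` are the setup/fitting functions named in the package README, so verify against the installed version):

``` r
# Hedged sketch: fit an lmer-style model in Julia from R via {jlme}
library(jlme)
library(lme4)    # only for the sleepstudy example data

jlme_setup()     # start a Julia session with MixedModels.jl

fit <- jlmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fit              # print the Julia MixedModels fit

# the package also provides broom-style tidiers and a helper to shut Julia
# down; see the README for the exact names
```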
@@ -64,6 +66,11 @@ fig-cap-location: top
 - [Numerical validation as a critical aspect in bringing R to the Clinical Research](https://www.researchgate.net/publication/345778861_Numerical_validation_as_a_critical_aspect_in_bringing_R_to_the_Clinical_Research)
 - Slides that show various discrepancies between R output and programs like SAS and SPSS, and solutions.
 - Procedure for adopting packages into core analysis procedures (i.e. popularity, documentation, author activity, etc.)
+- [Explaining Fixed Effects: Random Effects Modeling of Time-Series Cross-Sectional and Panel Data](https://www.cambridge.org/core/journals/political-science-research-and-methods/article/explaining-fixed-effects-random-effects-modeling-of-timeseries-crosssectional-and-panel-data/0334A27557D15848549120FE8ECD8D63)
+- Describes heterogeneity bias in repeated-measures / longitudinal models. This estimation bias occurs because the varying-slopes variable is also included in the fixed effects. In some cases (?), it creates correlation between the fixed-effect predictor and the error terms.
+- Solution: Demeaning, which is described in this [{parameters}]{style="color: #990000"} [vignette](https://easystats.github.io/parameters/articles/demean.html), splits the variable into between (group mean) and within (deviation from group mean) components. (A hedged sketch follows this hunk.)
+- Need to read the paper. The vignette doesn't describe what fitting a mixed effects model without demeaning looks like. I'd like to compare both models.
+- [{]{style="color: #990000"}[performance::check_heterogeneity_bias](https://easystats.github.io/performance/reference/check_heterogeneity_bias.html){style="color: #990000"}[}]{style="color: #990000"} - See the vignette for a better example of its usage.
 - Advantages of a mixed model ($y \sim x + (x \;|\; g)$) vs a linear model with an interaction ($y \sim x \ast g$)
 - From T.J. Mahr [tweet](https://twitter.com/tjmahr/status/1504124329319096323)
 - Conceptual: Assumes participant means are drawn from the same latent population
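A hedged sketch of the demeaning workflow described above (not part of the commit; `demean()` currently lives in {datawizard}, and argument names such as `select`/`by` vary across easystats releases, so check the help pages):

``` r
# Hedged sketch: split a level-one predictor into between/within components,
# then enter both in the mixed model
library(datawizard)   # demean()
library(performance)  # check_heterogeneity_bias()
library(lme4)

set.seed(1)
d <- data.frame(
  id = rep(1:30, each = 5),   # subject id
  x  = rnorm(150),            # time-varying (level-one) predictor
  y  = rnorm(150)
)

# flags predictors whose between/within components may bias the estimates
# (argument names assumed; older versions use `group` instead of `by`)
check_heterogeneity_bias(d, select = "x", by = "id")

# adds x_between (group mean) and x_within (deviation from the group mean)
d <- cbind(d, demean(d, select = "x", by = "id"))

# demeaned model: within- and between-effects estimated separately
m_demeaned <- lmer(y ~ x_within + x_between + (1 | id), data = d)

# for comparison (what the vignette doesn't show): the un-demeaned model
m_raw <- lmer(y ~ x + (1 | id), data = d)
```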
@@ -210,14 +217,14 @@ fig-cap-location: top
 - **Fixed Effects** provide estimates of mean-differences or slopes.
 - "Fixed" because they are effects that are constant for each subject/unit
 - Should include within-subject variables (e.g. random effect slope variables) and between-subject variables (e.g. gender)
-- Level One: variables measured at the most frequently occurring observational unit
-- i.e. Vary for each repeated measure of a subject and vary between subjects
+- *Level One*: Variables measured at the most frequently occurring observational unit (toy example after this hunk)
+- i.e. Vary for each repeated measure of a subject and vary between subjects ("within"-effect)
 - In the dataset, these are variables that (for the most part) have different values for each row
 - Time-dependent if you have longitudinal data
 - For a RE model, these are usually the adjustment variables
 - e.g. Conditioning on a confounder
-- Level Two: variables measured at the observational unit level
-- i.e. Constant for each repeated measure of a subject but vary between each subject
+- *Level Two*: Variables measured at the observational unit level
+- i.e. Constant for each repeated measure of a subject but vary between each subject ("between"-effect)
 - For a RE model, these are usually the treatment variables or variables of interest
 - They should contain the information about the between-subject variation
 - If a factor variable, it has levels which would not change in replications of the study
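A toy long-format illustration of the two levels (not part of the commit; the variable names are made up):

``` r
# Hedged illustration: level-one vs level-two variables in repeated-measures data
library(tibble)

tribble(
  ~id, ~wave, ~income, ~gender,
  1,   1,     21.5,    "F",
  1,   2,     23.0,    "F",  # income varies across waves -> level one ("within")
  2,   1,     18.2,    "M",
  2,   2,     18.9,    "M"   # gender is constant within an id -> level two ("between")
)
```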
@@ -602,7 +609,7 @@ fig-cap-location: top

 - [Example 1]{.ribbon-highlight}: [{lme4}]{style="color: #990000"}

-- From [Estimating multilevel models for change in R](https://www.alexcernat.com/etimating-multilevel-models-for-change-in-r)
+- From [Estimating multilevel models for change in R](https://longitudinalanalysis.com/estimating-and-visualizing-multilevel-models-for-change-in-r/) (hedged model sketch after this hunk)
 - [usl]{.var-text} is UK sociological survey data
 - [logincome]{.var-text} is a logged income variable
 - [pidp]{.var-text} is the person's id
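A hedged sketch of the kind of growth model the linked example fits (not part of the commit; [usl]{.var-text} must be loaded from the post's survey data, and the time variable name used here, `wave`, is a placeholder):

``` r
# Hedged sketch: multilevel model for change with {lme4}
library(lme4)

# usl: long-format UK survey data from the linked post (not bundled here)
# logincome: logged income; pidp: person id; wave: placeholder time variable
m_change <- lmer(logincome ~ wave + (wave | pidp), data = usl)
summary(m_change)
```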

qmd/python-general.qmd

Lines changed: 3 additions & 0 deletions
@@ -516,6 +516,9 @@
 uv add git+https://github.com/psf/requests

 uv remove requests
+
+# add deps from a req.txt to lock file
+cat requirements.txt | xargs -n 1 uv add
 ```

 - Adds dependencies to your pyproject.toml
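A possible shortcut, assuming a recent uv release: `uv add -r requirements.txt` imports the requirements file directly instead of piping through `xargs`; verify with `uv add --help` on the installed version before relying on it.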

scrapsheet.qmd

Lines changed: 0 additions & 235 deletions
@@ -524,241 +524,6 @@ title: "Scrapsheet"
 - our approach averages not only different coherent forecasts, but also across hierarchies with completely different middle level series. This is possible since only coherent bottom and top level forecasts are averaged and evaluated.
 - Section 2 describes the trace minimization reconciliation method (min T from {forecast})

-## tidycensus 3
-
-- rdeck
-- visualize large amounts of data
-- migration flows
-- tidycensus::get_flows
-- only for \> 2020 5-yr ACS
-- map type also good for mapping commuting patterns
-- Automated Mapping
-- Memory intensive
-- https://walker-data.com/posts/iterative-mapping/ for a more advanced metro example
-- Shows how to export too
-- geographic patterns in remote work for the 100 largest counties by population in the US (2:17)
-- important for office space real estate
-- Maps per County
-- Generate list of 100 largest counties
-``` r
-library(tidycensus)
-library(tidyverse)
-library(mapview)
-
-top100counties <- get_acs(
-  geography = "county",
-  variables = "B01003_001",
-  year = 2022,
-  survey = "acs1"
-) %>%
-  slice_max(estimate, n = 100)
-```
-
-- MOE is NA, which means this is the true value
-- Plus ACS more recent
-- pull remote work data at county level for those counties
-- Need to get tract data for remote work data
-``` r
-wfh_tract_list <- top100counties %>%
-  split(~NAME) %>% # splits into a list with each element per county
-  map(function(county) {
-    state_fips <- str_sub(county$GEOID, 1, 2) # extract first 2 chars (state)
-    county_fips <- str_sub(county$GEOID, 3, 5) # extract next 3 chars (county)
-
-    get_acs(
-      geography = "tract",
-      variables = "DP03_0024P",
-      state = state_fips,
-      county = county_fips,
-      year = 2022,
-      geometry = TRUE
-    )
-  })
-```
-
-- need census key since hitting api 100s of times
-- Make 100 Maps
-``` r
-wfh_maps <-
-  map(wfh_tract_list, function(county) {
-    mapview(
-      county,
-      zcol = "estimate",
-      layer.name = "% working from home"
-    )
-  })
-```
-
-- Small Area Time Series Analysis (2:40)
-- Where has remote work increased the most in Salt Lake City, Utah
-- 5yr acs represent overlapping samples
-- For 2018-2022
-- Compare 2008-2012 to
-- Comparison Profile only at county level
-``` r
-utah_wfh_compare <- get_acs(
-  geography = "county",
-  variables = c(
-    work_from_home17 = "CP03_2017_024",
-    work_from_home22 = "CP03_2022_024"
-  ),
-  state = "UT",
-  year = 2022
-)
-```
-
-- Census Tract (neighborhood-level)
-- Issue: geographies change
-- get more details
-- Areal Interpolation (see the [book](https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html?q=small#small-area-time-series-analysis) for more details)
-- Interpolating data between sets of boundaries involves the use of weights to re-distribute data from one geography to another
-- Check for incongruent boundaries
-``` r
-library(sf)
-
-wfh_17 <-
-  get_acs(geography = "tract",
-          variables = "B08006_017",
-          year = 2017,
-          state = "UT",
-          county = "Salt Lake",
-          geometry = TRUE) |>
-  st_transform(6620)
-
-wfh_22 <-
-  get_acs(geography = "tract",
-          variables = "B08006_017",
-          year = 2022,
-          state = "UT",
-          county = "Salt Lake",
-          geometry = TRUE) |>
-  st_transform(6620)
-```
-
-- The process is quicker on a projected coordinate system
-- [EPSG:6620](https://epsg.io/6620) is NAD83(2011) / Utah North
-- Area-Weighted Areal Interpolation
-``` r
-library(sf)
-library(mapview)
-library(leafsync)
-
-wfh_22_to_17 <- wfh_22 |>
-  select(estimate) |>
-  st_interpolate_aw(to = wfh_17, extensive = TRUE)
-
-m22a <- mapview(wfh_22, zcol = "estimate", layer.name = "2020 geographies")
-m17a <- mapview(wfh_22_to_17, zcol = "estimate", layer.name = "2015 geographies")
-
-sync(m22a, m17a)
-```
-
-- **Area-Weighted Interpolation** allocates information from one geography to another geography by weights based on the area of overlap ([Walker, Ch. 7.3.1](https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html?q=small#area-weighted-areal-interpolation))
-- Typically more accurate when going *backward*, as many new tracts will “roll up” within parent tracts from a previous Census (though not always) (a.k.a. rolling backwards)
-- The book has an example that rolls *forwards* from 2015 to 2020.
-- Beware: This may be very inaccurate as it assumes that the population is evenly distributed over the area. It can incorrectly allocate large values to low-density / empty areas.
-- Better to use Population-Weighted Areal Interpolation
-- [extensive = TRUE]{.arg-text} says weighted sums will be computed. Alternatively, if [extensive = FALSE]{.arg-text}, the function returns weighted means.
-- Population-Weighted Areal Interpolation
-``` r
-library(tigris)
-options(tigris_use_cache = TRUE)
-
-salt_lake_blocks <-
-  tigris::blocks(
-    "UT",
-    "Salt Lake",
-    year = 2020
-  )
-
-wfh_17_to_22 <-
-  tidycensus::interpolate_pw(
-    from = wfh_17,
-    to = wfh_22,
-    to_id = "GEOID",
-    weights = salt_lake_blocks,
-    weight_column = "POP20",
-    crs = 6620,
-    extensive = TRUE
-  )
-
-# check result
-# m17b <-
-#   mapview(wfh_17,
-#           zcol = "estimate",
-#           layer.name = "2017 geographies")
-# m22b <-
-#   mapview(wfh_17_to_22,
-#           zcol = "estimate",
-#           layer.name = "2022 geographies")
-#
-# sync(m17b, m22b)
-
-# calculate change over time
-wfh_shift <- wfh_17_to_22 %>%
-  select(GEOID, estimate17 = estimate) %>%
-  left_join(
-    select(st_drop_geometry(wfh_22),
-           GEOID,
-           estimate22 = estimate),
-    by = "GEOID"
-  ) |>
-  mutate(
-    shift = estimate22 - estimate17,
-    pct_shift = 100 * (shift / estimate17)
-  )
-
-mapview(wfh_shift, zcol = "shift")
-```
-
-- **Population-Weighted Interpolation** uses an underlying dataset that explains the population distribution as weights.
-- Recommended to use census block level data to create the weights. ACS only has geographies down to the Block Group level, so the Decennial Census values are used.
-- `blocks` gets the 2020 Decennial population values at the census block level to calculate the weights
-- `interpolate_pw` creates weights based on the 2020 census block populations. Then, it splits the 2017 weighted data into 2022 geographies.
-- The 2022 data is joined to the new 2017 data, and percent-change can now be calculated since both have 2022 geometries.
 ## lab 91

 - clvtools for prob type, h2o::automl for ML
