
Commit b0ec1cd

cipi >> paper, recs; me-gen >> {jlme}; py-gen >> req.txt to uv lock
1 parent 0bfb13e commit b0ec1cd

4 files changed (+21, -240 lines)

qmd/confidence-and-prediction-intervals.qmd

Lines changed: 6 additions & 0 deletions
@@ -187,6 +187,12 @@
 - [article](https://eranraviv.com/bootstrap-standard-error-estimates-good-news/)
 - bootstrap is "based on a weak convergence of moments"
 - if you use a bootstrap-based estimate of the standard deviation, you are being overly conservative (i.e. you overestimate the sd)
+- Recommendations ([Paper](https://journals.sagepub.com/doi/full/10.1177/2515245920911881)) (see the {boot} sketch after this hunk)
+- The percentile bootstrap works well when making inferences about trimmed means, quantiles, or correlation coefficients.
+- "However, percentile-bootstrap confidence intervals tend to be inaccurate in some situations because the bootstrap sampling distribution is skewed (asymmetric) and biased (consistently shifted away from the population value in one direction)"
+- "To address these problems, two major alternatives to the percentile bootstrap have been suggested: the bootstrap-t and the bias-corrected and accelerated (BCa) bootstrap"
+- Although "bootstrap-t can lead to more accurate confidence intervals for the mean and some trimmed means than the percentile bootstrap does", a percentile bootstrap is recommended for inferences about the 20% trimmed mean
+- The BCa approach can be unsatisfactory for relatively small sample sizes
 - Packages
 - [{]{style="color: #990000"}[ebtools::get_boot_ci](https://ercbk.github.io/ebtools/reference/get_boot_ci.html){style="color: #990000"}[}]{style="color: #990000"}
 - [{]{style="color: #990000"}[workboots](https://markjrieke.github.io/workboots/){style="color: #990000"}[}]{style="color: #990000"} - Bootstrap prediction intervals for arbitrary model types from a tidymodels workflow.
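To make the recommendations concrete, here is a minimal sketch (not part of the commit) that computes a percentile bootstrap CI for a 20% trimmed mean with {boot} and requests the BCa interval alongside it for comparison:

``` r
# Hedged sketch: percentile and BCa bootstrap CIs for a 20% trimmed mean
library(boot)

set.seed(123)
x <- rlnorm(50) # a skewed sample

# the statistic must accept the data and a vector of resampled indices
trim20 <- function(data, idx) mean(data[idx], trim = 0.2)

boot_out <- boot(data = x, statistic = trim20, R = 2000)

# "perc" = percentile interval (recommended for the 20% trimmed mean);
# "bca"  = bias-corrected and accelerated interval
boot.ci(boot_out, conf = 0.95, type = c("perc", "bca"))
```

The bootstrap-t interval (`type = "stud"`) would additionally require the statistic to return a variance estimate, so it is left out of this sketch.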

qmd/mixed-effects-general.qmd

Lines changed: 12 additions & 5 deletions
@@ -43,6 +43,8 @@ fig-cap-location: top
 - When you have **variable cluster sizes**, inverse cluster size weights can be specified to ensure that all clusters contribute equally regardless of cluster size, which mitigates a loss of power.
 - [{]{style="color: #990000"}[gpboost](https://github.com/fabsig/GPBoost){style="color: #990000"}[}]{style="color: #990000"} - Models fixed effects with a **boosted tree** and combines random effects with **Gaussian Processes** somehow. Multiple likelihoods available
 - [{]{style="color: #990000"}[skewlmm](https://github.com/fernandalschumacher/skewlmm){style="color: #990000"}[}]{style="color: #990000"} ([Paper](https://arxiv.org/abs/2002.01040)) - Fits **skew robust** linear mixed models, using scale mixture of skew-normal linear mixed models with possible within-subject dependence structure, using an EM-type algorithm.
+- [{]{style="color: #990000"}[jlme](https://github.com/yjunechoe/jlme/){style="color: #990000"}[}]{style="color: #990000"} - Fits mixed models in Julia from R using `lmer` and `glmer` syntax (usage sketch after this hunk)
+- Supports all kinds of diagnostics and CIs
 :::

 - Mixed Effects Model = Random Effects model = Multilevel model = Hierarchical model
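A minimal usage sketch for {jlme} (not part of the commit; `jlme_setup()` and `jlmer()` are the setup/fitting functions named in the package README, so verify against the installed version):

``` r
# Hedged sketch: fit an lmer-style model in Julia from R via {jlme}
library(jlme)
library(lme4)    # only for the sleepstudy example data

jlme_setup()     # start a Julia session with MixedModels.jl

fit <- jlmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fit              # print the Julia MixedModels fit

# the package also provides broom-style tidiers and a helper to shut Julia
# down; see the README for the exact names
```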
@@ -64,6 +66,11 @@ fig-cap-location: top
 - [Numerical validation as a critical aspect in bringing R to the Clinical Research](https://www.researchgate.net/publication/345778861_Numerical_validation_as_a_critical_aspect_in_bringing_R_to_the_Clinical_Research)
 - Slides that show various discrepancies between R output and programs like SAS and SPSS, and solutions.
 - Procedure for adopting packages into core analysis procedures (i.e. popularity, documentation, author activity, etc.)
+- [Explaining Fixed Effects: Random Effects Modeling of Time-Series Cross-Sectional and Panel Data](https://www.cambridge.org/core/journals/political-science-research-and-methods/article/explaining-fixed-effects-random-effects-modeling-of-timeseries-crosssectional-and-panel-data/0334A27557D15848549120FE8ECD8D63)
+- Describes heterogeneity bias in repeated-measures / longitudinal models. This estimation bias occurs because the varying-slopes variable is also included in the fixed effects. In some cases (?), it creates correlation between the fixed-effect predictor and the error terms.
+- Solution: Demeaning, which is described in this [{parameters}]{style="color: #990000"} [vignette](https://easystats.github.io/parameters/articles/demean.html), splits the variable into between (group mean) and within (deviation from group mean) components. (A hedged sketch follows this hunk.)
+- Need to read the paper. The vignette doesn't describe what fitting a mixed effects model without demeaning looks like. I'd like to compare both models.
+- [{]{style="color: #990000"}[performance::check_heterogeneity_bias](https://easystats.github.io/performance/reference/check_heterogeneity_bias.html){style="color: #990000"}[}]{style="color: #990000"} - See the vignette for a better example of its usage.
 - Advantages of a mixed model ($y \sim x + (x \;|\; g)$) vs a linear model with an interaction ($y \sim x \ast g$)
 - From T.J. Mahr [tweet](https://twitter.com/tjmahr/status/1504124329319096323)
 - Conceptual: Assumes participant means are drawn from the same latent population
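A hedged sketch of the demeaning workflow described above (not part of the commit; `demean()` currently lives in {datawizard}, and argument names such as `select`/`by` vary across easystats releases, so check the help pages):

``` r
# Hedged sketch: split a level-one predictor into between/within components,
# then enter both in the mixed model
library(datawizard)   # demean()
library(performance)  # check_heterogeneity_bias()
library(lme4)

set.seed(1)
d <- data.frame(
  id = rep(1:30, each = 5),   # subject id
  x  = rnorm(150),            # time-varying (level-one) predictor
  y  = rnorm(150)
)

# flags predictors whose between/within components may bias the estimates
# (argument names assumed; older versions use `group` instead of `by`)
check_heterogeneity_bias(d, select = "x", by = "id")

# adds x_between (group mean) and x_within (deviation from the group mean)
d <- cbind(d, demean(d, select = "x", by = "id"))

# demeaned model: within- and between-effects estimated separately
m_demeaned <- lmer(y ~ x_within + x_between + (1 | id), data = d)

# for comparison (what the vignette doesn't show): the un-demeaned model
m_raw <- lmer(y ~ x + (1 | id), data = d)
```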
@@ -210,14 +217,14 @@ fig-cap-location: top
 - **Fixed Effects** provide estimates of mean-differences or slopes.
 - "Fixed" because they are effects that are constant for each subject/unit
 - Should include within-subject variables (e.g. random effect slope variables) and between-subject variables (e.g. gender)
-- Level One: variables measured at the most frequently occurring observational unit
-- i.e. Vary for each repeated measure of a subject and vary between subjects
+- *Level One*: Variables measured at the most frequently occurring observational unit (toy example after this hunk)
+- i.e. Vary for each repeated measure of a subject and vary between subjects ("within"-effect)
 - In the dataset, these are variables that (for the most part) have different values for each row
 - Time-dependent if you have longitudinal data
 - For a RE model, these are usually the adjustment variables
 - e.g. Conditioning on a confounder
-- Level Two: variables measured at the observational unit level
-- i.e. Constant for each repeated measure of a subject but vary between each subject
+- *Level Two*: Variables measured at the observational unit level
+- i.e. Constant for each repeated measure of a subject but vary between each subject ("between"-effect)
 - For a RE model, these are usually the treatment variables or variables of interest
 - They should contain the information about the between-subject variation
 - If a factor variable, it has levels which would not change in replications of the study
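A toy long-format illustration of the two levels (not part of the commit; the variable names are made up):

``` r
# Hedged illustration: level-one vs level-two variables in repeated-measures data
library(tibble)

tribble(
  ~id, ~wave, ~income, ~gender,
  1,   1,     21.5,    "F",
  1,   2,     23.0,    "F",  # income varies across waves -> level one ("within")
  2,   1,     18.2,    "M",
  2,   2,     18.9,    "M"   # gender is constant within an id -> level two ("between")
)
```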
@@ -602,7 +609,7 @@ fig-cap-location: top

 - [Example 1]{.ribbon-highlight}: [{lme4}]{style="color: #990000"}

-- From [Estimating multilevel models for change in R](https://www.alexcernat.com/etimating-multilevel-models-for-change-in-r)
+- From [Estimating multilevel models for change in R](https://longitudinalanalysis.com/estimating-and-visualizing-multilevel-models-for-change-in-r/) (hedged model sketch after this hunk)
 - [usl]{.var-text} is UK sociological survey data
 - [logincome]{.var-text} is a logged income variable
 - [pidp]{.var-text} is the person's id
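A hedged sketch of the kind of growth model the linked example fits (not part of the commit; [usl]{.var-text} must be loaded from the post's survey data, and the time variable name used here, `wave`, is a placeholder):

``` r
# Hedged sketch: multilevel model for change with {lme4}
library(lme4)

# usl: long-format UK survey data from the linked post (not bundled here)
# logincome: logged income; pidp: person id; wave: placeholder time variable
m_change <- lmer(logincome ~ wave + (wave | pidp), data = usl)
summary(m_change)
```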

qmd/python-general.qmd

Lines changed: 3 additions & 0 deletions
@@ -516,6 +516,9 @@
 uv add git+https://github.com/psf/requests

 uv remove requests
+
+# add deps from a req.txt to lock file
+cat requirements.txt | xargs -n 1 uv add
 ```

 - Adds dependencies to your pyproject.toml
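A possible shortcut, assuming a recent uv release: `uv add -r requirements.txt` imports the requirements file directly instead of piping through `xargs`; verify with `uv add --help` on the installed version before relying on it.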

scrapsheet.qmd

Lines changed: 0 additions & 235 deletions
@@ -524,241 +524,6 @@ title: "Scrapsheet"
 - our approach averages not only different coherent forecasts, but also across hierarchies with completely different middle level series. This is possible since only coherent bottom and top level forecasts are averaged and evaluated.
 - Section 2 describes the trace minimization reconciliation method (min T from {forecast})

-## tidycensus 3
-
-- rdeck
-- visualize large amounts of data
-- migration flows
-- tidycensus::get_flows
-- only for \> 2020 5-yr ACS
-- map type also good for mapping commuting patterns
-- Automated Mapping
-- Memory intensive
-- https://walker-data.com/posts/iterative-mapping/ for a more advanced metro example
-- Shows how to export too
-- geographic patterns in remote work for the 100 largest counties by population in the US (2:17)
-- important for office space real estate
-- Maps per County
-- Generate list of 100 largest counties
-``` r
-library(tidycensus)
-library(tidyverse)
-library(mapview)
-
-top100counties <- get_acs(
-  geography = "county",
-  variables = "B01003_001",
-  year = 2022,
-  survey = "acs1"
-) %>%
-  slice_max(estimate, n = 100)
-```
-
-- MOE is NA, which means this is the true value
-- Plus ACS more recent
-- pull remote work data at county level for those counties
-- Need to get tract data for remote work data
-``` r
-wfh_tract_list <- top100counties %>%
-  split(~NAME) %>% # splits into a list with each element per county
-  map(function(county) {
-    state_fips <- str_sub(county$GEOID, 1, 2) # extract first 2 chars (state)
-    county_fips <- str_sub(county$GEOID, 3, 5) # extract next 3 chars (county)
-
-    get_acs(
-      geography = "tract",
-      variables = "DP03_0024P",
-      state = state_fips,
-      county = county_fips,
-      year = 2022,
-      geometry = TRUE
-    )
-  })
-```
-
-- need census key since hitting api 100s of times
-- Make 100 Maps
-``` r
-wfh_maps <-
-  map(wfh_tract_list, function(county) {
-    mapview(
-      county,
-      zcol = "estimate",
-      layer.name = "% working from home"
-    )
-  })
-```
-
-- Small Area Time Series Analysis (2:40)
-- Where has remote work increased the most in Salt Lake City, Utah
-- 5yr acs represent overlapping samples
-- For 2018-2022
-- Compare 2008-2012 to
-- Comparison Profile only at county level
-``` r
-utah_wfh_compare <- get_acs(
-  geography = "county",
-  variables = c(
-    work_from_home17 = "CP03_2017_024",
-    work_from_home22 = "CP03_2022_024"
-  ),
-  state = "UT",
-  year = 2022
-)
-```
-
-- Census Tract (neighborhood-level)
-- Issue: geographies change
-- get more details
-- Areal Interpolation (see the [book](https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html?q=small#small-area-time-series-analysis) for more details)
-- Interpolating data between sets of boundaries involves the use of weights to re-distribute data from one geography to another
-- Check for incongruent boundaries
-``` r
-library(sf)
-
-wfh_17 <-
-  get_acs(geography = "tract",
-          variables = "B08006_017",
-          year = 2017,
-          state = "UT",
-          county = "Salt Lake",
-          geometry = TRUE) |>
-  st_transform(6620)
-
-wfh_22 <-
-  get_acs(geography = "tract",
-          variables = "B08006_017",
-          year = 2022,
-          state = "UT",
-          county = "Salt Lake",
-          geometry = TRUE) |>
-  st_transform(6620)
-```
-
-- The process is quicker on a projected coordinate system
-- [EPSG:6620](https://epsg.io/6620) is NAD83(2011) / Utah North
-- Area-Weighted Areal Interpolation
-``` r
-library(sf)
-library(mapview)
-library(leafsync)
-
-wfh_22_to_17 <- wfh_22 |>
-  select(estimate) |>
-  st_interpolate_aw(to = wfh_17, extensive = TRUE)
-
-m22a <- mapview(wfh_22, zcol = "estimate", layer.name = "2020 geographies")
-m17a <- mapview(wfh_22_to_17, zcol = "estimate", layer.name = "2015 geographies")
-
-sync(m22a, m17a)
-```
-
-- **Area-Weighted Interpolation** allocates information from one geography to another geography by weights based on the area of overlap ([Walker, Ch. 7.3.1](https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html?q=small#area-weighted-areal-interpolation))
-- Typically more accurate when going *backward*, as many new tracts will “roll up” within parent tracts from a previous Census (though not always) (a.k.a. rolling backwards)
-- The book has an example that rolls *forwards* from 2015 to 2020.
-- Beware: This may be very inaccurate as it assumes that the population is evenly distributed over the area. It can incorrectly allocate large values to low-density / empty areas.
-- Better to use Population-Weighted Areal Interpolation
-- [extensive = TRUE]{.arg-text} says weighted sums will be computed. Alternatively, if [extensive = FALSE]{.arg-text}, the function returns weighted means.
-- Population-Weighted Areal Interpolation
-``` r
-library(tigris)
-options(tigris_use_cache = TRUE)
-
-salt_lake_blocks <-
-  tigris::blocks(
-    "UT",
-    "Salt Lake",
-    year = 2020
-  )
-
-wfh_17_to_22 <-
-  tidycensus::interpolate_pw(
-    from = wfh_17,
-    to = wfh_22,
-    to_id = "GEOID",
-    weights = salt_lake_blocks,
-    weight_column = "POP20",
-    crs = 6620,
-    extensive = TRUE
-  )
-
-# check result
-# m17b <-
-#   mapview(wfh_17,
-#           zcol = "estimate",
-#           layer.name = "2017 geographies")
-# m22b <-
-#   mapview(wfh_17_to_22,
-#           zcol = "estimate",
-#           layer.name = "2022 geographies")
-#
-# sync(m17b, m22b)
-
-# calculate change over time
-wfh_shift <- wfh_17_to_22 %>%
-  select(GEOID, estimate17 = estimate) %>%
-  left_join(
-    select(st_drop_geometry(wfh_22),
-           GEOID,
-           estimate22 = estimate),
-    by = "GEOID"
-  ) |>
-  mutate(
-    shift = estimate22 - estimate17,
-    pct_shift = 100 * (shift / estimate17)
-  )
-
-mapview(wfh_shift, zcol = "shift")
-```
-
-- **Population-Weighted Interpolation** uses an underlying dataset that explains the population distribution as weights.
-- Recommended to use census block level data to create the weights. ACS only has geographies down to the Block Group level, so the Decennial Census values are used.
-- `blocks` gets the 2020 Decennial population values at the census block level to calculate the weights
-- `interpolate_pw` creates weights based on the 2020 census block populations. Then, it splits the 2017 weighted data into 2022 geographies.
-- The 2022 data is joined to the new 2017 data, and percent-change can now be calculated since both have 2022 geometries.
 ## lab 91

 - clvtools for prob type, h2o::automl for ML
