Iss454 #457

Open
wants to merge 9 commits into
base: version2025

awunderground
Member

  1. This PR is for Update transportation trips metric #454 but only successfully updates the scripts for counties.
  2. A description of the content in this pull request.
  • Changes:
    • Consolidates years and metrics (transit ridership and transportation cost) so each calculation happens one-quarter as often as with the past data.
    • Adds clearer documentation and diagnostics.
    • Switches the race data to race-ethnicity data. Note: the footnote in the UMI dashboard is incorrect.
    • Switches the data quality flag to be based on unweighted sample size instead of the weighted number of households.
    • Combines output into 4 files instead of 8 files.
    • Removes percentile ranking for transit trips.
    • Updates the metric name to be more accurate.
  • Please focus on the county-level data. Someone else will need to fix the joins in the place data.
  3. Detail on any issues or flags that the metric reviewer/data-team should be aware of.

I think the data quality for the transit trips is poor based on how much the values change between 2015 and 2019.

Contributor

@Deckart2 Deckart2 left a comment

Hi @awunderground,

Thank you very much for the hard and quick work to get this code processed. I left some comments below that I think would be helpful to resolve, but they will not lead to changes in the county-scale data.

Per your guidance when requesting the PR, I did not review the city data in detail, though I left two very quick comments. Given that I haven't reviewed that data, I don't want to merge the city-scale code in. Perhaps we keep this branch open, and when JP makes his changes (also to this branch), we can re-review that code and then merge it all in.

Also just flagging here I have not reviewed code for the 2022 data because it is yet to be created, but I can be available for that too.

full_join(transportation_county, counties, by = c("year", "state", "county"))

anti_join(transportation_county, counties, by = c("year", "state", "county"))

Contributor

I find the commentary helpful, but stopifnot() statements are even more convincing. Maybe:

fully_joined <- full_join(transportation_county, counties, by = c("year", "state", "county"))
stopifnot(nrow(fully_joined) == nrow(counties))

Member Author

Great suggestion. I added this.
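
For reference, a minimal sketch of how the join checks discussed above could be made to fail loudly. The toy tables and their values are made up for illustration, and the sketch assumes the keys are unique within each table; the real `transportation_county` and `counties` objects come from the script under review.

```r
library(dplyr)

# Toy stand-ins for the real tables (values are invented for illustration).
counties <- tibble(
  year   = c(2015, 2015, 2019, 2019),
  state  = c("01", "11", "01", "11"),
  county = c("001", "001", "001", "001")
)

transportation_county <- counties |>
  mutate(count_transit_trips = c(12, 22, 15, 40))

keys <- c("year", "state", "county")

# Rows in the metric table with no match in the county universe.
unmatched_metric <- anti_join(transportation_county, counties, by = keys)

# Rows in the county universe with no metric data.
unmatched_universe <- anti_join(counties, transportation_county, by = keys)

# With unique keys, a full join only adds rows when keys fail to match,
# so its row count should equal the row count of the county universe.
fully_joined <- full_join(transportation_county, counties, by = keys)

stopifnot(
  nrow(unmatched_metric) == 0,
  nrow(unmatched_universe) == 0,
  nrow(fully_joined) == nrow(counties)
)
```

Checking both anti-join directions makes it clear which table is missing rows if the row-count assertion ever fails.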

geom_histogram(binwidth = 5) +
facet_wrap(~ year, nrow = 2) +
labs(
x = "Annual Transit Trips for the Regional Moderate Income Household",
Contributor

I had been using "moderate income" because that was H+T language too, but Claudia suggested we use "low income" in the blog post.


```

Makes sense for most counties to fall in really low transit trip numbers
Contributor

Not a legitimate comment / analysis for transportation cost data. The key insight I have is that the distribution looks relatively similar between 2015 and 2019.

It would also be helpful to note that the x axis is a share of total household costs spent on transportation.

transportation_county <- transportation_county |>
rename(
count_transit_trips = transit_trips_80ami,
index_transportation_cost = t_80ami
Contributor

This may go above my level of control, but I don't think index_transportation_cost is a good variable name because I don't think it is really an index. Rather, it seems like it is a share (i.e., the share of annual household income spent on transportation).

I would suggest share_hh_transportation_spending as an alternative.


### Read data

The data from HUD cannot be easily read directly into this program.
Contributor

I think this comment is confusing. We don't get data directly or indirectly from HUD.

Member Author

Removed!


```{r}
transportation_tracts <- transportation_tracts |>
rename (GEOID = tract) |>
Contributor

extra space here

Member Author

Fixed!

acs_tracts <- acs_tracts |>
rename(
total_population = B03002_001E,
non_hispanic = B03002_002E,
Contributor

I don't think you actually use this but that is not a problem

Member Author

non_hispanic? I think you are correct. I just pulled that because I was running some checks about the overlapping/non-overlapping hierarchy in the categories in the data.

@awunderground awunderground requested review from jwalsh28 and Deckart2 and removed request for Deckart2 March 4, 2025 15:38
Collaborator

@jwalsh28 jwalsh28 left a comment

The new method is looking good, Aaron!

Two big requests:

  1. Move the final evaluation form to the correct folder (link in comments) and update the file path in the evaluation form
  2. Consider how change in the trip count variable between years can be baked into the quality variable. I lay out a potential method in the comments. The large variations between years concern me.

@@ -0,0 +1,9 @@
,This form to be filled in for the data in the subgroup files. If the metric has multiple variables please include input for each variable in the file.,,,,,
Collaborator

We updated the destination for these files to https://github.com/UI-Research/mobility-from-poverty/tree/version2025/10a_final-evaluation. Can you move this file there and update the path in your evaluation functions?

Member Author

Updated.

relevant variables and years

```{r}
transport_county_2015 <- read_csv(
Collaborator

Not a priority for this round but note for future update that we should make this read_csv a function and have the list of years be the only item that is updated each round.

Member Author

Good note. Added!


```
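
The suggestion above to wrap `read_csv` in a function, so that only the list of years changes each round, might look roughly like the sketch below. The helper name `read_transport_county()`, the file-name pattern, and the example columns are all assumptions for illustration, not the repository's actual layout.

```r
library(readr)
library(dplyr)

# Hypothetical reader: the file-name pattern is an assumption.
read_transport_county <- function(yr, dir) {
  path <- file.path(dir, paste0("transport_county_", yr, ".csv"))
  read_csv(path, col_types = cols(.default = col_character())) |>
    mutate(year = yr)
}

# Write two tiny example files so the sketch runs on its own.
example_dir <- tempfile("transport_")
dir.create(example_dir)
write_csv(
  tibble(county = "001", t_80ami = "21"),
  file.path(example_dir, "transport_county_2015.csv")
)
write_csv(
  tibble(county = "001", t_80ami = "23"),
  file.path(example_dir, "transport_county_2019.csv")
)

# The vector of years is the only thing to update each round.
years <- c(2015, 2019)

transport_county <- lapply(years, read_transport_county, dir = example_dir) |>
  bind_rows()
```

Reading every column as character keeps leading zeros in FIPS-style codes; numeric columns would then need an explicit conversion downstream.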

We transform transportation cost from an unlabeled percentage to a proportion.
Collaborator

Perhaps a note on what t_80ami is? It could help users better understand the transformation.

Member Author

Added!
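
As a small illustration of the percentage-to-proportion step discussed in this thread, here is a sketch with made-up values. It assumes the renamed cost variable arrives on a 0 to 100 scale, which is how the thread describes it.

```r
library(dplyr)

# Made-up values; the variable is assumed to be a percentage (0-100).
transportation_county <- tibble(
  county = c("001", "003"),
  index_transportation_cost = c(21, 34.5)
)

# Convert the unlabeled percentage to a proportion.
transportation_county <- transportation_county |>
  mutate(index_transportation_cost = index_transportation_cost / 100)

# A proportion should fall between 0 and 1.
stopifnot(all(between(transportation_county$index_transportation_cost, 0, 1)))
```

The range assertion would also catch a file that was already on the 0 to 1 scale being divided a second time only if values looked implausibly small, so a comment on the input scale is still worth keeping next to the transformation.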


```

The transit trips index is very noisy and tough to interpret. We topcode values at 1,095 (365 * 3), divide the range into 100 bins, and assign values to those bins (using a linear transformation and rounding). Percentile ranking did not work well because there were many ties and it obfuscated the distribution of the variable.
Collaborator

A visual of the raw distribution would be helpful to have

Collaborator

Can you add a bit of explanation on how the topcode value was chosen?

Member Author

Added and added!


```{r}
transportation_county <- transportation_county |>
mutate(count_transit_trips = pmin(1065, count_transit_trips)) |>
Collaborator

Above you say you are topcoding values at 1,095 (which is the product of 365 and 3 per your note) but this code topcodes at 1,065.

Member Author

Whoops! Fixed!

```{r}
transportation_county <- transportation_county |>
mutate(count_transit_trips = pmin(1065, count_transit_trips)) |>
mutate(score_transit_trips = round((count_transit_trips / 1065) * 100))
Collaborator

Same here (1,065 instead of 1,095)

Member Author

Whoops! Fixed!
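
One way to avoid the 1,065 versus 1,095 mismatch is to define the topcode once and reuse it. The sketch below works through the topcode-and-scale step on illustrative trip counts; the `topcode` object and the toy data are assumptions, not the committed code.

```r
library(dplyr)

# Topcode described in the text: 365 * 3 = 1,095.
topcode <- 365 * 3

# Illustrative trip counts, including one value above the topcode.
transportation_county <- tibble(
  county = c("a", "b", "c", "d"),
  count_transit_trips = c(0, 22, 301, 1150)
)

transportation_county <- transportation_county |>
  mutate(
    # Cap values at the topcode.
    count_transit_trips = pmin(topcode, count_transit_trips),
    # Linearly rescale to 0-100 and round to whole-number bins.
    score_transit_trips = round((count_transit_trips / topcode) * 100)
  )
```

Here 22 trips maps to a score of 2, 301 trips to 27, and anything at or above 1,095 to 100. Because the constant appears in one place, the prose and the code cannot drift apart.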

geom_point(alpha = 0.1) +
facet_wrap(~ size) +
coord_equal() +
labs(subtitle = "Large counties have at least 200,000 households")
Collaborator

Great viz. I am not sure what to make of the large counties that show such drastic changes in transit trips between 2015 and 2022. Looking at one of the outlier cases, the data for DC show 22 trips for transit_trips_80ami in 2015, 1,150 in 2019, and then back down to 301 in 2022.

Run the evaluation function.

```{r}
evaluate_final_data(
Collaborator

See note from above, update file path after moving the evaluation form to the 10b folder.

Member Author

Fixed.


```

## Data Quality Marker
Collaborator

I worry that we do not consider extreme variations in the count of transit trips reported for certain counties in the quality variable. A loose concept I have for baking this into quality is as follows:

  • Take the product of count_transit_trips and households variables to create a new variable estimated_total_trips
  • Calculate the delta in this variable from the year prior (for the oldest year it will be the year following)
  • If the change exceeds X percent we should give a quality of 3 for that year (not sure how to select X criteria)

I want to include population because smaller counties may be more prone to large shifts in the average but this should be smaller for estimated total trips. Happy to brainstorm this further with you.
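
A rough sketch of the three steps above, purely to make the idea concrete. The threshold `max_pct_change`, the `households` values, and the output column `large_change` are hypothetical; the thread leaves X undecided, and 50 is used here only so the toy data produce both outcomes.

```r
library(dplyr)

# Hypothetical threshold: X is not settled in the discussion.
max_pct_change <- 50

# Made-up data for two counties across three years.
transportation_county <- tibble(
  county = rep(c("a", "b"), each = 3),
  year = rep(c(2015, 2019, 2022), times = 2),
  count_transit_trips = c(22, 1150, 301, 40, 45, 30),
  households = c(1000, 1000, 1000, 500, 500, 500)
)

flagged <- transportation_county |>
  mutate(estimated_total_trips = count_transit_trips * households) |>
  group_by(county) |>
  arrange(year, .by_group = TRUE) |>
  mutate(
    # Compare with the prior year; the oldest year has no prior year,
    # so it is compared with the following year instead.
    comparison_trips = coalesce(
      lag(estimated_total_trips),
      lead(estimated_total_trips)
    ),
    # Absolute percent change relative to the comparison year.
    # A comparison value of 0 would give Inf or NaN and needs handling.
    pct_change = abs(estimated_total_trips - comparison_trips) /
      comparison_trips * 100,
    large_change = pct_change > max_pct_change
  ) |>
  ungroup()
```

In this toy data every year for county "a" is flagged and no year for county "b" is. Note that percent change is asymmetric: a drop can never exceed 100 percent of the comparison value, while an increase is unbounded, so the choice of X and of the base year matters.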

Member Author

Thank you for your thoughtfulness. I share all of your concerns. This is a really interesting idea.

  • The values can change a lot for natural reasons. For example, there is a big drop everywhere between 2019 and 2022 because of changes in commuting patterns.
  • "Calculate the delta in this variable from the year prior (for the oldest year it will be the year following)" -- Can you unpack this? What do I do for 2015?

@awunderground awunderground requested a review from jwalsh28 March 5, 2025 15:52