Iss454 #457
base: version2025
…les for transit trips.
Hi @awunderground,
Thank you very much for the hard and quick work to get this code processed. I left some comments below that I think would be helpful to resolve, but they will not lead to changes in the data at the county scale.
Per your guidance when requesting the PR, I did not review the city data in detail, though I left two very quick comments. Given that I haven't reviewed that data, I don't want to merge the city-scale code in. Perhaps we keep this branch open, and when JP makes his changes (also to this branch), we can re-review that code and then merge it all in.
Also, just flagging here that I have not reviewed the code for the 2022 data because it has yet to be created, but I can be available for that too.
full_join(transportation_county, counties, by = c("year", "state", "county"))

anti_join(transportation_county, counties, by = c("year", "state", "county"))
I find the commentary helpful but stopifnot() statements even more convincing. Maybe:
fully_joined <- full_join(transportation_county, counties, by = c("year", "state", "county"))
stopifnot(nrow(fully_joined) == nrow(counties))
Great suggestion. I added this.
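A minimal sketch of how the join checks in this thread could be written as assertions. The toy tables and the `keys`, `unmatched_left`, and `unmatched_right` names are assumptions for illustration and are not taken from the PR; the anti-join checks are an optional addition to the suggested row-count check.

```r
library(dplyr)

# Toy stand-ins for the real tables (values are made up for illustration).
counties <- tibble(
  year = c(2015, 2015, 2019, 2019),
  state = c("01", "01", "01", "01"),
  county = c("001", "003", "001", "003")
)
transportation_county <- counties |>
  mutate(transit_trips_80ami = c(10, 25, 12, 30))

keys <- c("year", "state", "county")

# A full join should not add rows if both tables cover the same counties.
fully_joined <- full_join(transportation_county, counties, by = keys)
stopifnot(nrow(fully_joined) == nrow(counties))

# No row in either table should be missing from the other.
unmatched_left <- anti_join(transportation_county, counties, by = keys)
unmatched_right <- anti_join(counties, transportation_county, by = keys)
stopifnot(nrow(unmatched_left) == 0, nrow(unmatched_right) == 0)
```

If any assertion fails, the script stops at that line, which makes a coverage problem harder to miss than a printed join result.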
  geom_histogram(binwidth = 5) +
  facet_wrap(~ year, nrow = 2) +
  labs(
    x = "Annual Transit Trips for the Regional Moderate Income Household",
I had been using "moderate income" because that was H+T language too, but Claudia suggested we use "low income" in the blog post.
```

Makes sense for most counties to fall in really low transit trip numbers
This is not a legitimate comment or analysis for the transportation cost data. The key insight I have is that the distribution looks relatively similar between 2015 and 2019.
It would also be helpful to note that the x axis is a share of total household costs spent on transportation.
transportation_county <- transportation_county |>
  rename(
    count_transit_trips = transit_trips_80ami,
    index_transportation_cost = t_80ami
This may go above my level of control, but I don't think index_transportation_cost is a good variable name because I don't think it is really an index. Rather, it seems like it is a share (i.e., the share of annual household income spent on transportation).
I would suggest share_hh_transportation_spending as an alternative.

### Read data

The data from HUD cannot be easily read directly into this program.
I think this comment is confusing. We don't get data directly or indirectly from HUD
Removed!

```{r}
transportation_tracts <- transportation_tracts |>
  rename (GEOID = tract) |>
extra space here
Fixed!
acs_tracts <- acs_tracts |>
  rename(
    total_population = B03002_001E,
    non_hispanic = B03002_002E,
I don't think you actually use this but that is not a problem
non_hispanic? I think you are correct. I just pulled that because I was running some checks about the overlapping/non-overlapping hierarchy in the categories in the data.
The new method is looking good, Aaron!
Two big requests:
- Move the final evaluation form to the correct folder (link in comments) and update the file path in the evaluation form
- Consider how changes in the trip count variable between years can be baked into the quality variable. I lay out a potential method in the comments. The large variations between years concern me.
@@ -0,0 +1,9 @@
,This form to be filled in for the data in the subgroup files. If the metric has multiple variables please include input for each variable in the file.,,,,,
We updated the destination for these files to https://github.com/UI-Research/mobility-from-poverty/tree/version2025/10a_final-evaluation. Can you move this file there and update the path in your evaluation functions?
Updated.
relevant variables and years

```{r}
transport_county_2015 <- read_csv(
Not a priority for this round, but a note for a future update: we should wrap this read_csv call in a function so that the list of years is the only item updated each round.
Good note. Added!
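A rough sketch of the refactor suggested above, in which the vector of years is the only thing that changes each round. The helper name `read_transport_county`, the `transport_county_<year>.csv` naming pattern, and the demo columns are all assumptions for illustration; the repository's actual paths and column types may differ.

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical helper: read one year's county file and tag it with the year.
# The file-naming pattern is an assumption, not the repository's layout.
read_transport_county <- function(yr, data_dir) {
  path <- file.path(data_dir, paste0("transport_county_", yr, ".csv"))
  read_csv(
    path,
    # Keep FIPS codes as character so leading zeros survive.
    col_types = cols(state = col_character(), county = col_character()),
    show_col_types = FALSE
  ) |>
    mutate(year = yr)
}

# The only item to update each round.
years <- c(2015, 2019)

# Demo: write tiny temporary files so the sketch runs on its own.
data_dir <- tempfile("transport_demo")
dir.create(data_dir)
for (y in years) {
  write_csv(
    tibble(state = "01", county = c("001", "003"), t_80ami = c(25, 30)),
    file.path(data_dir, paste0("transport_county_", y, ".csv"))
  )
}

# Read every year with the same function and stack the results.
transportation_county <- years |>
  map(read_transport_county, data_dir = data_dir) |>
  bind_rows()
```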

```

We transform transportation cost from an unlabeled percentage to a proportion.
Perhaps a note on what t_80ami is? It could help users better understand the transformation.
Added!

```

The transit trips index is very noisy and tough to interpret. We topcode values at 1,095 (365 * 3), divide the range into 100 bins, and assign values to those bins (using a linear transformation and rounding). Percentile ranking did not work well because there were many ties and it obfuscated the distribution of the variable.
A visual of the raw distribution would be helpful to have
Can you add a bit of explanation on how the topcode value was chosen?
Added and added!

```{r}
transportation_county <- transportation_county |>
  mutate(count_transit_trips = pmin(1065, count_transit_trips)) |>
Above you say you are topcoding values at 1,095 (which is the product of 365 and 3 per your note) but this code topcodes at 1,065.
Whoops! Fixed!
```{r}
transportation_county <- transportation_county |>
  mutate(count_transit_trips = pmin(1065, count_transit_trips)) |>
  mutate(score_transit_trips = round((count_transit_trips / 1065) * 100))
Same here (1,065 instead of 1,095)
Whoops! Fixed!
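One possible way to keep the topcode and the divisor from drifting apart, as happened with 1,065 versus 1,095, is to define the cap once and reuse it. This is only a sketch: the constant name `topcode_trips` and the toy values are assumptions, and the actual fix in the PR may look different.

```r
library(dplyr)

# Named constant so the cap and the divisor cannot disagree.
# 1,095 = 365 * 3, the topcode described in the text.
topcode_trips <- 365 * 3

# Toy data; the values are made up for illustration.
transportation_county <- tibble(
  count_transit_trips = c(0, 22, 301, 1095, 1150)
)

transportation_county <- transportation_county |>
  mutate(
    # Cap the trip count at the topcode.
    count_transit_trips = pmin(topcode_trips, count_transit_trips),
    # Linear rescale of the capped count to a 0-100 score, then round.
    score_transit_trips = round((count_transit_trips / topcode_trips) * 100)
  )
```

With these toy values, any count at or above 1,095 receives a score of 100.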
  geom_point(alpha = 0.1) +
  facet_wrap(~ size) +
  coord_equal() +
  labs(subtitle = "Large counties have at least 200,000 households")
Great viz. I am not sure what to make of the large counties that show such drastic changes in transit trips between 2015 and 2022. Looking at one of the outlier cases, the data for DC show 22 trips for transit_trips_80ami in 2015, 1150 in 2019, and then back down to 301 in 2022.
Run the evaluation function.

```{r}
evaluate_final_data(
See the note above: update the file path after moving the evaluation form to the 10b folder.
Fixed.

```

## Data Quality Marker
I worry that we do not consider extreme variations in the transit trip counts reported for certain counties in the quality variable. A loose concept I have for baking this into quality is as follows:
- Take the product of the count_transit_trips and households variables to create a new variable, estimated_total_trips
- Calculate the delta in this variable from the year prior (for the oldest year it will be the year following)
- If the change exceeds X percent we should give a quality of 3 for that year (not sure how to select X criteria)
I want to include population because smaller counties may be more prone to large shifts in the average, but this should be smaller for estimated total trips. Happy to brainstorm this further with you.
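The steps above could be sketched roughly as follows. This is one reading of the proposal, not an agreed method: the threshold `max_pct_change` is a placeholder for the undecided X, the oldest year is compared with the following year, the household counts are made up, and the trip counts are toy values. Mapping the resulting flag onto a quality score of 3 is left out because the other quality rules are not shown here.

```r
library(dplyr)

# Placeholder for "X percent"; the proposal leaves X undecided.
max_pct_change <- 200

# Toy data: one stable county and one volatile county (values made up).
transportation_county <- tibble(
  state = rep(c("01", "11"), each = 3),
  county = "001",
  year = rep(c(2015, 2019, 2022), times = 2),
  count_transit_trips = c(10, 11, 9, 22, 1150, 301),
  households = c(1000, 1000, 1000, 280000, 290000, 310000)
)

flagged <- transportation_county |>
  mutate(estimated_total_trips = count_transit_trips * households) |>
  group_by(state, county) |>
  arrange(year, .by_group = TRUE) |>
  mutate(
    prior = lag(estimated_total_trips),
    following = lead(estimated_total_trips),
    # Percent change from the prior year; the oldest year has no prior
    # year, so it uses the change to the following year instead.
    # Note: a base value of zero would give Inf or NaN here.
    pct_change = if_else(
      is.na(prior),
      abs(following - estimated_total_trips) / estimated_total_trips * 100,
      abs(estimated_total_trips - prior) / prior * 100
    ),
    large_change = pct_change > max_pct_change
  ) |>
  ungroup()
```

In this toy example the stable county is never flagged, while the volatile county is flagged in 2015 and 2019 but not in 2022, since the 2019 to 2022 drop is under the placeholder threshold. A rule like this would also flag widespread, real shifts such as a sector-wide drop in ridership, so the threshold and the choice of comparison year would need discussion.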
Thank you for your thoughtfulness. I share all of your concerns. This is a really interesting idea.
- The values can change a lot for natural reasons. For example, there is a big drop everywhere between 2019 and 2022 because of changes in commuting patterns.
- "Calculate the delta in this variable from the year prior (for the oldest year it will be the year following)" -- Can you unpack this? What do I do for 2015?
I think the data quality for the transit trips is poor based on how much the values change between 2015 and 2019.