424 refactor resstock data fetch and prep #426
It looks like assign_utility_ny.py contains a lot of function definitions that will be recycled for new states. Move the generic functions to assign_utility.py and import them into the state-specific files.
The intent of the state-specific files would then be to do any one-off data transformations needed before passing the formatted shape files to our established pipeline.
Overall this is a great start. The main restructuring I suggest is a clear separation of state-specific settings from our main data pipeline.
The issue is that state-specific behavior leaks into main.py in ways that require editing main.py itself when you onboard a new state.
Specifically:
1. _assign_utility has state dispatch logic and state-specific data loading inline
https://github.com/switchbox-data/rate-design-platform/blob/HEAD/data/resstock/main.py#L459-L511
The RI branch is 1 line. The NY branch is ~50 lines of polygon loading, pygris calls, and CONFIGS wiring. Adding CT means adding another elif with its own inline data loading. main.py grows with every state.
2. NY-specific CLI args at the top level of main.py
--ny-electric-poly-filename, --ny-gas-poly-filename, --path-s3-gis-dir are NY implementation details surfaced as pipeline-level arguments. CT would need --ct-electric-poly-filename, etc.
3. --add-vulnerability-columns default of True is a NY assumption
The Justfile knows RI needs --add-vulnerability-columns False. But main.py defaults to True.
Proposed design: state config YAML + uniform module interface
Move per-state decisions into a config file that main.py reads, and give utility assignment a uniform interface that main.py dispatches to without knowing state internals.
A. Per-state config: data/resstock/states/<state>.yaml
Each onboarded state gets a small YAML file:
# data/resstock/states/ny.yaml
state: NY
add_vulnerability_columns: true
utility_assignment:
  module: data.resstock.assign_utility_ny
  kwargs:
    s3_gis_dir: s3://data.sb/gis/utility_boundaries/
    electric_poly_filename: ny_electric_utilities_20260309.csv
    gas_poly_filename: ny_gas_utilities_20260309.csv
    puma_year: 2019
B. Uniform utility assignment interface
Each state module exports a single function with a uniform signature:
import polars as pl

def assign_utility(metadata: pl.LazyFrame, **kwargs) -> pl.LazyFrame:
    """Returns LazyFrame with bldg_id, sb.electric_utility, sb.gas_utility."""
    ...

# data/resstock/assign_utility_ny.py
def assign_utility(metadata: pl.LazyFrame, **kwargs) -> pl.LazyFrame:
    """kwargs: s3_gis_dir, electric_poly_filename, gas_poly_filename, puma_year"""
    # All the polygon loading, pygris calls, CONFIGS wiring lives HERE,
    # not in main.py
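For RI, the same entry point could be a thin wrapper around the existing rule-based function. The sketch below is only an illustration: it uses a plain list of dicts as a stand-in for pl.LazyFrame so it runs without polars, the body of assign_utility_ri and the utility names are placeholders, and rejecting unexpected kwargs is a design assumption rather than anything the PR specifies.

```python
from typing import Any

# Stand-in for pl.LazyFrame so the sketch runs without polars (assumption).
Frame = list[dict[str, Any]]


def assign_utility_ri(metadata: Frame) -> Frame:
    """Hypothetical stand-in for the existing rule-based RI function."""
    return [
        {
            "bldg_id": row["bldg_id"],
            "sb.electric_utility": "ri_electric_placeholder",
            "sb.gas_utility": "ri_gas_placeholder",
        }
        for row in metadata
    ]


def assign_utility(metadata: Frame, **kwargs: Any) -> Frame:
    """Uniform entry point; RI needs no kwargs, so stray ones fail loudly."""
    if kwargs:
        raise TypeError(f"RI takes no utility-assignment kwargs, got {sorted(kwargs)}")
    return assign_utility_ri(metadata)


rows = assign_utility([{"bldg_id": 1}, {"bldg_id": 2}])
```

Failing on unexpected kwargs means a typo in the RI config surfaces at dispatch time instead of being silently ignored.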
main.py dispatch becomes generic
import importlib

def _assign_utility(*, states, path_sb, state_configs, ...):
    for s in states:
        cfg = state_configs[s]
        ua_cfg = cfg.get("utility_assignment")
        if ua_cfg is None:
            print(f"  No utility assignment configured for {s}, skipping.")
            continue
        # Resolve the state module by name; main.py never imports it directly.
        mod = importlib.import_module(ua_cfg["module"])
        result = mod.assign_utility(metadata, **ua_cfg.get("kwargs", {}))
        # ... write result, upload ...
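To show that the dispatch really needs no state-specific imports, here is a self-contained sketch that runs as written. The module name demo_assign_utility_ny, the helper _register_fake_module, and the list-of-dicts stand-in for pl.LazyFrame are all hypothetical and exist only for the demo; a real run would import the actual state modules.

```python
import importlib
import sys
import types
from typing import Any

Frame = list[dict[str, Any]]  # stand-in for pl.LazyFrame (assumption)


def _register_fake_module(name: str, electric: str) -> None:
    """Register an in-memory module so import_module can find it (demo only)."""
    mod = types.ModuleType(name)

    def assign_utility(metadata: Frame, **kwargs: Any) -> Frame:
        # Record which kwargs arrived so the demo can check the wiring.
        return [
            {**row, "sb.electric_utility": electric, "kwargs_seen": sorted(kwargs)}
            for row in metadata
        ]

    setattr(mod, "assign_utility", assign_utility)
    sys.modules[name] = mod


def dispatch_utility_assignment(
    states: list[str], state_configs: dict[str, dict], metadata: Frame
) -> dict[str, Frame]:
    """Call each configured state's assign_utility; skip states without one."""
    results: dict[str, Frame] = {}
    for s in states:
        ua_cfg = state_configs[s].get("utility_assignment")
        if ua_cfg is None:
            continue
        mod = importlib.import_module(ua_cfg["module"])
        results[s] = mod.assign_utility(metadata, **ua_cfg.get("kwargs", {}))
    return results


_register_fake_module("demo_assign_utility_ny", "ny_placeholder")
state_configs = {
    "NY": {
        "utility_assignment": {
            "module": "demo_assign_utility_ny",
            "kwargs": {"puma_year": 2019},
        }
    },
    "RI": {},  # no utility_assignment block, so RI is skipped
}
results = dispatch_utility_assignment(["NY", "RI"], state_configs, [{"bldg_id": 1}])
```

Adding a state then touches only the config mapping and the new module, not the dispatch function.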
What would need to change
- Create data/resstock/states/ny.yaml and ri.yaml: extract from current hardcoded values (small)
- Add state config loading to main.py: read YAMLs, merge with CLI args; ~30 lines in main() to load and validate (small)
- Give assign_utility_ny.py a uniform assign_utility(metadata, **kwargs) entry point: move polygon/pygris loading out of main.py into the module; logic exists at main.py:462–507 (medium)
- Give assign_utility_ri.py the same assign_utility(metadata, **kwargs): trivial wrapper around existing assign_utility_ri (tiny)
- Remove --ny-electric-poly-filename, --ny-gas-poly-filename, --path-s3-gis-dir from main.py CLI: delete args, remove from _assign_utility (small)
- Remove if s == "RI" / elif s == "NY" from _assign_utility: replace with importlib.import_module dispatch (small)
- Move --add-vulnerability-columns default to state config: read from states/<s>.yaml, CLI becomes override-only (small)
- Delete SUPPORTED_UTILITY_STATES from assign_utility.py: state discovery is now "does states/<s>.yaml exist?" (tiny)
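The "CLI becomes override-only" item depends on telling "flag not given" apart from "flag set to False". A minimal sketch of one way to do that with argparse follows. The helper names parse_bool and resolve_add_vulnerability_columns are hypothetical, and the two config dicts stand in for parsed state YAML files.

```python
import argparse
from typing import Optional


def parse_bool(value: str) -> bool:
    """Parse the True/False strings passed on the command line."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # default=None means "not given", which lets the state config decide.
    parser.add_argument("--add-vulnerability-columns", type=parse_bool, default=None)
    return parser


def resolve_add_vulnerability_columns(cli_value: Optional[bool], state_cfg: dict) -> bool:
    """The CLI value wins when present; otherwise use the state config."""
    if cli_value is not None:
        return cli_value
    return bool(state_cfg["add_vulnerability_columns"])


parser = build_parser()
ny_cfg = {"add_vulnerability_columns": True}   # stand-in for states/ny.yaml
ri_cfg = {"add_vulnerability_columns": False}  # stand-in for states/ri.yaml

no_flag = parser.parse_args([])
override = parser.parse_args(["--add-vulnerability-columns", "False"])

ny_default = resolve_add_vulnerability_columns(no_flag.add_vulnerability_columns, ny_cfg)
ri_default = resolve_add_vulnerability_columns(no_flag.add_vulnerability_columns, ri_cfg)
ny_forced_off = resolve_add_vulnerability_columns(override.add_vulnerability_columns, ny_cfg)
```

With this shape the Justfile no longer has to remember that RI needs the flag turned off.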
What onboarding a new state would look like
| Step | What you create | Who changes |
|---|---|---|
| 1. Write assignment logic | data/resstock/assign_utility_ct.py with assign_utility(metadata, **kwargs) | New file |
| 2. Create state config | data/resstock/states/ct.yaml with assignment module, kwargs, vulnerability flag | New file |
| 3. Add Justfile recipe (optional) | create-sb-release-for-upgrade-02-CT one-liner | Justfile (append) |
| 4. Run | just run-pipeline CT --upgrade-ids 0 2 | Nothing |
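Under this proposal a state counts as onboarded when its config file exists, so discovery can be a directory listing. The sketch below assumes the states/<state>.yaml layout described above and runs against a temporary directory. The function names discover_states and require_state are hypothetical.

```python
import tempfile
from pathlib import Path


def discover_states(states_dir: Path) -> list[str]:
    """Return upper-case state codes for every <state>.yaml in states_dir."""
    return sorted(p.stem.upper() for p in states_dir.glob("*.yaml"))


def require_state(state: str, states_dir: Path) -> Path:
    """Return the config path for a state, or fail listing what is onboarded."""
    path = states_dir / f"{state.lower()}.yaml"
    if not path.is_file():
        known = ", ".join(discover_states(states_dir)) or "none"
        raise ValueError(f"State {state} is not onboarded (found: {known})")
    return path


# Demo against a throwaway directory standing in for data/resstock/states/.
with tempfile.TemporaryDirectory() as tmp:
    states_dir = Path(tmp)
    for code in ("ny", "ri", "ct"):
        (states_dir / f"{code}.yaml").write_text(f"state: {code.upper()}\n")
    onboarded = discover_states(states_dir)
    ct_config_name = require_state("CT", states_dir).name
    try:
        require_state("MA", states_dir)
        missing_error = ""
    except ValueError as exc:
        missing_error = str(exc)
```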
Summary
This PR builds the unified data/resstock/main.py pipeline for preparing
ResStock data for CAIRO, and refactors all utility assignment code to be
state-generic and fully modular. It also introduces centralised state
configuration, a new module layout, and a complete guide for onboarding new
states.
Closes #424
data/resstock/main.py: unified pipeline
main.py is the single entry point for the entire ResStock _sbrelease
preparation workflow. A single invocation covers every step from download
through upload.
Pipeline steps executed in order:
- […] (data.resstock.nrel.fetch_resstock_data)
- […] (state_configs.yaml)
- […] (data/resstock/utility/assign_utility.py): runs immediately after metadata so utility columns are available before load work
- […] _sbrelease tree to S3
Pre-flight validation runs before any data work: unsupported states, missing
polygon filenames, and required upgrade IDs all fail fast with a clear error.
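As a rough illustration of the pre-flight idea, the sketch below collects every problem first and then fails once with a combined message. It rests on several assumptions: the function names are hypothetical, the set of required upgrade IDs is passed in because the PR text does not say which ones are required, and a state is treated as GIS-based (and so needing polygon filenames) when its config has a state_crs key.

```python
from typing import Any, Iterable

# Assumption: these two keys hold the polygon CSV names for GIS-based states.
POLY_KEYS = ("electric_poly_filename", "gas_poly_filename")


def preflight_errors(
    states: Iterable[str],
    upgrade_ids: Iterable[int],
    state_configs: dict[str, dict[str, Any]],
    required_upgrade_ids: Iterable[int],
) -> list[str]:
    """Collect every configuration problem before any data work starts."""
    errors: list[str] = []
    for state in states:
        cfg = state_configs.get(state)
        if cfg is None:
            errors.append(f"{state}: unsupported state (no entry in state_configs)")
            continue
        # Assumption: a state_crs key marks a GIS-based state that needs polygons.
        if "state_crs" in cfg:
            errors.extend(
                f"{state}: missing {key}" for key in POLY_KEYS if not cfg.get(key)
            )
    missing_ids = sorted(set(required_upgrade_ids) - set(upgrade_ids))
    if missing_ids:
        errors.append(f"missing required upgrade IDs: {missing_ids}")
    return errors


def run_preflight(*args: Any, **kwargs: Any) -> None:
    """Fail fast with one message that lists every problem found."""
    errors = preflight_errors(*args, **kwargs)
    if errors:
        details = "\n".join(f"  - {e}" for e in errors)
        raise SystemExit(f"Pre-flight validation failed:\n{details}")


configs = {
    "NY": {
        "state_crs": "placeholder-crs",  # illustrative value only
        "electric_poly_filename": "ny_electric_utilities_20260309.csv",
        "gas_poly_filename": None,  # deliberately broken for the demo
    },
    "RI": {"electric_poly_filename": None, "gas_poly_filename": None},
}
problems = preflight_errors(["NY", "RI", "CT"], [2], configs, required_upgrade_ids=[0])
```

Reporting all problems in one pass saves a rerun per mistake when a new state is being wired in.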
State-specific configuration: data/resstock/state_configs.yaml
All per-state constants are centralised in a new state_configs.yaml file
(keyed by 2-letter state code). This replaces hardcoded values scattered
across scripts.
Key design decisions:
- add_vulnerability_columns: controls whether PUMS LMI vulnerability columns are computed for that state; can be overridden per-run via --add-vulnerability-columns True/False.
- electric_poly_filename / gas_poly_filename: presence of both keys (even with null values, as for RI) automatically includes the state in SUPPORTED_UTILITY_STATES at import time. No manual set maintenance.
- excluded_gas_utilities: small gas utilities excluded from assignment (formerly the hardcoded SMALL_GAS_UTILITIES constant).
Utility assignment refactor
Module layout
assign_utility.py: central routing
assign_utility() is the only function callers need. Logic is a direct
state-by-state conditional, with no intermediate abstractions.
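A minimal sketch of what such routing might look like, together with the key-presence rule for SUPPORTED_UTILITY_STATES described under the design decisions. Everything here is illustrative: the real assign_utility() signature is not shown in this description, STATE_CONFIGS stands in for the parsed state_configs.yaml, the per-state functions are placeholders, and a list of dicts replaces pl.LazyFrame so the block runs without polars.

```python
from typing import Any

Frame = list[dict[str, Any]]  # stand-in for pl.LazyFrame (assumption)

# Stand-in for the parsed state_configs.yaml; values are illustrative.
STATE_CONFIGS: dict[str, dict[str, Any]] = {
    "NY": {
        "electric_poly_filename": "ny_electric_utilities_20260309.csv",
        "gas_poly_filename": "ny_gas_utilities_20260309.csv",
    },
    "RI": {"electric_poly_filename": None, "gas_poly_filename": None},
    "CT": {"add_vulnerability_columns": True},  # no polygon keys yet
}

_POLY_KEYS = ("electric_poly_filename", "gas_poly_filename")

# Key presence, not value, decides membership, so RI's nulls still count.
SUPPORTED_UTILITY_STATES = frozenset(
    state
    for state, cfg in STATE_CONFIGS.items()
    if all(key in cfg for key in _POLY_KEYS)
)


def assign_utility_ny(metadata: Frame) -> Frame:
    """Placeholder for the GIS-based NY assignment."""
    return [{**row, "sb.electric_utility": "ny_placeholder"} for row in metadata]


def assign_utility_ri(metadata: Frame) -> Frame:
    """Placeholder for the rule-based RI assignment."""
    return [{**row, "sb.electric_utility": "ri_placeholder"} for row in metadata]


def assign_utility(state: str, metadata: Frame) -> Frame:
    """Route to the per-state function with a plain conditional."""
    if state not in SUPPORTED_UTILITY_STATES:
        raise ValueError(f"Utility assignment is not supported for {state}")
    if state == "NY":
        return assign_utility_ny(metadata)
    if state == "RI":
        return assign_utility_ri(metadata)
    # Configured in YAML but no branch was added here yet.
    raise NotImplementedError(f"{state} has polygon keys but no routing branch")


ri_rows = assign_utility("RI", [{"bldg_id": 7}])
```

The final NotImplementedError covers the case where a state has been added to the config but its branch has not yet been written.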
assign_utility_ny.py: thin NY wrapper
assign_utility_ny() now only builds the NY-specific utility name crosswalk
and passes it, along with EXCLUDED_GAS_UTILITIES and state_crs, to the
generic create_hh_utilities() in utils.py. All algorithmic logic lives in
the generic module.
utils.py: state-generic helpers
All reusable GIS logic was extracted here. Any GIS-based state can compose
these without importing from a NY-specific module:
- read_csv_to_gdf_from_s3: load a WKT polygon CSV from S3 as a GeoDataFrame
- calculate_puma_utility_overlap: compute area-weighted PUMA × utility overlap
- calculate_utility_probabilities: turn overlap into per-PUMA probability tables
- calculate_prior_distributions: building-weighted prior distributions
- zero_excluded_gas_utilities_and_renormalize: zero excluded utilities, renormalize, nearest-neighbour donor for PUMAs left with zero probability
- create_hh_utilities: full GIS assignment pipeline (parameterised by utility_name_map, state_crs, excluded_gas_utilities)
- sample_utility_per_building, print_comparison_summary, puma_id_series_for_join
NREL scripts relocated
fetch_resstock_data.py and copy_resstock_data.py moved from
data/resstock/ to data/resstock/nrel/ to separate NREL-specific fetch
logic from the pipeline orchestration.
Adding a new state
See context/code/data/ny_utility_assignment_resstock.md § Adding a new state
for the full checklist. In brief:
1. state_configs.yaml: add a state entry with state_fips, add_vulnerability_columns, electric_poly_filename, gas_poly_filename (and state_crs + puma_year for GIS states). The state is automatically included in SUPPORTED_UTILITY_STATES once both polygon filename keys are present.
2. assign_utility_{xx}.py: create the state module following the NY (GIS) or RI (rule-based) pattern. GIS states delegate to the generic create_hh_utilities() in utils.py.
3. assign_utility.py: import the new function and add an if state == "XX": branch.
4. tests/test_assign_utility_{xx}.py.
Reviewer focus
- main.py step ordering: utility assignment (2b) runs before load curve steps (2c, 2d) so utility columns are available to downstream work. Pre-flight validation is separated from per-step logic.
- state_configs.yaml as single source of truth: any new per-state constant should go here, not be hardcoded in a script.
- utils.py generic helpers: designed to be reused by any future GIS-based state; the explicit utility_name_map and excluded_gas_utilities params of create_hh_utilities are the intended extension points.
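For reviewers who want a feel for what the generic helpers compute, here is a much-simplified sketch of three of the steps: turning overlap areas into per-PUMA probabilities, zeroing excluded utilities and renormalising, and sampling one utility per building. It is not the code in utils.py. The function names, the dict-based data shapes, and the utility labels are all assumptions, and the nearest-neighbour donor fallback for PUMAs left with zero probability is left out.

```python
import random

Dist = dict[str, float]


def overlap_to_probabilities(overlap_area: dict[str, Dist]) -> dict[str, Dist]:
    """Normalise each PUMA's utility overlap areas so they sum to 1."""
    probs: dict[str, Dist] = {}
    for puma, areas in overlap_area.items():
        total = sum(areas.values())
        probs[puma] = {u: (a / total if total > 0 else 0.0) for u, a in areas.items()}
    return probs


def zero_excluded_and_renormalize(probs: dict[str, Dist], excluded: set[str]) -> dict[str, Dist]:
    """Zero excluded utilities, then rescale the remainder to sum to 1.

    PUMAs left with no probability mass come back all-zero; the donor
    fallback mentioned in the PR is not implemented in this sketch.
    """
    out: dict[str, Dist] = {}
    for puma, dist in probs.items():
        kept = {u: (0.0 if u in excluded else p) for u, p in dist.items()}
        total = sum(kept.values())
        out[puma] = {u: (p / total if total > 0 else 0.0) for u, p in kept.items()}
    return out


def sample_utility(dist: Dist, rng: random.Random) -> str:
    """Draw one utility name with probability proportional to its weight."""
    utilities = [u for u, p in dist.items() if p > 0]
    if not utilities:
        raise ValueError("no utility has positive probability")
    weights = [dist[u] for u in utilities]
    return rng.choices(utilities, weights=weights, k=1)[0]


# Illustrative overlap areas for one PUMA (arbitrary units).
overlap = {"puma_a": {"util_x": 30.0, "util_y": 10.0, "small_gas": 10.0}}
probs = overlap_to_probabilities(overlap)
cleaned = zero_excluded_and_renormalize(probs, excluded={"small_gas"})

rng = random.Random(0)  # seeded so repeated runs draw the same sequence
draws = [sample_utility(cleaned["puma_a"], rng) for _ in range(200)]
```

In this example util_x ends up with three quarters of the probability mass and the excluded utility can never be drawn.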