
424 refactor resstock data fetch and prep #426

Merged
alexhyunminlee merged 31 commits into main from 424-refactor-resstock-data-fetch-and-prep on Jun 4, 2026


@alexhyunminlee alexhyunminlee commented May 15, 2026

Collaborator

Summary

This PR builds the unified data/resstock/main.py pipeline for preparing
ResStock data for CAIRO, and refactors all utility assignment code to be
state-generic and fully modular. It also introduces centralised state
configuration, a new module layout, and a complete guide for onboarding new
states.

Closes #424


data/resstock/main.py — unified pipeline

main.py is the single entry point for the entire ResStock _sb release
preparation workflow. A single invocation covers every step from download
through upload:

uv run python -m data.resstock.main \
  --state NY RI \
  --release res_2024_amy2018_2 \
  --upgrade-ids 0 2 \
  --path-s3-base s3://data.sb/nrel/resstock/ \
  --path-s3-gis-dir s3://data.sb/switchbox/gis/utility_polygons/ \
  --path-local-base /local/resstock/

Pipeline steps executed in order:

| Step | Description |
|------|-------------|
| 1 | Fetch raw ResStock parquet files from NREL S3 (data.resstock.nrel.fetch_resstock_data) |
| 2a | Modify metadata: identify HP customers, heating type, natgas connection, add LMI vulnerability columns (PUMS, state-specific default via state_configs.yaml) |
| 2b | Assign electric and gas utilities (data/resstock/utility/assign_utility.py); runs immediately after metadata so utility columns are available before load work |
| 2c-i | Approximate non-HP load for MF high-rise buildings |
| 2c-ii | Adjust MF electricity |
| 2d | Add monthly load aggregations |
| 3 | Upload the complete _sb release tree to S3 |

Pre-flight validation runs before any data work: unsupported states, missing
polygon filenames, and missing required upgrade IDs all fail fast with a clear error.
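
As a rough illustration of the fail-fast idea, the sketch below collects all pre-flight problems before any download starts. It is not the code from main.py: the function names, the REQUIRED_UPGRADE_IDS set, and the rule that a state with state_crs needs polygon filenames are all assumptions made for the example.

```python
# Hypothetical sketch of pre-flight checks; names and rules are illustrative assumptions.
STATE_CONFIGS = {
    "NY": {
        "electric_poly_filename": "ny_electric_utilities_20260309.csv",
        "gas_poly_filename": "ny_gas_utilities_20260309.csv",
        "state_crs": 2260,
    },
    "RI": {"electric_poly_filename": None, "gas_poly_filename": None},
}
REQUIRED_UPGRADE_IDS = {0}  # assumption: the baseline upgrade must always be requested


def preflight_errors(states, upgrade_ids, configs=STATE_CONFIGS):
    """Return a list of problems; an empty list means the run may proceed."""
    errors = []
    for state in states:
        cfg = configs.get(state)
        if cfg is None:
            errors.append(f"{state}: unsupported state")
            continue
        # Assumption: a state with state_crs uses GIS assignment and needs polygon files.
        if "state_crs" in cfg:
            for key in ("electric_poly_filename", "gas_poly_filename"):
                if not cfg.get(key):
                    errors.append(f"{state}: missing {key}")
    missing = REQUIRED_UPGRADE_IDS - set(upgrade_ids)
    if missing:
        errors.append(f"missing required upgrade IDs: {sorted(missing)}")
    return errors


def run_preflight(states, upgrade_ids):
    """Raise one ValueError listing every problem, before any data work starts."""
    errors = preflight_errors(states, upgrade_ids)
    if errors:
        raise ValueError("Pre-flight validation failed:\n  " + "\n  ".join(errors))
```

Collecting every error before raising means a misconfigured run reports all of its problems at once rather than one per attempt.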


State-specific configuration — data/resstock/state_configs.yaml

All per-state constants are centralised in a new state_configs.yaml file
(keyed by 2-letter state code). This replaces hardcoded values scattered
across scripts:

NY:
  state_fips: "36"
  add_vulnerability_columns: true
  state_crs: 2260
  puma_year: 2019
  electric_poly_filename: ny_electric_utilities_20260309.csv
  gas_poly_filename: ny_gas_utilities_20260309.csv
  excluded_gas_utilities:
    - bath
    - chautauqua
    - corning
    - fillmore
    - reserve
    - stlaw
RI:
  state_fips: "44"
  add_vulnerability_columns: false
  electric_poly_filename:
  gas_poly_filename:

Key design decisions:

  • add_vulnerability_columns — controls whether PUMS LMI vulnerability
    columns are computed for that state; can be overridden per-run via
    --add-vulnerability-columns True/False.
  • electric_poly_filename / gas_poly_filename — presence of both keys
    (even with null values, as for RI) automatically includes the state in
    SUPPORTED_UTILITY_STATES at import time. No manual set maintenance.
  • excluded_gas_utilities — small gas utilities excluded from assignment
    (formerly the hardcoded SMALL_GAS_UTILITIES constant).
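
One way the "presence of both keys" rule could be expressed is sketched below. The dict stands in for the parsed state_configs.yaml (a null YAML value loads as None); the helper name supported_utility_states and the "XX" entry are hypothetical and only illustrate the rule.

```python
# Stand-in for the parsed contents of state_configs.yaml.
STATE_CONFIGS = {
    "NY": {
        "state_fips": "36",
        "electric_poly_filename": "ny_electric_utilities_20260309.csv",
        "gas_poly_filename": "ny_gas_utilities_20260309.csv",
    },
    "RI": {
        "state_fips": "44",
        "electric_poly_filename": None,  # key present, value null
        "gas_poly_filename": None,
    },
    "XX": {"state_fips": "00"},  # hypothetical state with no polygon keys
}
POLY_KEYS = ("electric_poly_filename", "gas_poly_filename")


def supported_utility_states(configs):
    """A state qualifies when both polygon keys exist, regardless of their values."""
    return {
        state
        for state, cfg in configs.items()
        if all(key in cfg for key in POLY_KEYS)
    }


# Computed once at import time, so adding a YAML entry is enough to register a state.
SUPPORTED_UTILITY_STATES = supported_utility_states(STATE_CONFIGS)
```

Testing key presence rather than truthiness is what lets a rule-based state such as RI register with null filenames.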

Utility assignment refactor

Module layout

data/resstock/utility/
  assign_utility.py      # Central facade — routes to state implementations
  assign_utility_ny.py   # NY: thin wrapper, passes NY config to generic helpers
  assign_utility_ri.py   # RI: deterministic rule-based assignment
  utils.py               # State-generic GIS helpers (used by all GIS states)

assign_utility.py — central routing

assign_utility() is the only function callers need. Logic is a direct
state-by-state conditional — no intermediate abstractions:

if state == "RI":
    return assign_utility_ri(metadata)

if state == "NY":
    # validate inputs, load polygon CSVs from S3, fetch PUMAs via pygris
    return assign_utility_ny(...)

raise ValueError(...)  # catches states added to SUPPORTED_UTILITY_STATES without a branch

assign_utility_ny.py — thin NY wrapper

assign_utility_ny() now only builds the NY-specific utility name crosswalk
and passes it — along with EXCLUDED_GAS_UTILITIES and state_crs — to the
generic create_hh_utilities() in utils.py. All algorithmic logic lives in
the generic module.

utils.py — state-generic helpers

All reusable GIS logic was extracted here. Any GIS-based state can compose
these without importing from a NY-specific module:

  • read_csv_to_gdf_from_s3 — load a WKT polygon CSV from S3 as a GeoDataFrame
  • calculate_puma_utility_overlap — compute area-weighted PUMA × utility overlap
  • calculate_utility_probabilities — turn overlap into per-PUMA probability tables
  • calculate_prior_distributions — building-weighted prior distributions
  • zero_excluded_gas_utilities_and_renormalize — zero excluded utilities, renormalize, nearest-neighbour donor for PUMAs left with zero probability
  • create_hh_utilities — full GIS assignment pipeline (parameterised by utility_name_map, state_crs, excluded_gas_utilities)
  • sample_utility_per_building, print_comparison_summary, puma_id_series_for_join
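
To make the data flow through these helpers concrete, here is a simplified sketch using plain dictionaries instead of GeoDataFrames. It is not the code in utils.py: the function names are shortened stand-ins, the PUMA IDs and the "coned" utility label are made up for illustration, and the nearest-neighbour donor step is only flagged, not implemented.

```python
import random


def overlap_to_probabilities(overlap_area):
    """Turn {puma: {utility: overlap_area}} into {puma: {utility: probability}}."""
    probs = {}
    for puma, areas in overlap_area.items():
        total = sum(areas.values())
        if total > 0:
            probs[puma] = {u: a / total for u, a in areas.items()}
        else:
            probs[puma] = {u: 0.0 for u in areas}
    return probs


def zero_excluded_and_renormalize(probs, excluded):
    """Zero excluded utilities and rescale each PUMA's row to sum to 1.

    Returns the cleaned table plus the PUMAs left with zero total probability;
    the real helper would fill those from a nearest-neighbour donor PUMA.
    """
    cleaned, needs_donor = {}, []
    for puma, dist in probs.items():
        kept = {u: (0.0 if u in excluded else p) for u, p in dist.items()}
        total = sum(kept.values())
        if total == 0:
            needs_donor.append(puma)
            cleaned[puma] = kept
        else:
            cleaned[puma] = {u: p / total for u, p in kept.items()}
    return cleaned, needs_donor


def sample_utility(dist, rng):
    """Draw one utility for a building according to its PUMA's probabilities."""
    utilities = list(dist)
    return rng.choices(utilities, weights=[dist[u] for u in utilities], k=1)[0]


# Hypothetical inputs: two PUMAs, one served only by an excluded utility.
overlap = {"03701": {"coned": 30.0, "bath": 10.0}, "03702": {"bath": 5.0}}
probs = overlap_to_probabilities(overlap)
cleaned, needs_donor = zero_excluded_and_renormalize(probs, excluded={"bath"})
choice = sample_utility(cleaned["03701"], random.Random(0))
```

Passing a seeded random.Random into the sampler keeps building-level assignments reproducible between runs.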

NREL scripts relocated

fetch_resstock_data.py and copy_resstock_data.py moved from
data/resstock/ to data/resstock/nrel/ to separate NREL-specific fetch
logic from the pipeline orchestration.


Adding a new state

See context/code/data/ny_utility_assignment_resstock.md § Adding a new state
for the full checklist. In brief:

  1. state_configs.yaml — add a state entry with state_fips,
    add_vulnerability_columns, electric_poly_filename, gas_poly_filename
    (and state_crs + puma_year for GIS states). The state is automatically
    included in SUPPORTED_UTILITY_STATES once both polygon filename keys are
    present.
  2. assign_utility_{xx}.py — create the state module following the NY
    (GIS) or RI (rule-based) pattern. GIS states delegate to the generic
    create_hh_utilities() in utils.py.
  3. assign_utility.py — import the new function and add an
    if state == "XX": branch.
  4. Tests — add tests/test_assign_utility_{xx}.py.

Reviewer focus

  • main.py step ordering — utility assignment (2b) runs before load curve
    steps (2c, 2d) so utility columns are available to downstream work.
    Pre-flight validation is separated from per-step logic.
  • state_configs.yaml as single source of truth — any new per-state
    constant should go here, not be hardcoded in a script.
  • utils.py generic helpers: designed to be reused by any future
    GIS-based state. create_hh_utilities, with utility_name_map and
    excluded_gas_utilities as explicit params, is the intended extension point.

@alexhyunminlee alexhyunminlee linked an issue May 15, 2026 that may be closed by this pull request
@alxsmith alxsmith self-requested a review May 21, 2026 20:06

@alxsmith alxsmith left a comment

Contributor


It looks like assign_utility_ny.py contains a lot of function definitions that will be recycled for new states. Move the generic functions to assign_utility.py and import them into the state-specific modules.

The intent of the state-specific files would then be to do any one-off data transformations needed before passing the formatted shapefiles to our established pipeline.

Comment thread data/resstock/main.py Outdated
Comment thread data/resstock/main.py Outdated
Comment thread data/resstock/main.py Outdated
Comment thread data/resstock/main.py Outdated

@alxsmith alxsmith left a comment

Contributor


Overall this is a great start. The main restructuring I suggest is a clear separation of state-specific settings from our main data pipeline.

The issue is that state-specific behavior leaks into main.py in ways that require editing main.py itself when you onboard a new state.
Specifically:

  1. _assign_utility has state dispatch logic and state-specific data loading inline
    https://github.com/switchbox-data/rate-design-platform/blob/HEAD/data/resstock/main.py#L459-L511
    The RI branch is 1 line. The NY branch is ~50 lines of polygon loading, pygris calls, and CONFIGS wiring. Adding CT means adding another elif with its own inline data loading. main.py grows with every state.
  2. NY-specific CLI args at the top level of main.py
    --ny-electric-poly-filename, --ny-gas-poly-filename, --path-s3-gis-dir are NY implementation details surfaced as pipeline-level arguments. CT would need --ct-electric-poly-filename, etc.
  3. --add-vulnerability-columns default of True is a NY assumption
    The Justfile knows RI needs --add-vulnerability-columns False, but main.py defaults to True.

Proposed design: state config YAML + uniform module interface

Move per-state decisions into a config file that main.py reads, and give utility assignment a uniform interface that main.py dispatches to without knowing state internals.

A. Per-state config: data/resstock/states/<state>.yaml

Each onboarded state gets a small YAML file:

# data/resstock/states/ny.yaml
state: NY
add_vulnerability_columns: true
utility_assignment:
  module: data.resstock.assign_utility_ny
  kwargs:
    s3_gis_dir: s3://data.sb/gis/utility_boundaries/
    electric_poly_filename: ny_electric_utilities_20260309.csv
    gas_poly_filename: ny_gas_utilities_20260309.csv
    puma_year: 2019

B. Uniform utility assignment interface
Each state module exports a single function with a uniform signature:

def assign_utility(metadata: pl.LazyFrame, **kwargs) -> pl.LazyFrame:
    """Returns LazyFrame with bldg_id, sb.electric_utility, sb.gas_utility."""
    ...

# data/resstock/assign_utility_ny.py
def assign_utility(metadata: pl.LazyFrame, **kwargs) -> pl.LazyFrame:
    """kwargs: s3_gis_dir, electric_poly_filename, gas_poly_filename, puma_year"""
    # All the polygon loading, pygris calls, CONFIGS wiring lives HERE,
    # not in main.py

main.py dispatch becomes generic

def _assign_utility(*, states, path_sb, state_configs, ...):
    for s in states:
        cfg = state_configs[s]
        ua_cfg = cfg.get("utility_assignment")
        if ua_cfg is None:
            print(f"  No utility assignment configured for {s}, skipping.")
            continue
        mod = importlib.import_module(ua_cfg["module"])
        result = mod.assign_utility(metadata, **ua_cfg.get("kwargs", {}))
        # ... write result, upload ...
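
For anyone who wants to try the dispatch pattern in isolation, here is a self-contained sketch. It registers a stub module in sys.modules so that importlib.import_module resolves without the real package. The module name demo_assign_utility_ri, the default_utility kwarg, and the "rie" label are hypothetical, and metadata is a plain list of dicts rather than a pl.LazyFrame.

```python
import importlib
import sys
import types

# Register an in-memory stub so the example runs without the real state modules.
stub = types.ModuleType("demo_assign_utility_ri")  # hypothetical module name


def _stub_assign_utility(metadata, **kwargs):
    """Uniform entry point: tag every building with one configured utility."""
    utility = kwargs.get("default_utility", "unknown")
    return [{**row, "sb.electric_utility": utility} for row in metadata]


stub.assign_utility = _stub_assign_utility
sys.modules[stub.__name__] = stub

state_configs = {
    "RI": {
        "utility_assignment": {
            "module": "demo_assign_utility_ri",
            "kwargs": {"default_utility": "rie"},  # hypothetical utility label
        }
    },
    "CT": {},  # no utility assignment configured, so the state is skipped
}


def dispatch(states, metadata_by_state, configs):
    """Call each state's assign_utility without any per-state branching."""
    results = {}
    for s in states:
        ua_cfg = configs[s].get("utility_assignment")
        if ua_cfg is None:
            continue
        mod = importlib.import_module(ua_cfg["module"])
        results[s] = mod.assign_utility(metadata_by_state[s], **ua_cfg.get("kwargs", {}))
    return results


results = dispatch(["RI", "CT"], {"RI": [{"bldg_id": 1}], "CT": []}, state_configs)
```

The dispatcher only knows the module path and kwargs from the config, so onboarding a state touches no code in it.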

What would need to change

  • Create data/resstock/states/ny.yaml and ri.yaml — extract from current hardcoded values (small)
  • Add state config loading to main.py — read YAMLs, merge with CLI args; ~30 lines in main() to load and validate (small)
  • Give assign_utility_ny.py a uniform assign_utility(metadata, **kwargs) entry point — move polygon/pygris loading out of main.py into the module; logic exists at main.py:462–507 (medium)
  • Give assign_utility_ri.py the same assign_utility(metadata, **kwargs) — trivial wrapper around existing assign_utility_ri (tiny)
  • Remove --ny-electric-poly-filename, --ny-gas-poly-filename, --path-s3-gis-dir from main.py CLI — delete args, remove from _assign_utility (small)
  • Remove if s == "RI" / elif s == "NY" from _assign_utility — replace with importlib.import_module dispatch (small)
  • Move --add-vulnerability-columns default to state config — read from states/<s>.yaml, CLI becomes override-only (small)
  • Delete SUPPORTED_UTILITY_STATES from assign_utility.py — state discovery is now "does states/<s>.yaml exist?" (tiny)
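
The "CLI becomes override-only" item could look roughly like the sketch below, where the flag defaults to None and the state config supplies the value unless the flag is passed. The helper names parse_bool and resolve_vulnerability_flags are assumptions, and STATE_DEFAULTS stands in for values read from the state config.

```python
import argparse

# Stand-in for add_vulnerability_columns values read from the per-state config.
STATE_DEFAULTS = {"NY": True, "RI": False}


def parse_bool(text):
    """Accept True/False style strings; reject anything else with a clear error."""
    lowered = text.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True/False, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--state", nargs="+", required=True)
    # default=None means "not given", which lets the state config decide.
    parser.add_argument("--add-vulnerability-columns", type=parse_bool, default=None)
    return parser


def resolve_vulnerability_flags(args):
    """Per-state value: the CLI override if present, else the config default."""
    override = args.add_vulnerability_columns
    return {s: STATE_DEFAULTS[s] if override is None else override for s in args.state}


default_args = build_parser().parse_args(["--state", "NY", "RI"])
override_args = build_parser().parse_args(
    ["--state", "NY", "RI", "--add-vulnerability-columns", "False"]
)
```

Using None as the sentinel distinguishes "flag omitted" from "flag explicitly set to False", which a plain boolean default cannot do.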

What onboarding a new state would look like

| Step | What you create | What changes |
|------|-----------------|--------------|
| 1. Write assignment logic | data/resstock/assign_utility_ct.py with assign_utility(metadata, **kwargs) | New file |
| 2. Create state config | data/resstock/states/ct.yaml with assignment module, kwargs, vulnerability flag | New file |
| 3. Add Justfile recipe (optional) | create-sb-release-for-upgrade-02-CT one-liner | Justfile (append) |
| 4. Run | just run-pipeline CT --upgrade-ids 0 2 | Nothing |

@alexhyunminlee alexhyunminlee requested a review from alxsmith June 3, 2026 19:03
Comment thread data/resstock/utility/assign_utility_ny.py Outdated
Comment thread data/resstock/main.py Outdated
Comment thread data/resstock/utility/assign_utility_ny.py Outdated
Comment thread data/resstock/validations.py Outdated
@alexhyunminlee alexhyunminlee merged commit 5ab61b5 into main Jun 4, 2026
2 checks passed
@alexhyunminlee alexhyunminlee deleted the 424-refactor-resstock-data-fetch-and-prep branch June 4, 2026 18:23
Development

Successfully merging this pull request may close these issues.

Refactor ResStock data fetch and prep
