4 changes: 4 additions & 0 deletions .gitignore
@@ -19,6 +19,10 @@ data/census/pums/zips/
data/census/pums/csv/
data/census/pums/parquet/
data/census/pums/data_dict_cache/
data/resstock/utility/zips/
data/resstock/utility/shapefiles/
data/resstock/utility/csv/
data/resstock/utility/parquet/
data/eia/861/electric_utility_stats/
data/eia/861/parquet/
data/eia/heating_fuel_prices/parquet/
38 changes: 19 additions & 19 deletions context/README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion context/code/data/resstock_data_preparation_run_order.md
@@ -112,7 +112,7 @@ just -f data/resstock/Justfile identify-all-metadata <STATE>
## 3. Add utility assignment (standard release)

Assign electric and gas utilities to buildings in the **standard** release so that downstream steps and the `sb` copy use utility-aware metadata. Run once for upgrade `00`; the assignments should remain constant across upgrades.
For NY-specific details on small gas utilities and nearest-neighbor donor behavior, see `context/code/data/ny_utility_assignment_resstock.md`.
For state-specific details on utility assignment (excluded gas utilities, nearest-neighbor PUMA fill, HIFLD data sources), see `context/code/data/utility_assignment_resstock.md`.

**Via state-specific Justfile:**

2 changes: 1 addition & 1 deletion context/code/data/resstock_sb_release_pipeline_main_py.md
@@ -458,6 +458,6 @@ When `--sample N` is passed (N > 0):

3. **k and include_cooling are hardcoded**: `_approximate_non_hp_load` uses `k=15` and `include_cooling=False`. These should eventually become CLI arguments if they need to vary.

4. **Utility assignment only supports NY and RI**: `SUPPORTED_UTILITY_STATES` is derived dynamically from `data/resstock/state_configs.yaml` — any state whose config entry contains a `utility_assignment` key is included. Adding a new state requires: (a) adding a `utility_assignment` section to the state's entry in `state_configs.yaml` with `module` (and `kwargs` for GIS-based states), and (b) creating the state module with an `assign_utility(metadata, **kwargs)` entry point. No changes to `assign_utility.py` are needed. See `context/code/data/ny_utility_assignment_resstock.md § Adding a new state`.
4. **Utility assignment is state-gated**: `SUPPORTED_UTILITY_STATES` is derived dynamically from `data/resstock/state_configs.yaml` — any state whose config entry contains a `utility_assignment` key is included (currently NY, MD, RI). Adding a new state requires: (a) adding a `utility_assignment` section to the state's entry in `state_configs.yaml` with `module` (and `kwargs` for GIS-based states), and (b) creating the state module with an `assign_utility(metadata, **kwargs)` entry point. No changes to `assign_utility.py` are needed. See `context/code/data/utility_assignment_resstock.md § Adding a new state`.

5. **Monthly loads in sample mode produce N files**: When `--sample N` is active, only N hourly parquets exist locally, so only N monthly parquets are generated. This is expected — sample mode is for development/testing only.
@@ -1,18 +1,19 @@
# ResStock utility assignment

How electric and gas utilities are assigned to ResStock buildings — generic architecture, NY-specific implementation, and instructions for adding new states.
How electric and gas utilities are assigned to ResStock buildings — generic architecture, state-specific implementations (NY, MD, RI), and instructions for adding new states.

**Use when:** Working on utility assignment for any state, adding a new state, excluded gas utility handling, PUMA–utility overlap, or ResStock metadata columns `sb.electric_utility` / `sb.gas_utility`.
**Use when:** Working on utility assignment for any state, adding a new state, excluded gas utility handling, PUMA–utility overlap, nearest-neighbor PUMA fill, or ResStock metadata columns `sb.electric_utility` / `sb.gas_utility`.

---

## Overview

- **Entrypoint:** `assign_utility_ny()` in `data/resstock/utility/assign_utility_ny.py` (and CLI via `assign_utility_ny.py`). This is a thin wrapper that builds the NY utility-name crosswalk and passes NY-specific configuration (utility name map, excluded gas utilities, state CRS) to the generic `create_hh_utilities()` in `data/resstock/utility/utils.py`.
- **Dispatcher:** `data/resstock/utility/assign_utility.py` — reads `state_configs.yaml`, dynamically imports the state-specific module, and calls its `assign_utility(metadata, **kwargs)` entry point. No `if state == "XX":` branches; adding a config entry and a module is enough.
- **State modules:** `assign_utility_ny.py` (GIS-based, HIFLD electric + gas, name crosswalk, excluded utilities), `assign_utility_md.py` (GIS-based, EIA-861 county polygons for electric + HIFLD for gas, nearest-neighbor PUMA fill for gas), `assign_utility_ri.py` (rule-based, no GIS).
- **Inputs:** ResStock metadata (with `in.puma`, `in.heating_fuel`, `has_natgas_connection`), electric and gas utility service-territory polygons (CSV with WKT), Census PUMAs (pygris).
- **Outputs:** Same metadata with `sb.electric_utility` and `sb.gas_utility` added (or overwritten).
- **Logic:** PUMA–utility overlap → PUMA-level probability tables → per-building sampling (deterministic seed). Electric: every building gets an electric utility. Gas: only buildings with `has_natgas_connection` get a gas utility; others get null.
- **Generic functions:** `create_hh_utilities()`, `zero_excluded_gas_utilities_and_renormalize()`, `calculate_puma_utility_overlap()`, `calculate_utility_probabilities()`, `calculate_prior_distributions()`, `sample_utility_per_building()`, `print_comparison_summary()`, `puma_id_series_for_join()`, `read_csv_to_gdf_from_s3()` all live in `data/resstock/utility/utils.py` and are state-generic.
- **Generic functions:** `create_hh_utilities()`, `zero_excluded_gas_utilities_and_renormalize()`, `fill_missing_puma_probabilities()`, `calculate_puma_utility_overlap()`, `calculate_utility_probabilities()`, `calculate_prior_distributions()`, `sample_utility_per_building()`, `print_comparison_summary()`, `puma_id_series_for_join()`, `read_csv_to_gdf_from_s3()` all live in `data/resstock/utility/utils.py` and are state-generic.
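
The overlap → probability → sampling logic listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `utils.py`: the helper names `overlap_to_probabilities` and `sample_utility`, the per-building seeding scheme, and all toy numbers are hypothetical.

```python
import random


def overlap_to_probabilities(overlap):
    """Normalise per-PUMA overlap percentages so each PUMA's utilities sum to 1."""
    probs = {}
    for puma_id, by_utility in overlap.items():
        total = sum(by_utility.values())
        probs[puma_id] = {u: pct / total for u, pct in by_utility.items()}
    return probs


def sample_utility(bldg_id, puma_id, probs, seed=42):
    """Draw one utility; the same (seed, bldg_id) pair always yields the same draw."""
    rng = random.Random(f"{seed}-{bldg_id}")  # hypothetical seeding scheme
    utilities = sorted(probs[puma_id])  # fixed order keeps draws reproducible
    weights = [probs[puma_id][u] for u in utilities]
    return rng.choices(utilities, weights=weights, k=1)[0]


# Toy data (made up): PUMA -> utility -> percent of PUMA area overlapped.
electric_probs = overlap_to_probabilities(
    {"00100": {"util_a": 30.0, "util_b": 10.0}, "00200": {"util_a": 50.0}}
)
gas_probs = overlap_to_probabilities({"00100": {"gas_x": 80.0}, "00200": {"gas_x": 20.0}})

buildings = [
    {"bldg_id": 1, "in.puma": "00100", "has_natgas_connection": True},
    {"bldg_id": 2, "in.puma": "00200", "has_natgas_connection": False},
]
for b in buildings:
    # Every building gets an electric utility.
    b["sb.electric_utility"] = sample_utility(b["bldg_id"], b["in.puma"], electric_probs)
    # Only buildings with a natural gas connection get a gas utility; others get None.
    b["sb.gas_utility"] = (
        sample_utility(b["bldg_id"], b["in.puma"], gas_probs)
        if b["has_natgas_connection"]
        else None
    )
```

Seeding from the building ID rather than from row order means a building keeps its assignment even if the metadata is re-sorted, which is one way to get the "deterministic seed" behavior the list describes.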

---

@@ -63,14 +64,14 @@ After zeroing excluded gas utilities (and optionally replacing bad-PUMA rows wit

## Invocation and data flow

**Recommended (via main.py):** Utility assignment runs as step 2b inside `data/resstock/main.py` (`_assign_utility` function), immediately after metadata transforms (step 2a) and before load curve modifications. It operates directly on the `_sb` release -- reads `metadata-sb.parquet` from the `_sb` tree (after all metadata transforms have been applied), routes to `assign_utility()` in `data/resstock/utility/assign_utility.py` which loads state configuration from `state_configs.yaml` internally, and writes `metadata_utility/state=NY/utility_assignment.parquet` into the `_sb` tree on local EBS, then uploads immediately to S3 via `aws s3 cp`. No separate copy step is needed. See `context/code/data/resstock_sb_release_pipeline_main_py.md` for details.
**Recommended (via main.py):** Utility assignment runs as step 2b inside `data/resstock/main.py` (`_assign_utility` function), immediately after metadata transforms (step 2a) and before load curve modifications. It operates directly on the `_sb` release: it reads `metadata-sb.parquet` from the `_sb` tree (after all metadata transforms have been applied), routes to `assign_utility()` in `data/resstock/utility/assign_utility.py`, which loads state configuration from `state_configs.yaml` internally, writes `metadata_utility/state=<XX>/utility_assignment.parquet` into the `_sb` tree on local EBS, and then uploads it immediately to S3 via `aws s3 cp`. No separate copy step is needed. See `context/code/data/resstock_sb_release_pipeline_main_py.md` for details.

**State support:** A state is included in `SUPPORTED_UTILITY_STATES` when its entry in `data/resstock/state_configs.yaml` contains a `utility_assignment` key. The `utility_assignment` section specifies a `module` (dotted Python module path) and optional `kwargs` (passed to the module's `assign_utility()` function). GIS-based states store polygon filenames, CRS, and PUMA year under `kwargs`; CLI flags `--electric-poly-filename` / `--gas-poly-filename` / `--path-s3-gis-dir` override or supplement these at runtime. Pre-flight validation (`validate_utility_assignment_args`) checks that all requested states are in `SUPPORTED_UTILITY_STATES` before any data processing begins.
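
A minimal sketch of the config-driven gating and dispatch described above, assuming the YAML has already been parsed into a dict. The module names (`demo_assign_md`, `demo_assign_ri`), the `validate_states` helper, and the fake module registered in `sys.modules` are hypothetical stand-ins so the example runs without the real package.

```python
import importlib
import sys
import types

# Hypothetical stand-in for the parsed contents of state_configs.yaml.
STATE_CONFIGS = {
    "MD": {
        "utility_assignment": {
            "module": "demo_assign_md",
            "kwargs": {"state_crs": 2248, "puma_year": 2019},
        }
    },
    "RI": {"utility_assignment": {"module": "demo_assign_ri"}},  # rule-based: no kwargs
    "MA": {},  # no utility_assignment key, so not supported
}

# A state is supported exactly when its entry has a utility_assignment key.
SUPPORTED_UTILITY_STATES = frozenset(
    state for state, cfg in STATE_CONFIGS.items() if "utility_assignment" in cfg
)


def validate_states(states):
    """Fail fast, before any data processing, if a requested state is unsupported."""
    unsupported = sorted(set(states) - SUPPORTED_UTILITY_STATES)
    if unsupported:
        raise ValueError(f"Utility assignment not configured for: {unsupported}")


def assign_utility(state, metadata):
    """Import the state's module by dotted path and call its assign_utility entry point."""
    cfg = STATE_CONFIGS[state]["utility_assignment"]
    module = importlib.import_module(cfg["module"])
    return module.assign_utility(metadata, **cfg.get("kwargs", {}))


# Register a fake MD module so the sketch is runnable on its own.
_demo = types.ModuleType("demo_assign_md")
_demo.assign_utility = lambda metadata, **kwargs: {"rows": len(metadata), **kwargs}
sys.modules["demo_assign_md"] = _demo

validate_states(["MD"])
result = assign_utility("MD", [{"bldg_id": 1}])
```

Because the dispatcher only looks up a module path and keyword arguments, adding a state touches the config and one new module, with no per-state branches in the dispatcher itself.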

**Legacy (individual Justfile recipe):** `assign-utility-ny` in `data/resstock/Justfile` downloads NY polygons, then calls `assign_utility_ny.py` directly with S3 paths. In the old workflow this ran on the **standard** release (step 3), and the output was brought into `_sb` by the copy step (step 4). These individual recipes are still available for debugging.

- **Run order (legacy):** After `identify-hp-and-heating-type-all-upgrades-and-natgas-connection` (metadata has `has_natgas_connection` and `in.puma`). See `context/code/data/resstock_data_preparation_run_order.md`.
- **Output column file:** `metadata_utility/state=NY/utility_assignment.parquet` -- contains only `bldg_id`, `sb.electric_utility`, `sb.gas_utility`.
- **Output column file:** `metadata_utility/state=<XX>/utility_assignment.parquet`, which contains only `bldg_id`, `sb.electric_utility`, `sb.gas_utility`.

---

@@ -85,6 +86,138 @@ After zeroing excluded gas utilities (and optionally replacing bad-PUMA rows wit

---

## Maryland (MD)

MD follows the same GIS-based pattern as NY — PUMA overlap → probability table → per-building sampling — with one structural difference in the electric side: instead of HIFLD utility polygons, MD uses **Census county polygons weighted by EIA Form 861 service territory data**. Gas assignment continues to use HIFLD polygon CSVs, identical to NY.

### Why not HIFLD for electric

HIFLD is missing three of the five largest MD electric utilities (Pepco, Potomac Edison, and Delmarva Power) because those utilities never submitted their boundary shapes to the HIFLD portal. The 2024 HIFLD snapshot for MD covers only BGE, SMECO, Choptank, and a few small municipals, which together hold roughly 60% of MD residential customers by the 2023 EIA-861 counts tabulated later in this section. The remaining roughly 40% would fall back to the nearest HIFLD polygon (BGE), producing systematically wrong assignments for all of Montgomery County, Prince George's County, the Eastern Shore, and western MD.

PJM does not distribute service territory shapefiles (FERC critical infrastructure policy). The Maryland PSC publishes utility reports but no GIS data. No other federal or state source publishes complete sub-county boundaries for all MD utilities.

The most complete authoritative source with full utility coverage is **EIA Form 861 Schedule 8**, which requires every distribution utility to report the counties it serves. PUDL processes this into `core_eia861__yearly_service_territory`, available via HTTPS at the same S3 bucket as the EIA-861 sales data already used in this pipeline.

### Electric utility assignment: county-weighted PUMA overlap

**Data sources**

- **EIA-861 county service territory:** PUDL `core_eia861__yearly_service_territory.parquet` (PUDL stable release v2026.2.0). Maps each utility to the counties it serves. MD has 23 counties plus Baltimore City (24 county-equivalents); the 2023 data has 46 (county, utility) rows, so many counties are served by more than one utility.
- **EIA-861 utility stats:** Our existing `s3://data.sb/eia/861/electric_utility_stats/` (year=2023/state=MD). Provides statewide residential customer counts used to weight split counties.
- **Census county polygons:** `pygris.counties(state="MD", year=2019)` — standard TIGER/Line county boundaries, 2019 vintage to match the PUMA year.

**Pre-processing: `data/eia/861/fetch_service_territory.py`**

Run once (or annually) to produce:

```
s3://data.sb/eia/861/service_territory/state=MD/data.parquet
```

Schema: `county_id_fips`, `county`, `utility_id_eia`, `utility_name_eia`, `residential_customers`, `weight`, `report_year`.

- Only distribution utilities with residential customers > 0 are included (retail marketers and power marketers are excluded).
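- `weight` normalises `residential_customers` within each county so that the weights of the utilities serving a county sum to 1.0. For single-utility counties, weight = 1.0. For split counties, each utility's statewide MD residential customer count, divided by the total across the utilities serving that county, is used as a proxy for its in-county share.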

Invoked via:

```
just -f data/eia/861/Justfile fetch-service-territory MD
```
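
The `weight` column described above can be illustrated with a small pure-Python sketch. It is not the code in `fetch_service_territory.py`: the function name `county_weights`, the input shapes, and the `retail_marketer` row are assumptions, and the customer counts are the rounded figures from the utility table in this document.

```python
from collections import defaultdict


def county_weights(territory_rows, statewide_customers):
    """Return {(county, utility): weight}, with weights summing to 1.0 per county."""
    by_county = defaultdict(list)
    for county, utility in territory_rows:
        # Keep only utilities that report residential customers > 0.
        if statewide_customers.get(utility, 0) > 0:
            by_county[county].append(utility)

    weights = {}
    for county, utilities in by_county.items():
        total = sum(statewide_customers[u] for u in utilities)
        for u in utilities:
            # Statewide customer share among the utilities serving this county.
            weights[(county, u)] = statewide_customers[u] / total
    return weights


# Toy (county, utility) rows; county names stand in for county_id_fips.
territory_rows = [
    ("frederick", "bge"),
    ("frederick", "potomac_edison"),
    ("frederick", "retail_marketer"),  # hypothetical row, dropped (0 customers)
    ("washington", "potomac_edison"),
]
statewide_customers = {"bge": 1_208_000, "potomac_edison": 253_000, "retail_marketer": 0}

weights = county_weights(territory_rows, statewide_customers)
```

In this toy input the split county's weights follow the statewide ratio of the two utilities, while the single-utility county gets weight 1.0.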

**Runtime: `assign_utility_md.py`**

`assign_utility()` calls `calculate_puma_county_utility_overlap()` (in `utils.py`) instead of `calculate_puma_utility_overlap()`. The function:

1. Projects PUMAs and county polygons to `state_crs` (2248).
2. Computes the intersection area of every (PUMA, county) pair via `gpd.overlay(..., how="intersection")`.
3. For each (PUMA, county) pair, fans out to one row per utility serving that county, with `pct_overlap = overlap_area × weight / puma_area × 100`.
4. Groups by (PUMA, utility) and sums — so a PUMA spanning multiple counties accumulates weighted contributions from each.
5. Returns a LazyFrame with `puma_id`, `utility`, `pct_overlap` — identical format to `calculate_puma_utility_overlap`, plugging directly into the existing probability and sampling machinery.
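
Steps 3 and 4 above (fan-out by utility, then group and sum) can be sketched without any GIS dependency by starting from precomputed intersection areas. The function name `puma_utility_overlap`, the dict-based inputs, and the toy areas and weights are assumptions for illustration; the real function works on GeoDataFrames and returns a LazyFrame.

```python
from collections import defaultdict


def puma_utility_overlap(intersections, puma_area, county_weights):
    """Sum overlap_area * weight / puma_area * 100 over counties, per (PUMA, utility)."""
    totals = defaultdict(float)
    for (puma_id, county), overlap_area in intersections.items():
        # Fan out: one contribution per utility serving this county.
        for (w_county, utility), weight in county_weights.items():
            if w_county == county:
                totals[(puma_id, utility)] += overlap_area * weight / puma_area[puma_id] * 100
    return dict(totals)


# Toy inputs (made up): one PUMA of area 100 split 60/40 across two counties.
intersections = {("P1", "frederick"): 60.0, ("P1", "washington"): 40.0}
puma_area = {"P1": 100.0}
county_weights = {
    ("frederick", "bge"): 0.8,
    ("frederick", "potomac_edison"): 0.2,
    ("washington", "potomac_edison"): 1.0,
}

pct_overlap = puma_utility_overlap(intersections, puma_area, county_weights)
```

Here the second utility picks up weight from both counties, which is the accumulation effect that step 4 describes for PUMAs spanning a county line.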

**Granularity and accuracy**

PUMAs that straddle a county line get geographic signal from both sides. A PUMA in western Frederick County (split BGE/Potomac Edison) that also overlaps Washington County (Potomac Edison-only) accumulates extra weight toward Potomac Edison; one in eastern Frederick that overlaps Howard County (BGE-only) tilts toward BGE. For PUMAs entirely inside a split county, the statewide customer-share proxy applies.

Major utilities covered (2023 EIA-861, MD distribution utilities):

| Utility | EIA ID | Std name | Residential customers |
| --------------------------------- | ------ | ---------------- | --------------------- |
| Baltimore Gas & Electric Co | 1167 | `bge` | 1,208k |
| Potomac Electric Power Co (Pepco) | 15270 | `pepco` | 548k |
| The Potomac Edison Company | 15263 | `potomac_edison` | 253k |
| Delmarva Power | 5027 | `delmarva` | 185k |
| Southern Maryland Elec Coop | 17637 | `smeco` | 159k |
| Choptank Electric Cooperative | 3503 | `choptank` | ~30k |
| Somerset Rural Electric Coop | 84 | `somerset_rec` | small |
| Town of Berlin (MD) | 1615 | `berlin_muni` | small |

**EIA utility ID → std name mapping** is defined in `_EIA_ID_TO_STD_NAME` in `assign_utility_md.py`. Utilities not in the map fall back to the EIA name string.
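
As a sketch of the fallback behavior, the mapping can be read as a dict lookup that defaults to the EIA name. The entries below are copied from the table above; the actual contents of `_EIA_ID_TO_STD_NAME` may differ, and the helper name `std_name` is hypothetical.

```python
# Entries taken from the utility table in this document (assumed, not verified
# against assign_utility_md.py).
_EIA_ID_TO_STD_NAME = {
    1167: "bge",
    15270: "pepco",
    15263: "potomac_edison",
    5027: "delmarva",
    17637: "smeco",
    3503: "choptank",
    84: "somerset_rec",
    1615: "berlin_muni",
}


def std_name(utility_id_eia, utility_name_eia):
    """Return the std name for a known EIA ID, else fall back to the EIA name string."""
    return _EIA_ID_TO_STD_NAME.get(utility_id_eia, utility_name_eia)


known = std_name(1167, "Baltimore Gas & Electric Co")
unknown = std_name(99999, "Some Unmapped Utility")  # hypothetical ID
```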

### Gas utility assignment

Unchanged from the original HIFLD-based approach:

- **Source:** HIFLD Open "Natural Gas Service Territories" feature layer (archived at DataLumos). Fetch via `load_utility_boundaries()`.
- **Cached** as dated WKT CSV in `s3://data.sb/gis/utility_boundaries/`; filename in `state_configs.yaml` under `MD.utility_assignment.kwargs.gas_poly_filename`.
- **MD LDCs present** (as of 2024 snapshot):
  - Baltimore Gas and Electric Co: 92.9% of natgas-connected buildings
  - Columbia Gas of Washington/Maryland: 3.4%
  - Sand-Piper Energy: 2.3%
  - Easton Utilities: 1.3%
  - Elkton Gas Company: 0.1%

### PUMA boundaries

- **Source:** `pygris.pumas(state="MD", year=2019, cb=True)`.
- **Vintage:** 2019 — 2010-definition PUMAs. MD has 44 PUMAs.
- **CRS:** `state_crs: 2248` — NAD83 / Maryland State Plane (feet).
- **Load/cache:** via `load_pumas()` in `utils.py` (local cache → S3 → pygris fallback).
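
The local cache → S3 → pygris order can be pictured as a simple fallback chain. This is a generic sketch, not the body of `load_pumas()`: the helper `load_with_fallback` and the three stand-in loaders are hypothetical, and whether the real function writes results back to the cache is not shown here.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) in order; return the first result that loads."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # a real implementation would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


def from_local_cache():
    raise FileNotFoundError("no cached PUMA file")  # simulate a cache miss


def from_s3():
    raise ConnectionError("S3 object not found")  # simulate a missing S3 copy


def from_pygris():
    return ["puma polygons"]  # placeholder for a pygris.pumas(...) download


source, pumas = load_with_fallback(
    [("local", from_local_cache), ("s3", from_s3), ("pygris", from_pygris)]
)
```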

### Nearest-neighbor PUMA fill

County polygons cover all of Maryland's land area by definition, so the electric assignment has no coverage gaps — every PUMA overlaps at least one county. The nearest-neighbor fill (`fill_missing_puma_probabilities`) is still called for electric as a safety net but should produce zero fills in practice.

Gas coverage gaps remain (HIFLD gas boundaries still have the same rural/suburban gaps as before), so the nearest-neighbor fill is genuinely needed for gas, with the same behavior as documented in the previous HIFLD-based approach.
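
One way to picture the nearest-neighbor fill: a PUMA with no probability rows borrows the distribution of its nearest PUMA that has one. The sketch below assumes "nearest" means smallest centroid-to-centroid distance, which may not match what `fill_missing_puma_probabilities` does; the function name `fill_missing` and the toy coordinates are hypothetical.

```python
import math


def fill_missing(probs, centroids):
    """Copy the nearest donor PUMA's probabilities into PUMAs that have none."""
    filled = {puma_id: dict(p) for puma_id, p in probs.items()}
    donors = [p for p in centroids if p in probs]
    for puma_id, xy in centroids.items():
        if puma_id in probs:
            continue  # already covered by at least one utility polygon
        nearest = min(donors, key=lambda d: math.dist(xy, centroids[d]))
        filled[puma_id] = dict(probs[nearest])
    return filled


# Toy data (made up): PUMA "C" has no gas coverage and sits closest to "A".
probs = {"A": {"gas_x": 1.0}, "B": {"gas_x": 0.4, "gas_y": 0.6}}
centroids = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (1.0, 0.0)}

filled = fill_missing(probs, centroids)
```

With full county coverage on the electric side, every PUMA would already be in `probs`, so a fill of this kind would be a no-op there.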

### No excluded gas utilities

`excluded_gas_utilities` is not set for MD. All gas LDCs are assigned; none are zeroed before sampling.

### `state_configs.yaml` kwargs for MD

```yaml
state_crs: 2248
puma_year: 2019
electric_service_territory_s3_path: "s3://data.sb/eia/861/service_territory/state=MD/data.parquet"
gas_poly_filename: "md_gas_utilities_20260605.csv"
```

### Full MD pipeline (start to finish)

```bash
# 1. Fetch EIA-861 utility stats to S3 (if not already current)
just -f data/eia/861/Justfile update

# 2. Fetch county service territory weights to S3 (electric assignment data)
just -f data/eia/861/Justfile fetch-service-territory MD

# 3. Download ResStock metadata for MD
just s MD fetch-resstock-metadata

# 4. Run utility assignment
just s MD assign-utility

# 5. Upload utility assignment to S3
just s MD upload-utility-assignment
```

### Output

`metadata_utility/state=MD/utility_assignment.parquet` — `bldg_id`, `sb.electric_utility`, `sb.gas_utility`. Written to local EBS and uploaded to `s3://data.sb/nrel/resstock/res_2024_amy2018_2_sb/metadata_utility/state=MD/utility_assignment.parquet`.

---

## Adding a new state

There are two patterns: **GIS-based** (like NY and MD: PUMA overlap + probabilistic sampling) and **rule-based** (like RI: deterministic assignment, no GIS). Follow the checklist for the appropriate pattern.
6 changes: 6 additions & 0 deletions data/eia/861/Justfile
@@ -29,6 +29,12 @@ upload:
fetch-state-stats state:
uv run python "{{ path_local_repo }}/data/eia/861/fetch_electric_utility_stat_parquets.py" {{ state }}

# Fetch county-level service territory for a state and upload to S3.
# Requires EIA-861 utility stats to already be on S3 (run `just update` first).
# Output: s3://data.sb/eia/861/service_territory/state=<STATE>/data.parquet
fetch-service-territory state year="2023":
uv run python "{{ path_local_repo }}/data/eia/861/fetch_service_territory.py" {{ state }} --year {{ year }}

# Remove local parquet/
clean:
rm -rf "{{ path_local_parquet }}"
63 changes: 2 additions & 61 deletions data/eia/861/fetch_electric_utility_stat_parquets.py
@@ -39,68 +39,9 @@

import polars as pl

from data.eia.constants import PUDL_YEARLY_SALES_URL, VALID_STATE_CODES
from utils.utility_codes import get_eia_utility_id_to_std_name

# EIA-861 yearly sales (PUDL Catalyst Coop stable release; see https://github.com/catalyst-cooperative/pudl/releases)
PUDL_STABLE_VERSION = "v2026.2.0"
CORE_EIA861_YEARLY_SALES_URL = f"https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/{PUDL_STABLE_VERSION}/core_eia861__yearly_sales.parquet"

VALID_STATE_CODES = frozenset(
{
"al",
"ak",
"az",
"ar",
"ca",
"co",
"ct",
"de",
"fl",
"ga",
"hi",
"id",
"il",
"in",
"ia",
"ks",
"ky",
"la",
"me",
"md",
"ma",
"mi",
"mn",
"ms",
"mo",
"mt",
"ne",
"nv",
"nh",
"nj",
"nm",
"ny",
"nc",
"nd",
"oh",
"ok",
"or",
"pa",
"ri",
"sc",
"sd",
"tn",
"tx",
"ut",
"vt",
"va",
"wa",
"wv",
"wi",
"wy",
"dc",
}
)

# Fixed order for lazy aggregation and column output; must match dataset (validated in tests).
CUSTOMER_CLASSES_ORDERED = (
"commercial",
@@ -185,7 +126,7 @@ def _output_columns() -> list[str]:

def _base_lazy() -> pl.LazyFrame:
"""Scan EIA-861 and add report year; no entity-type or latest-date filter."""
return pl.scan_parquet(CORE_EIA861_YEARLY_SALES_URL).with_columns(
return pl.scan_parquet(PUDL_YEARLY_SALES_URL).with_columns(
pl.col("report_date").dt.year().alias("year")
)
