Explicitly use default datasets for US state-level simulations #2953

DTrim99 · 2025-12-04T21:41:15Z

Note: this summary is now deprecated

Summary

- Update _setup_data to pass through hf:// and gs:// URLs directly
- Allows frontend to specify state-specific datasets via full URL
- Maintains backward compatibility with enhanced_cps keyword

Changes

The _setup_data method now checks if the dataset parameter starts with hf:// or gs:// and passes it through unchanged. This enables the frontend to specify state-specific datasets like:

hf://policyengine/policyengine-us-data/states/CA.h5

Dataset Selection Logic

| Input | Output |
| --- | --- |
| `hf://...` URL | Pass through unchanged |
| `gs://...` URL | Pass through unchanged |
| `enhanced_cps` | `gs://policyengine-us-data/enhanced_cps_2024.h5` |
| `None` + US state | `gs://policyengine-us-data/pooled_3_year_cps_2023.h5` (fallback) |
| `None` + US/UK | `None` |
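
The table above can be read as a small resolver function. The following is a minimal sketch of that originally proposed (and now deprecated) logic, not the actual `_setup_data` implementation. The function name `resolve_dataset`, its parameters, and the rule that a US region other than `"us"` denotes a state are assumptions made for illustration; only the URLs and the `enhanced_cps` keyword come from the table.

```python
from typing import Optional

ENHANCED_CPS_URL = "gs://policyengine-us-data/enhanced_cps_2024.h5"
POOLED_CPS_URL = "gs://policyengine-us-data/pooled_3_year_cps_2023.h5"


def resolve_dataset(
    dataset: Optional[str], country_id: str, region: str
) -> Optional[str]:
    """Hypothetical stand-in for the selection rules in the table."""
    # Full hf:// or gs:// URLs are passed through unchanged.
    if dataset is not None and dataset.startswith(("hf://", "gs://")):
        return dataset
    # The legacy keyword still maps to the enhanced CPS file.
    if dataset == "enhanced_cps":
        return ENHANCED_CPS_URL
    # Assumption: for the US, any region other than "us" is a state.
    if dataset is None and country_id == "us" and region != "us":
        return POOLED_CPS_URL
    # Nationwide US/UK runs return None. Inputs not listed in the table
    # also fall through to None here, which is an assumption.
    return None
```

For example, `resolve_dataset("hf://policyengine/policyengine-us-data/states/CA.h5", "us", "ca")` would return the URL it was given, while `resolve_dataset(None, "us", "ca")` would fall back to the pooled CPS file.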

Test plan

- New tests for HF URL passthrough
- New tests for GS URL passthrough
- Existing tests still pass
- Backward compatible with enhanced_cps keyword

Related PRs

Add state-specific dataset URLs for US state simulations policyengine-app-v2#497 - Frontend changes to send state-specific HF URLs

🤖 Generated with Claude Code

- Update _setup_data to pass through hf:// and gs:// URLs directly
- Allows frontend to specify state-specific datasets via full URL
- Maintains backward compatibility with enhanced_cps keyword
- Fallback to pooled CPS for states when no dataset specified
- Add tests for HF URL passthrough behavior

This change works with policyengine-app-v2 to enable state-specific datasets at hf://policyengine/policyengine-us-data/states/{STATE}.h5

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

codecov · 2025-12-04T22:07:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.87%. Comparing base (81b7179) to head (8532c6f).
⚠️ Report is 12 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #2953   +/-   ##
=======================================
  Coverage   74.87%   74.87%           
=======================================
  Files          53       53           
  Lines        1966     1966           
  Branches      254      254           
=======================================
  Hits         1472     1472           
  Misses        444      444           
  Partials       50       50

anth-volk · 2025-12-08T09:27:44Z

@DTrim99 We should chat about this today

anth-volk · 2025-12-09T10:55:37Z

As of PolicyEngine/policyengine.py#191, we now by default use Ben's state datasets in the sim API and .py (if using the maintenance branch below version 1). Instead of moving the dataset specification logic upward all the way to app v2 as part of PolicyEngine/policyengine-app-v2#497, I'd propose actually moving it downward - just have the new .py manage the default dataset and let everything on top just say "please choose the default dataset," unless we really need to override it.

What I'm going to do is hijack this PR just to have it select the default dataset for US states, which will be Ben's datasets. I don't think we should pass through HF or GCS links here, and in fact, we shouldn't be using HF at all due to its rate-limiting policy.

anth-volk · 2025-12-09T11:23:52Z

@PolicyEngine summarize what this PR does now, following a series of changes

policyengine · 2025-12-09T11:24:00Z

Summary

Following the changes made in this PR, here's what it now does:

Core Change: This PR modifies the dataset selection logic for US state-level simulations to return None (letting policyengine.py choose the default dataset) instead of explicitly specifying a dataset path.

Specific Changes:

Dataset Selection Logic (in economy_service.py):
- enhanced_cps → Returns gs://policyengine-us-data/enhanced_cps_2024.h5 (unchanged)
- NYC region → Returns gs://policyengine-us-data/pooled_3_year_cps_2023.h5 (special case added)
- US state-level simulations → Returns None (changed from returning pooled CPS) (from Anthony: this means we use the default as set up in .py; for the US, this now means Enhanced CPS for nationwide simulations and Ben's state-level datasets for state-level sims)
- All other cases → Returns None

Why the change: Per @anth-volk's comment, instead of having the API specify state-specific datasets, the logic is being pushed down to policyengine.py (via PolicyEngine/policyengine.py#191). This way, policyengine.py manages the default dataset selection, and the API just says "use the default" by passing None.

NYC Exception: NYC is a special case that still needs to explicitly use the pooled 3-year CPS dataset rather than any state-specific dataset.

Files Modified:

policyengine_api/services/economy_service.py: Updated _setup_data() method
tests/unit/services/test_economy_service.py: Updated tests to reflect new behavior
changelog_entry.yaml: Added changelog entry

The PR effectively simplifies the dataset selection at the API level by delegating to policyengine.py for default dataset management, except for the NYC special case.
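
The selection rules summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `economy_service.py`: the function name `select_dataset`, the region identifier `"nyc"`, and the order in which the `enhanced_cps` and NYC checks run are all hypothetical. Only the two URLs and the "return None to use the default" behavior come from the summary.

```python
from typing import Optional

ENHANCED_CPS_URL = "gs://policyengine-us-data/enhanced_cps_2024.h5"
POOLED_CPS_URL = "gs://policyengine-us-data/pooled_3_year_cps_2023.h5"


def select_dataset(dataset: Optional[str], region: str) -> Optional[str]:
    """Hypothetical sketch of the post-change dataset selection."""
    # The explicit keyword keeps its existing mapping.
    # Assumption: this check takes precedence over the NYC special case.
    if dataset == "enhanced_cps":
        return ENHANCED_CPS_URL
    # NYC still pins the pooled 3-year CPS explicitly.
    # Assumption: the region identifier is the string "nyc".
    if region == "nyc":
        return POOLED_CPS_URL
    # Everything else, including US states, returns None so that
    # policyengine.py picks its own default dataset.
    return None
```

Under these assumptions, `select_dataset(None, "ca")` returns `None`, leaving the choice of a state-level dataset to the lower layer, while `select_dataset(None, "nyc")` still returns the pooled CPS URL.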

DTrim99 mentioned this pull request Dec 4, 2025

Add state-specific dataset URLs for US state simulations PolicyEngine/policyengine-app-v2#497

Closed

4 tasks

Fix black formatting and add changelog entry

a2c821b

Fix black formatting

0a4071f

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

DTrim99 requested a review from MaxGhenis December 4, 2025 22:12

DTrim99 requested a review from anth-volk December 8, 2025 16:49

anth-volk added 2 commits December 9, 2025 15:11

fix: Use default dataset for US state-level simulations

5efdbef

fix: Ensure that NYC still uses Pooled 3-Year CPS

07582e2

anth-volk changed the title ~~Support HuggingFace dataset URLs in economy service~~ Explicitly use default datasets for US state-level simulations Dec 9, 2025

policyengine bot added the ⚙️ Engineering... label Dec 9, 2025

policyengine bot removed the ⚙️ Engineering... label Dec 9, 2025

test: Update tests

8532c6f

anth-volk approved these changes Dec 9, 2025

anth-volk merged commit 9a7bfc0 into master Dec 9, 2025
7 checks passed

anth-volk deleted the feature/state-specific-datasets branch December 9, 2025 16:28
