@DTrim99
Collaborator

@DTrim99 DTrim99 commented Dec 4, 2025

Note: this summary is now deprecated

Summary

  • Update _setup_data to pass through hf:// and gs:// URLs directly
  • Allows frontend to specify state-specific datasets via full URL
  • Maintains backward compatibility with enhanced_cps keyword

Changes

The _setup_data method now checks if the dataset parameter starts with hf:// or gs:// and passes it through unchanged. This enables the frontend to specify state-specific datasets like:

hf://policyengine/policyengine-us-data/states/CA.h5

Dataset Selection Logic

| Input | Output |
|---|---|
| hf://... URL | Pass through unchanged |
| gs://... URL | Pass through unchanged |
| enhanced_cps | gs://policyengine-us-data/enhanced_cps_2024.h5 |
| None + US state | gs://policyengine-us-data/pooled_3_year_cps_2023.h5 (fallback) |
| None + US/UK | None |
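
For illustration only, the originally proposed selection rules in the table above could be sketched roughly as follows. This is not the actual _setup_data implementation: the function name `select_dataset_original`, its parameters, and the way a region is represented (a `country_id` plus a `region` string, with `region == country_id` meaning a nationwide run) are all assumptions made for this sketch. Only the two gs:// paths and the hf:// and gs:// prefixes come from the description.

```python
from typing import Optional

# Paths taken from the table in the PR description.
ENHANCED_CPS_URL = "gs://policyengine-us-data/enhanced_cps_2024.h5"
POOLED_CPS_URL = "gs://policyengine-us-data/pooled_3_year_cps_2023.h5"

# URL schemes that are passed through unchanged.
PASSTHROUGH_PREFIXES = ("hf://", "gs://")


def select_dataset_original(
    dataset: Optional[str], country_id: str, region: str
) -> Optional[str]:
    """Hypothetical stand-in for the originally proposed logic.

    Assumption: a nationwide simulation is signalled by region == country_id
    (e.g. "us"/"us"), and anything else under "us" is a state.
    """
    # Full hf:// or gs:// URLs are returned as-is.
    if dataset is not None and dataset.startswith(PASSTHROUGH_PREFIXES):
        return dataset
    # Backward-compatible keyword.
    if dataset == "enhanced_cps":
        return ENHANCED_CPS_URL
    # No dataset given for a US state: fall back to the pooled CPS.
    if dataset is None and country_id == "us" and region != "us":
        return POOLED_CPS_URL
    # Nationwide US/UK runs (and, by assumption, unrecognised keywords).
    return None
```

Note that this reflects the deprecated proposal only; the merged behaviour differs, as described later in the conversation.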

Test plan

  • New tests for HF URL passthrough
  • New tests for GS URL passthrough
  • Existing tests still pass
  • Backward compatible with enhanced_cps keyword

Related PRs

🤖 Generated with Claude Code

- Update _setup_data to pass through hf:// and gs:// URLs directly
- Allows frontend to specify state-specific datasets via full URL
- Maintains backward compatibility with enhanced_cps keyword
- Fallback to pooled CPS for states when no dataset specified
- Add tests for HF URL passthrough behavior

This change works with policyengine-app-v2 to enable state-specific
datasets at hf://policyengine/policyengine-us-data/states/{STATE}.h5

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@codecov

codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.87%. Comparing base (81b7179) to head (8532c6f).
⚠️ Report is 12 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2953   +/-   ##
=======================================
  Coverage   74.87%   74.87%           
=======================================
  Files          53       53           
  Lines        1966     1966           
  Branches      254      254           
=======================================
  Hits         1472     1472           
  Misses        444      444           
  Partials       50       50           

@DTrim99 DTrim99 requested a review from MaxGhenis December 4, 2025 22:12
@anth-volk
Collaborator

@DTrim99 We should chat about this today

@DTrim99 DTrim99 requested a review from anth-volk December 8, 2025 16:49
@anth-volk
Collaborator

As of PolicyEngine/policyengine.py#191, we now by default use Ben's state datasets in the sim API and .py (if using the maintenance branch below version 1). Instead of moving the dataset specification logic upward all the way to app v2 as part of PolicyEngine/policyengine-app-v2#497, I'd propose actually moving it downward - just have the new .py manage the default dataset and let everything on top just say "please choose the default dataset," unless we really need to override it.

What I'm going to do is hijack this PR just to have it select the default dataset for US states, which will be Ben's datasets. I don't think we should pass through HF or GCS links here, and in fact, we shouldn't be using HF links at all due to Hugging Face's rate limiting policy.

@anth-volk anth-volk changed the title from "Support HuggingFace dataset URLs in economy service" to "Explicitly use default datasets for US state-level simulations" Dec 9, 2025
@anth-volk
Collaborator

@PolicyEngine summarize what this PR does now, following a series of changes

@policyengine

policyengine bot commented Dec 9, 2025

Summary

Following the changes made in this PR, here's what it now does:

Core Change: This PR modifies the dataset selection logic for US state-level simulations to return None (letting policyengine.py choose the default dataset) instead of explicitly specifying a dataset path.

Specific Changes:

  1. Dataset Selection Logic (in economy_service.py):

    • enhanced_cps → Returns gs://policyengine-us-data/enhanced_cps_2024.h5 (unchanged)
    • NYC region → Returns gs://policyengine-us-data/pooled_3_year_cps_2023.h5 (special case added)
    • US state-level simulations → Returns None (changed from returning pooled CPS) (from Anthony: this means we use the default as set up in .py; for the US, this now means Enhanced CPS for nationwide simulations and Ben's state-level datasets for state-level sims)
    • All other cases → Returns None
  2. Why the change: Per @anth-volk's comment, instead of having the API specify state-specific datasets, the logic is being pushed down to policyengine.py (via PR #191). This way, policyengine.py manages the default dataset selection, and the API just says "use the default" by passing None.

  3. NYC Exception: NYC is a special case that still needs to explicitly use the pooled 3-year CPS dataset rather than any state-specific dataset.
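
As a rough sketch, the selection rules listed in item 1 could look something like the code below. It is not the merged _setup_data code: the function name `select_dataset`, its parameters, the `"nyc"` region identifier, and the order in which the checks run are assumptions for illustration. Only the two gs:// paths and the four cases themselves come from the summary.

```python
from typing import Optional

# Paths taken from the summary above.
ENHANCED_CPS_URL = "gs://policyengine-us-data/enhanced_cps_2024.h5"
POOLED_CPS_URL = "gs://policyengine-us-data/pooled_3_year_cps_2023.h5"


def select_dataset(
    dataset: Optional[str], country_id: str, region: str
) -> Optional[str]:
    """Hypothetical stand-in for the dataset selection after this PR.

    Returning None means "let policyengine.py pick its default dataset".
    """
    # Keyword kept for backward compatibility (unchanged behaviour).
    if dataset == "enhanced_cps":
        return ENHANCED_CPS_URL
    # NYC special case: still pinned to the pooled 3-year CPS.
    # Assumption: NYC is identified by region == "nyc".
    if country_id == "us" and region == "nyc":
        return POOLED_CPS_URL
    # US states and all other cases defer to the downstream default.
    return None


# Example: a California run no longer names a dataset explicitly.
print(select_dataset(None, "us", "ca"))
```

The design point is that the API layer only names a dataset when it has a reason to override; every other caller passes None and relies on the default chosen lower in the stack.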

Files Modified:

  • policyengine_api/services/economy_service.py: Updated _setup_data() method
  • tests/unit/services/test_economy_service.py: Updated tests to reflect new behavior
  • changelog_entry.yaml: Added changelog entry

The PR effectively simplifies the dataset selection at the API level by delegating to policyengine.py for default dataset management, except for the NYC special case.

@anth-volk anth-volk merged commit 9a7bfc0 into master Dec 9, 2025
7 checks passed
@anth-volk anth-volk deleted the feature/state-specific-datasets branch December 9, 2025 16:28