-
Notifications
You must be signed in to change notification settings - Fork 10
Fix county assignment issues #467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fixes #466 1. Add INVALID_COUNTY_NAMES workaround for 65 bogus upstream enum entries - Excludes entries like DORCHESTER_COUNTY_DE until policyengine-us#7144 is fixed - Delaware now correctly has 3 counties (Kent, New Castle, Sussex) 2. Add zero-probability filter to make_county_cd_distributions.py - Filters 16 rows with probability=0.0 on CSV regeneration 3. Replace NYC binary filtering with probability-based weighting - Add get_county_filter_probability() and get_filtered_county_distribution() - Scale weights by P(target|CD) instead of dropping households - Assign only target counties using normalized distribution - Eliminates sample selection bias in city-level datasets 4. Add audit_county_enum.py for validating County enum against Census 2020 5. Add 19 tests for county assignment validation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Removed 14 false positives from INVALID_COUNTY_NAMES: - 10 Puerto Rico municipios with accented characters - 3 Maryland counties with apostrophes - Doña Ana County, NM These were incorrectly flagged due to Unicode encoding differences in the audit script. All 14 are valid Census 2020 entries. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
8eb5405 to
cac20ec
Compare
The invalid county entries have been removed from policyengine-us in PR #7145, so this workaround is no longer needed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
MaxGhenis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review
Thanks for tackling this - the probability-based weighting approach is a solid improvement over binary filtering, and the test coverage is good.
Questions / Concerns
1. Missing INVALID_COUNTY_NAMES?
The PR description says it adds INVALID_COUNTY_NAMES set (65 entries) to county_assignment.py, but I don't see this in the diff. The test test_delaware_has_exactly_3_counties expects DE to have exactly 3 counties (excluding DORCHESTER_COUNTY_DE), but _build_state_counties() doesn't appear to filter anything. Am I missing something, or is this incomplete?
2. Seed determinism
seed=42 + idxIf CDs are processed in different order (dict ordering changes, parallelization, etc.), results could differ. Consider using seed=42 + int(cd_geoid) for stability.
3. Zero-probability filtering
The diff adds filtering at Step 5b, but this only takes effect when make_county_cd_distributions.py is re-run. Is the regenerated CSV included in this PR, or does it need to be regenerated separately?
Alternative approaches to consider
-
Fix upstream first: Rather than maintaining a 65-entry workaround list, could prioritize getting policyengine-us#7145 merged to reduce technical debt.
-
Dynamic validation: Instead of hardcoding
INVALID_COUNTY_NAMES, could validate against Census data at load time (slower but always current). -
Use Census FIPS as source of truth: Build the CSV from Census FIPS codes directly, only mapping to County enum indices at assignment time. Avoids enum validation issues entirely.
What's good
- Probability-based weighting eliminates selection bias from the old binary filtering approach ✓
- Good test coverage (9 new tests) ✓
- The
audit_county_enum.pyscript is useful for documenting/tracking the upstream problem ✓ - Clear PR description and test plan ✓
The invalid county entries were removed upstream in policyengine-us#7145, released in v1.499.0. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changes seed from `42 + idx` to `seed + int(cd_geoid)` for order-independent reproducibility. Adds configurable base seed parameter (default 42). Addresses review feedback from @MaxGhenis. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
These county-CD pairs had zero population in Census block data and could never be selected. Matches the filtering logic in Step 5b of make_county_cd_distributions.py. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
Thanks for the review, @MaxGhenis!
Oh and the uv.lock was the reason the build failed last time. Ehh, still having some modal issues: Hopefully the tests pass this time. |
Replace pip with uv in Modal apps: - Image now only installs uv (deps come from lock file) - Use `uv sync --locked` to install exact pinned versions - Use `uv run` for all python/pytest commands This ensures Modal uses the same dependency versions as local development and CI, fixing the policyengine-us version mismatch. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
MaxGhenis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! The updates address my earlier concerns:
Seed determinism fixed - Now uses seed + int(cd_geoid) instead of seed + idx, making results order-independent.
CSV regenerated - Zero-probability entries removed.
Modal app modernized - Using uv sync --locked instead of inline pip installs.
Upstream fix approach - I notice there's no INVALID_COUNTY_NAMES set in the code, yet test_delaware_has_exactly_3_counties passes. This suggests policyengine-us 1.499.0 includes the upstream fix from PR #7145. Relying on the upstream fix is cleaner than maintaining a local workaround.
Minor nit: The PR description still mentions "INVALID_COUNTY_NAMES workaround" but this isn't in the code anymore - might want to update the description to reflect that it relies on the upstream fix instead.
Overall: Good to merge. ✅
Summary
Fixes #466
Three fixes for county assignment in local area calibration:
probability=0.0from CSV on regenerationTest plan
Related PRs
🤖 Generated with Claude Code