Summary
The H5 dataset creation functions in policyengine_us_data/datasets/cps/small_enhanced_cps.py save all variables indiscriminately, including calculated/derived variables that should never be cached. This causes:
- Uprating to fail silently - cached values uprate by the wrong factor
- Policy reforms to be blocked - cached values prevent formulas from running
- Model results to diverge by 2.1%+ - systematic errors in tax calculations
- Threshold effects to trigger incorrectly - income and thresholds uprate at different rates
Evidence
Diagnostic test run on sparse_enhanced_cps_2024.h5 reveals:
Total variables in H5: 3,259
Problematic cached variables:
Formulas (shouldn't be cached): 2,173
With adds (shouldn't be cached): 546
TOTAL PROBLEMATIC: 2,719 (83.4%)
Impact on uprating accuracy:
- Without filtering: 97.9% accuracy (2.1% discrepancies)
- With filtering: 100% accuracy (within floating-point precision)
- Exact repeated errors: $462.51 (7×), $231.25 (8×), $185.00 (10×) - threshold crossing effects
Root Cause
PolicyEngine distinguishes between three types of variables:
- True inputs (no formula, no adds, no subtracts)
  - Examples: employment_income_before_lsr, capital_gains
  - Action: MUST be saved to H5
- Calculated variables (have formula)
  - Examples: income_tax, adjusted_gross_income, capital_gains_tax
  - Action: MUST NEVER be cached - they recalculate from parameters
- Aggregate variables (have adds/subtracts but no formula)
  - Examples: employment_income, social_security, long_term_capital_gains
  - Action: MUST NEVER be cached - they recalculate from components
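The three-way classification above can be sketched as a single predicate that decides whether a variable belongs in the H5 file. This is a minimal sketch, not the project's actual fix: `VariableStub` and `is_true_input` are hypothetical names, and the stub only mimics the `formulas`, `adds` and `subtracts` attributes that the verification script in this issue reads from real variables.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VariableStub:
    """Hypothetical stand-in for a PolicyEngine variable definition."""

    name: str
    formulas: dict = field(default_factory=dict)  # non-empty when a formula exists
    adds: Optional[list] = None
    subtracts: Optional[list] = None


def is_true_input(variable) -> bool:
    """Return True only if the variable has no formula, no adds and no subtracts."""
    has_formula = len(variable.formulas) > 0
    has_adds = bool(variable.adds)
    has_subtracts = bool(variable.subtracts)
    return not (has_formula or has_adds or has_subtracts)


# Illustrative definitions only; the real ones live in the country model.
variables = [
    VariableStub("employment_income_before_lsr"),
    VariableStub("income_tax", formulas={"2024": object()}),
    VariableStub("employment_income", adds=["employment_income_before_lsr"]),
]

# Only true inputs would be written to the H5 file.
to_save = [v.name for v in variables if is_true_input(v)]
print(to_save)
```

Applying a check of this kind inside the save loop would skip calculated and aggregate variables, so they are recomputed at simulation time instead of being read back from the file.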
The current code saves all three types without filtering (lines 22-35 & 82-100):

    for variable in simulation.tax_benefit_system.variables:
        data[variable] = {}
        for time_period in simulation.get_holder(variable).get_known_periods():
            values = simulation.get_holder(variable).get_array(time_period)
            # ... saves to H5 without filtering

Policy Reform Blocking
When cached variables are returned from the H5 file, policy reforms cannot affect them because the formula never runs.
Code path in policyengine-core/simulations/simulation.py:630-632:

    # First look for a value already cached
    cached_array = holder.get_array(period, self.branch_name)
    if cached_array is not None:
        return cached_array  # ← Returns immediately, formula never executes

Example: A user applies a reform to change capital gains tax rates

    from policyengine_core.reforms import Reform
    from policyengine_us import Microsimulation

    # sparse_enhanced_cps_2024 refers to the dataset backed by sparse_enhanced_cps_2024.h5
    reform = Reform.from_dict({
        "gov.irs.capital_gains.rates": {"1": {"2024-01-01": 0.05}}
    })
    sim_reformed = Microsimulation(dataset=sparse_enhanced_cps_2024, reform=reform)
    result = sim_reformed.calculate("capital_gains_tax", period=2024)

What happens:
- The capital_gains_tax formula is NOT executed
- Instead, the cached 2024 value is returned immediately
- The reform changing capital gains rates has zero effect
- Result is identical to baseline (no reform)
Impact: Any policy analysis using this dataset will have broken reforms for the 2,719 cached variables. Users who apply a reform will see no change and may conclude that the reform has no effect, when in fact the cached values are preventing the formulas from running.
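The blocking behaviour can be reproduced with a toy model of the cache-first lookup. This is a simplified sketch, not policyengine-core code: `ToySimulation`, its `rate` argument and the $10,000 of gains are all made up for illustration.

```python
class ToySimulation:
    """Hypothetical, simplified model of a cache-first calculate()."""

    def __init__(self, rate, cached=None):
        self.rate = rate                  # policy parameter a reform would change
        self.cache = dict(cached or {})   # values loaded from the dataset file
        self.formula_calls = 0

    def calculate(self, variable, period):
        key = (variable, period)
        if key in self.cache:
            # Cached value wins; the formula below never runs.
            return self.cache[key]
        self.formula_calls += 1
        value = self._capital_gains_tax()
        self.cache[key] = value
        return value

    def _capital_gains_tax(self):
        capital_gains = 10_000.0  # made-up input amount
        return capital_gains * self.rate


# A baseline value stored in the file, computed at a 15% rate.
cached = {("capital_gains_tax", 2024): 1_500.0}

baseline = ToySimulation(rate=0.15, cached=cached)
reformed = ToySimulation(rate=0.05, cached=cached)   # reform applied, cache present
unblocked = ToySimulation(rate=0.05)                 # reform applied, nothing cached

baseline_tax = baseline.calculate("capital_gains_tax", 2024)
reformed_tax = reformed.calculate("capital_gains_tax", 2024)
unblocked_tax = unblocked.calculate("capital_gains_tax", 2024)

print(baseline_tax, reformed_tax, unblocked_tax)
print(reformed.formula_calls, unblocked.formula_calls)
```

In this sketch the reformed simulation returns the same value as the baseline and never calls its formula, while the simulation without a cached value picks up the new rate.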
Technical Details: Why This Breaks Uprating
When a user loads an H5 file and requests data for a future year (e.g., 2026), PolicyEngine's uprating system applies a single factor to cached values:

    if variable.uprating is not None and len(start_instants) > 0:
        uprating_factor = uprating_parameter(2026) / uprating_parameter(2024)
        array = cached_2024_value * uprating_factor

The problem with cached aggregates like employment_income:
- It's saved as the 2024 calculated output (sum of components)
- When asked for 2026, uprating applies ONE factor to the cached aggregate
- But components uprate differently than the aggregate
- Result: Misalignment between components and aggregates
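The misalignment described in the list above can be illustrated with made-up numbers. The component names, amounts and uprating factors below are hypothetical; the point is only that one factor applied to a cached sum generally differs from the sum of separately uprated components.

```python
# Hypothetical 2024 components of an aggregate variable.
components_2024 = {"component_a": 40_000.0, "component_b": 10_000.0}

# Hypothetical per-component uprating factors for 2024 -> 2026.
component_factors = {"component_a": 1.05, "component_b": 1.10}

# Hypothetical single factor attached to the aggregate itself.
aggregate_factor = 1.05

# What gets cached: the 2024 sum of components.
cached_aggregate_2024 = sum(components_2024.values())

# Path 1: the cached aggregate is uprated with ONE factor.
uprated_cached = cached_aggregate_2024 * aggregate_factor

# Path 2: each component is uprated, then the aggregate is recomputed.
recomputed = sum(
    amount * component_factors[name] for name, amount in components_2024.items()
)

gap = recomputed - uprated_cached
print(f"cached path: {uprated_cached:,.2f}")
print(f"recomputed:  {recomputed:,.2f}")
print(f"gap:         {gap:,.2f}")
```

With these illustrative inputs the two paths disagree by 500, which is the kind of component/aggregate mismatch the cached file introduces.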
Example: Capital Gains Threshold Effect
Capital gains thresholds uprate by CPI-U: 4.53%
qualified_dividend_income, if cached, uprates by: 12.01%
Person with $47,025 in dividends + capital gains:
- CPI uprating: $47,025 × 1.0453 = $49,155 (BELOW $49,450 threshold)
- Cached uprating: $47,025 × 1.1201 = $52,672 (crosses $49,450 threshold)
- Tax difference: CG moves from 0% to 15% bracket = $462.51 error
This explains the exact repeated discrepancies:
- $462.51 appearing 7 times across households
- $231.25 appearing 8 times
- Other exact amounts repeating - threshold crossing effects
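The threshold crossing in the dividend example above can be checked directly from the figures quoted in this issue. This is a small arithmetic sketch: the variable names are illustrative, and it only tests which side of the threshold each uprated amount lands on. It does not reproduce the tax calculation.

```python
# Figures taken from the example in this issue.
base_amount = 47_025.00   # dividends + capital gains in 2024
threshold = 49_450.00     # capital gains threshold quoted above
cpi_factor = 1.0453       # CPI-U uprating (4.53%)
cached_factor = 1.1201    # uprating applied to the cached value (12.01%)

cpi_uprated = base_amount * cpi_factor
cached_uprated = base_amount * cached_factor


def crosses(amount: float, limit: float) -> bool:
    """Return True if the amount is above the threshold."""
    return amount > limit


print(f"CPI uprating:    {cpi_uprated:,.2f} crosses={crosses(cpi_uprated, threshold)}")
print(f"Cached uprating: {cached_uprated:,.2f} crosses={crosses(cached_uprated, threshold)}")
```

Because the same household lands on opposite sides of the threshold depending on the factor used, every household in that position picks up the same fixed error, which is consistent with the repeated identical amounts.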
Affected Files
sparse_enhanced_cps_2024.h5 contains:
- 3,259 total variables
- 2,719 problematic variables (83.4%)
- 2,173 with formulas
- 546 with adds/subtracts
Specific problematic variables that break policy analysis:
- income_tax - all tax reforms ineffective
- capital_gains_tax - capital gains reforms ineffective
- employment_income - income reforms ineffective
- self_employment_income - self-employment reforms ineffective
- social_security - Social Security reforms ineffective
- Plus 2,714 other calculated variables...
Expected Outcome After Fix
Metrics
- Before: 2,719 problematic variables (83.4%)
- After: ~50 true input variables (<1% problematic)
Functionality
✓ Uprating will match to 100% accuracy
✓ Policy reforms will correctly affect all variables
✓ No more threshold crossing errors
✓ Model flexibility fully preserved
Verification
Run this after regenerating the dataset to confirm the fix worked:

    import h5py
    from policyengine_us import Microsimulation

    h5_path = "policyengine_us_data/storage/sparse_enhanced_cps_2024.h5"

    # Collect the names of all variables stored in the H5 file
    with h5py.File(h5_path, 'r') as f:
        h5_vars = set(f.keys())

    sim = Microsimulation()

    # Count stored variables that have a formula, adds, or subtracts
    problematic_count = 0
    for var_name in h5_vars:
        if var_name not in sim.tax_benefit_system.variables:
            continue
        var = sim.tax_benefit_system.variables[var_name]
        if len(var.formulas) > 0:
            problematic_count += 1
        elif var.adds and len(var.adds) > 0:
            problematic_count += 1
        elif var.subtracts and len(var.subtracts) > 0:
            problematic_count += 1

    print(f"Problematic variables: {problematic_count} (should be <100 after fix, currently 2,719)")

Impact
- Severity: High
- Scope: sparse_enhanced_cps_2024.h5
- Breaking change: Yes - requires regenerating the dataset
- Timeline: Should be fixed before next uprating or policy analysis release