Bug: H5 Dataset Creation Caches 2,719 Calculated Variables, Breaking Uprating and Policy Flexibility #444

@baogorek

Summary

The H5 dataset creation functions in policyengine_us_data/datasets/cps/small_enhanced_cps.py save all variables indiscriminately, including calculated/derived variables that should never be cached. This causes:

  1. Uprating to fail silently - cached values uprate by the wrong factor
  2. Policy reforms to be blocked - cached values prevent formulas from running
  3. Model results to diverge by 2.1%+ - systematic errors in tax calculations
  4. Threshold effects to trigger incorrectly - income and thresholds uprate at different rates

Evidence

A diagnostic test run on sparse_enhanced_cps_2024.h5 reveals:

Total variables in H5: 3,259
Problematic cached variables:
  Formulas (shouldn't be cached):          2,173
  With adds (shouldn't be cached):           546
  TOTAL PROBLEMATIC:                      2,719 (83.4%)

Impact on uprating accuracy:

  • Without filtering: 97.9% accuracy (2.1% discrepancies)
  • With filtering: 100% accuracy (within floating-point precision)
  • Exact repeated errors: $462.51 (7×), $231.25 (8×), $185.00 (10×) - threshold crossing effects

Root Cause

PolicyEngine distinguishes between three types of variables:

  1. True inputs (no formula, no adds, no subtracts)

    • Examples: employment_income_before_lsr, capital_gains
    • Action: MUST be saved to H5
  2. Calculated variables (have formula)

    • Examples: income_tax, adjusted_gross_income, capital_gains_tax
    • Action: MUST NEVER be cached - they recalculate from parameters
  3. Aggregate variables (have adds/subtracts but no formula)

    • Examples: employment_income, social_security, long_term_capital_gains
    • Action: MUST NEVER be cached - they recalculate from components

The current code saves all three types without filtering (lines 22-35 & 82-100):

for variable in simulation.tax_benefit_system.variables:
    data[variable] = {}
    for time_period in simulation.get_holder(variable).get_known_periods():
        values = simulation.get_holder(variable).get_array(time_period)
        # ... saves to H5 without filtering
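
One possible shape for the missing filter is sketched below. It is a minimal, self-contained illustration rather than the actual fix: `VariableSpec`, `is_true_input`, and `select_variables_to_save` are hypothetical names, and the dataclass only imitates the `formulas`, `adds`, and `subtracts` attributes described above. The component listed under `adds` for `employment_income` is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class VariableSpec:
    """Hypothetical stand-in for a PolicyEngine variable definition."""

    name: str
    formulas: dict = field(default_factory=dict)    # non-empty => calculated
    adds: list = field(default_factory=list)        # non-empty => aggregate
    subtracts: list = field(default_factory=list)   # non-empty => aggregate


def is_true_input(var: VariableSpec) -> bool:
    """True only when the variable has no formula, adds, or subtracts."""
    return not (var.formulas or var.adds or var.subtracts)


def select_variables_to_save(variables: dict) -> list:
    """Return the sorted names of variables that are safe to write to H5."""
    return sorted(name for name, var in variables.items() if is_true_input(var))


# Illustrative variable table covering the three types described above
variables = {
    "capital_gains": VariableSpec("capital_gains"),
    "employment_income_before_lsr": VariableSpec("employment_income_before_lsr"),
    "income_tax": VariableSpec("income_tax", formulas={"2024": object()}),
    "employment_income": VariableSpec(
        "employment_income", adds=["employment_income_before_lsr"]
    ),
}

to_save = select_variables_to_save(variables)
print(to_save)
```

In the real save loop, the same check would presumably run on each entry of `simulation.tax_benefit_system.variables` before its values are written, so that only true inputs reach the file.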

Policy Reform Blocking

When cached variables are returned from the H5 file, policy reforms cannot affect them because the formula never runs.

Code path in policyengine-core/simulations/simulation.py:630-632:

# First look for a value already cached
cached_array = holder.get_array(period, self.branch_name)
if cached_array is not None:
    return cached_array  # ← Returns immediately, formula never executes

Example: A user applies a reform to change capital gains tax rates

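from policyengine_core.reforms import Reform
from policyengine_us import Microsimulation

# sparse_enhanced_cps_2024 stands for the dataset built from sparse_enhanced_cps_2024.h5
reform = Reform.from_dict({
    "gov.irs.capital_gains.rates": {"1": {"2024-01-01": 0.05}}
})
sim_reformed = Microsimulation(dataset=sparse_enhanced_cps_2024, reform=reform)
result = sim_reformed.calculate("capital_gains_tax", period=2024)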

What happens:

  1. capital_gains_tax formula is NOT executed
  2. Instead, the cached 2024 value is returned immediately
  3. The reform changing capital gains rates has zero effect
  4. Result is identical to baseline (no reform)

Impact: Any policy analysis using this dataset silently ignores reforms that affect the 2,719 cached variables. Users see no change after applying a reform and may conclude that the reform itself has no effect, when in fact the cached values are preventing the formulas from running.
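
The short-circuit can be reproduced in a toy model. The sketch below is not PolicyEngine code: `ToySimulation`, its flat `rate` parameter, and the $1,000 input are all hypothetical, chosen only to show that a cache-first lookup returns the stale baseline value regardless of the reformed parameter.

```python
class ToySimulation:
    """Hypothetical, simplified model of cache-first calculation."""

    def __init__(self, rate, cached=None):
        self.rate = rate                      # stands in for a reformable parameter
        self.cache = dict(cached or {})       # stands in for values loaded from H5
        self.inputs = {"capital_gains": 1_000.0}

    def calculate(self, name):
        # Mirror the quoted code path: a cached value wins over the formula
        if name in self.cache:
            return self.cache[name]
        if name == "capital_gains_tax":
            return self.inputs["capital_gains"] * self.rate
        return self.inputs[name]


# Baseline run whose output ends up saved in the dataset
baseline_tax = ToySimulation(rate=0.15).calculate("capital_gains_tax")

# Reformed run on a dataset that cached the calculated variable
stale_tax = ToySimulation(
    rate=0.05, cached={"capital_gains_tax": baseline_tax}
).calculate("capital_gains_tax")

# Reformed run on a dataset that stores only true inputs
clean_tax = ToySimulation(rate=0.05).calculate("capital_gains_tax")

print(baseline_tax, stale_tax, clean_tax)
```

With the cached value present, the reformed result equals the baseline; without it, the formula runs and the lower rate takes effect.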


Technical Details: Why This Breaks Uprating

When a user loads an H5 file and requests data for a future year (e.g., 2026), PolicyEngine's uprating system applies a single factor to cached values:

if variable.uprating is not None and len(start_instants) > 0:
    uprating_factor = uprating_parameter(2026) / uprating_parameter(2024)
    array = cached_2024_value * uprating_factor

The problem with cached aggregates like employment_income:

  1. It's saved as the 2024 calculated output (sum of components)
  2. When asked for 2026, uprating applies ONE factor to the cached aggregate
  3. But components uprate differently than the aggregate
  4. Result: Misalignment between components and aggregates
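
The misalignment in the steps above can be illustrated with made-up numbers. In this sketch the component names, amounts, and growth factors are all hypothetical; the point is only that one factor applied to a cached sum generally differs from the sum of separately uprated components.

```python
# All names, amounts, and factors below are hypothetical
components_2024 = {"wages": 40_000.0, "tips": 10_000.0}
component_factors = {"wages": 1.05, "tips": 1.10}
aggregate_factor = 1.05  # single factor applied to the cached aggregate

# What a cached aggregate becomes when one factor is applied to it
cached_aggregate_2024 = sum(components_2024.values())
cached_aggregate_2026 = cached_aggregate_2024 * aggregate_factor

# What the aggregate should be when rebuilt from uprated components
recomputed_2026 = sum(
    amount * component_factors[name] for name, amount in components_2024.items()
)

misalignment = recomputed_2026 - cached_aggregate_2026
print(round(cached_aggregate_2026, 2), round(recomputed_2026, 2), round(misalignment, 2))
```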

Example: Capital Gains Threshold Effect

Capital gains thresholds uprate by CPI-U:        4.53%
qualified_dividend_income if cached uprates by:  12.01%

Person with $47,025 in dividends + capital gains:
- CPI uprating:  $47,025 × 1.0453 = $49,155 (BELOW $49,450 threshold)
- Cached uprating: $47,025 × 1.1201 = $52,673 (crosses $49,450 threshold)
- Tax difference: CG moves from 0% to 15% bracket = $462.51 error

This explains the exact repeated discrepancies:

  • $462.51 appearing 7 times across households
  • $231.25 appearing 8 times
  • Other exact amounts repeating - threshold crossing effects
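
The threshold arithmetic from the example above can be checked with a few lines. This is a sketch using only the figures quoted in the example; the helper `uprate` and the variable names are hypothetical, and the block checks only whether each uprated amount crosses the threshold, not the resulting tax.

```python
# Figures taken from the example above
base_amount = 47_025.00       # dividends + capital gains in 2024
threshold = 49_450.00         # uprated capital gains threshold
cpi_factor = 1.0453           # CPI-U uprating (4.53%)
cached_factor = 1.1201        # factor applied to the cached value (12.01%)


def uprate(amount: float, factor: float) -> float:
    """Apply a single uprating factor and round to cents."""
    return round(amount * factor, 2)


amount_cpi = uprate(base_amount, cpi_factor)
amount_cached = uprate(base_amount, cached_factor)

crosses_with_cpi = amount_cpi > threshold
crosses_with_cached = amount_cached > threshold

print(amount_cpi, crosses_with_cpi)
print(amount_cached, crosses_with_cached)
```

Only the cached path pushes the amount past the threshold, which is the kind of step change that can produce the same dollar error across several households.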

Affected Files

sparse_enhanced_cps_2024.h5 contains:

  • 3,259 total variables
  • 2,719 problematic variables (83.4%)
    • 2,173 with formulas
    • 546 with adds/subtracts

Specific problematic variables that break policy analysis:

  • income_tax - all tax reforms ineffective
  • capital_gains_tax - capital gains reforms ineffective
  • employment_income - income reforms ineffective
  • self_employment_income - self-employment reforms ineffective
  • social_security - Social Security reforms ineffective
  • Plus 2,714 other calculated variables...

Expected Outcome After Fix

Metrics

  • Before: 2,719 problematic variables (83.4%)
  • After: ~50 true input variables (<1% problematic)

Functionality

✓ Uprating will match to 100% accuracy
✓ Policy reforms will correctly affect all variables
✓ No more threshold crossing errors
✓ Model flexibility fully preserved


Verification

Run this after regenerating the dataset to confirm the fix worked:

import h5py
from policyengine_us import Microsimulation

h5_path = "policyengine_us_data/storage/sparse_enhanced_cps_2024.h5"
with h5py.File(h5_path, 'r') as f:
    h5_vars = set(f.keys())

sim = Microsimulation()
problematic_count = 0

for var_name in h5_vars:
    if var_name not in sim.tax_benefit_system.variables:
        continue
    var = sim.tax_benefit_system.variables[var_name]
    if len(var.formulas) > 0:
        problematic_count += 1
    elif var.adds and len(var.adds) > 0:
        problematic_count += 1
    elif var.subtracts and len(var.subtracts) > 0:
        problematic_count += 1

print(f"Problematic variables: {problematic_count} (should be <100 after fix, currently 2,719)")

Impact

  • Severity: High
  • Scope: sparse_enhanced_cps_2024.h5
  • Breaking change: Yes - requires regenerating the dataset
  • Timeline: Should be fixed before next uprating or policy analysis release
