Bug: H5 Dataset Creation Caches 2,719 Calculated Variables, Breaking Uprating and Policy Flexibility #444

## Summary

The H5 dataset creation functions in `policyengine_us_data/datasets/cps/small_enhanced_cps.py` save **all variables indiscriminately**, including calculated/derived variables that should never be cached. This causes:

1. **Uprating to fail silently** - cached values uprate by the wrong factor
2. **Policy reforms to be blocked** - cached values prevent formulas from running
3. **Model results to diverge by 2.1% or more** - systematic errors in tax calculations
4. **Threshold effects to trigger incorrectly** - income and thresholds uprate at different rates

### Evidence

**Diagnostic test run on `sparse_enhanced_cps_2024.h5` reveals:**
```
Total variables in H5: 3,259
Problematic cached variables:
  Formulas (shouldn't be cached):          2,173
  With adds (shouldn't be cached):           546
  TOTAL PROBLEMATIC:                      2,719 (83.4%)
```

**Impact on uprating accuracy:**
- Without filtering: 97.9% accuracy (2.1% discrepancies)
- With filtering: 100% accuracy (within floating-point precision)
- Identical error amounts recur across households: $462.51 (7×), $231.25 (8×), $185.00 (10×), which is consistent with threshold-crossing effects

---

## Root Cause

PolicyEngine distinguishes between three types of variables:

1. **True inputs** (no formula, no `adds`, no `subtracts`)
   - Examples: `employment_income_before_lsr`, `capital_gains`
   - Action: MUST be saved to H5
   
2. **Calculated variables** (have `formula`)
   - Examples: `income_tax`, `adjusted_gross_income`, `capital_gains_tax`
   - Action: MUST NEVER be cached - they recalculate from parameters
   
3. **Aggregate variables** (have `adds`/`subtracts` but no formula)
   - Examples: `employment_income`, `social_security`, `long_term_capital_gains`
   - Action: MUST NEVER be cached - they recalculate from components

The current code saves **all three types** without filtering (lines 22-35 and 82-100 of `small_enhanced_cps.py`):

```python
for variable in simulation.tax_benefit_system.variables:
    data[variable] = {}
    for time_period in simulation.get_holder(variable).get_known_periods():
        values = simulation.get_holder(variable).get_array(time_period)
        # ... saves to H5 without filtering
```
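One possible shape for the filter is sketched below. The helper names `is_true_input` and `select_input_variables` are hypothetical, the attribute names (`formulas`, `adds`, `subtracts`) follow the verification script later in this issue, and the stand-in objects (including the illustrative `adds` list) only mimic those attributes so the sketch runs without PolicyEngine installed.

```python
from types import SimpleNamespace


def is_true_input(variable) -> bool:
    """Return True when a variable has no formula, no `adds`, and no `subtracts`."""
    has_formula = len(getattr(variable, "formulas", None) or {}) > 0
    has_adds = bool(getattr(variable, "adds", None))
    has_subtracts = bool(getattr(variable, "subtracts", None))
    return not (has_formula or has_adds or has_subtracts)


def select_input_variables(variables: dict) -> list:
    """Return the sorted names of variables that are safe to write to the H5 file."""
    return sorted(name for name, var in variables.items() if is_true_input(var))


# Stand-ins for the three variable types described above (illustrative only).
example_variables = {
    "capital_gains": SimpleNamespace(formulas={}, adds=None, subtracts=None),
    "income_tax": SimpleNamespace(formulas={"2024": object()}, adds=None, subtracts=None),
    "employment_income": SimpleNamespace(
        formulas={}, adds=["employment_income_before_lsr"], subtracts=None
    ),
}

print(select_input_variables(example_variables))
```

In the export loop, the same predicate could be used as an early `continue`, for example `if not is_true_input(simulation.tax_benefit_system.variables[variable]): continue`, so that only true inputs reach the H5 file.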

---

## Policy Reform Blocking

When a calculated variable is cached in the H5 file, policy reforms **cannot affect it**: the simulation returns the cached array and the formula never runs.

**Code path in `policyengine-core/simulations/simulation.py:630-632`:**

```python
# First look for a value already cached
cached_array = holder.get_array(period, self.branch_name)
if cached_array is not None:
    return cached_array  # ← Returns immediately, formula never executes
```

**Example: A user applies a reform to change capital gains tax rates**

```python
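from policyengine_core.reforms import Reform
from policyengine_us import Microsimulation

# Path to the affected dataset (same file used in the Verification section)
dataset_path = "policyengine_us_data/storage/sparse_enhanced_cps_2024.h5"

reform = Reform.from_dict({
    "gov.irs.capital_gains.rates": {"1": {"2024-01-01": 0.05}}
})
sim_reformed = Microsimulation(dataset=dataset_path, reform=reform)
result = sim_reformed.calculate("capital_gains_tax", period=2024)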
```

**What happens:**
1. `capital_gains_tax` formula is NOT executed
2. Instead, the cached 2024 value is returned immediately
3. The reform changing capital gains rates has **zero effect**
4. Result is identical to baseline (no reform)

**Impact:** Any policy analysis using this dataset will silently ignore reforms that touch the 2,719 cached variables. Users will see no change after applying a reform and may conclude that the reform has no effect, when in fact the dataset is preventing the formula from running.
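The cache-first behaviour can be illustrated with a toy model. This is a minimal sketch, not the PolicyEngine API: `ToySimulation`, its `rate` parameter, and the dollar amounts are all hypothetical stand-ins for a simulation that checks its cache before running a formula.

```python
from typing import Optional


class ToySimulation:
    """Hypothetical stand-in for a cache-first calculation (not PolicyEngine code)."""

    def __init__(self, rate: float, cached: Optional[dict] = None):
        self.rate = rate                  # the "policy parameter" a reform would change
        self.cache = dict(cached or {})   # values preloaded from a dataset
        self.formula_calls = 0            # counts how often the formula actually runs

    def _capital_gains_tax_formula(self, gains: float) -> float:
        self.formula_calls += 1
        return gains * self.rate

    def calculate(self, name: str, gains: float) -> float:
        # Cache-first lookup: a preloaded value short-circuits the formula.
        if name in self.cache:
            return self.cache[name]
        value = self._capital_gains_tax_formula(gains)
        self.cache[name] = value
        return value


GAINS = 10_000.0

# Baseline run at a 15% rate; its output is then "saved to the dataset".
baseline = ToySimulation(rate=0.15)
baseline_tax = baseline.calculate("capital_gains_tax", GAINS)
dataset_cache = {"capital_gains_tax": baseline_tax}

# Reform lowers the rate to 5%, once with the cached output and once without.
reformed_with_cache = ToySimulation(rate=0.05, cached=dataset_cache)
reformed_clean = ToySimulation(rate=0.05)

tax_with_cache = reformed_with_cache.calculate("capital_gains_tax", GAINS)
tax_clean = reformed_clean.calculate("capital_gains_tax", GAINS)

print(f"baseline: {baseline_tax:.2f}")
print(f"reform with cached output: {tax_with_cache:.2f}")
print(f"reform without cached output: {tax_clean:.2f}")
```

With the cached output preloaded, the reformed run returns the baseline value and never calls the formula; without it, the lower rate takes effect.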

---

## Technical Details: Why This Breaks Uprating

When a user loads an H5 file and requests data for a future year (e.g., 2026), PolicyEngine's uprating system applies a single factor to cached values:

```python
if variable.uprating is not None and len(start_instants) > 0:
    uprating_factor = uprating_parameter(2026) / uprating_parameter(2024)
    array = cached_2024_value * uprating_factor
```

**The problem with cached aggregates like `employment_income`:**

1. It's saved as the 2024 **calculated output** (sum of components)
2. When asked for 2026, uprating applies ONE factor to the cached aggregate
3. But components uprate differently than the aggregate
4. Result: Misalignment between components and aggregates
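The misalignment in the steps above can be sketched with made-up numbers. Everything here is hypothetical: the component names, the 2024 amounts, and the uprating factors are chosen only to make the gap visible and are not taken from PolicyEngine parameters.

```python
# Hypothetical 2024 components of an aggregate and their own uprating factors.
components_2024 = {"wages": 40_000.0, "tips": 10_000.0}
component_factors = {"wages": 1.08, "tips": 1.03}

# Hypothetical single factor applied to the cached aggregate.
aggregate_factor = 1.08

# The aggregate as it would be cached in 2024: the sum of its components.
cached_aggregate_2024 = sum(components_2024.values())

# Path 1: uprate the cached aggregate with one factor.
uprated_cached = cached_aggregate_2024 * aggregate_factor

# Path 2: uprate each component with its own factor, then re-sum.
recomputed = sum(
    amount * component_factors[name] for name, amount in components_2024.items()
)

gap = uprated_cached - recomputed
print(f"uprated cached aggregate: {uprated_cached:,.2f}")
print(f"re-summed uprated components: {recomputed:,.2f}")
print(f"gap: {gap:,.2f}")
```

Whenever the components do not all share the aggregate's factor, the two paths disagree, and anything downstream that reads the cached aggregate inherits the gap.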

**Example: Capital Gains Threshold Effect**

```
Capital gains thresholds uprate by CPI-U:        4.53%
qualified_dividend_income if cached uprates by:  12.01%

Person with $47,025 in dividends + capital gains:
- CPI uprating:    $47,025 × 1.0453 = $49,155 (stays below the $49,450 threshold)
- Cached uprating: $47,025 × 1.1201 = $52,673 (crosses the $49,450 threshold)
- Tax difference: capital gains move from the 0% to the 15% bracket = $462.51 error
```

This explains why identical discrepancies recur across households:
- $462.51 appearing 7 times
- $231.25 appearing 8 times
- $185.00 appearing 10 times

Each repeated amount corresponds to households crossing the same threshold.
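The threshold crossing in the example above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in this issue; the function name `crosses_threshold` is hypothetical, and the sketch does not attempt to reproduce the $462.51 figure, which depends on the full tax calculation.

```python
BASE_INCOME = 47_025.00      # dividends + capital gains in 2024 (from the example)
THRESHOLD = 49_450.00        # capital gains bracket threshold quoted in the example
CPI_FACTOR = 1.0453          # 4.53% uprating
CACHED_FACTOR = 1.1201       # 12.01% uprating applied to the cached value


def crosses_threshold(income: float, factor: float, threshold: float) -> bool:
    """Return True when the uprated income ends up above the threshold."""
    return income * factor > threshold


cpi_income = BASE_INCOME * CPI_FACTOR
cached_income = BASE_INCOME * CACHED_FACTOR

print(f"CPI-uprated income:    {cpi_income:,.2f}")
print(f"Cached-uprated income: {cached_income:,.2f}")
print(f"Amount above threshold under cached uprating: {cached_income - THRESHOLD:,.2f}")
```

The same person falls on opposite sides of the threshold depending only on which uprating factor is applied.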

---

## Affected Files

`sparse_enhanced_cps_2024.h5` contains:
- **3,259 total variables**
- **2,719 problematic variables (83.4%)**
  - 2,173 with formulas
  - 546 with adds/subtracts

**Specific problematic variables that break policy analysis:**
- `income_tax` - all tax reforms ineffective
- `capital_gains_tax` - capital gains reforms ineffective
- `employment_income` - income reforms ineffective
- `self_employment_income` - self-employment reforms ineffective
- `social_security` - Social Security reforms ineffective
- Plus 2,714 other calculated variables...

---

## Expected Outcome After Fix

### Metrics
- **Before:** 2,719 problematic variables (83.4%)
- **After:** only true input variables are stored (~50), leaving <1% problematic

### Functionality
- ✓ Uprating will match to 100% accuracy
- ✓ Policy reforms will correctly affect all variables
- ✓ No more threshold crossing errors
- ✓ Model flexibility fully preserved

---

## Verification

Run this after regenerating the dataset to confirm the fix worked:

```python
import h5py
from policyengine_us import Microsimulation

h5_path = "policyengine_us_data/storage/sparse_enhanced_cps_2024.h5"

# Collect the name of every variable stored in the H5 file
with h5py.File(h5_path, "r") as f:
    h5_vars = set(f.keys())

sim = Microsimulation()
variables = sim.tax_benefit_system.variables
problematic_count = 0

for var_name in h5_vars:
    # Skip H5 keys that are not variables in the tax-benefit system
    if var_name not in variables:
        continue
    var = variables[var_name]
    # A stored variable is problematic if it has a formula, `adds`, or `subtracts`
    if len(var.formulas) > 0 or var.adds or var.subtracts:
        problematic_count += 1

print(f"Problematic variables: {problematic_count} (should be <100 after fix, currently 2,719)")
```

---

## Impact

- **Severity:** High
- **Scope:** `sparse_enhanced_cps_2024.h5`
- **Breaking change:** Yes - requires regenerating the dataset
- **Timeline:** Should be fixed before next uprating or policy analysis release
