Add safe H5 export API to protect against stale/pseudo-input variable corruption #418

@baogorek

Problem

When creating H5 files from manipulated simulations (e.g., state-swapping households for geographic calibration), users can inadvertently save variables that corrupt calculations on reload. The current to_input_dataframe() method and sim.input_variables property don't protect against several pitfalls we've discovered:

1. Pseudo-input variables (see #417)

Variables that have no formula of their own but use adds/subtracts to aggregate formula-based components appear in sim.input_variables, yet hold stale pre-computed values. When saved and reloaded, these stored values override the formula calculations.
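
To make the pseudo-input condition concrete, here is a minimal sketch using a toy stand-in for a variable definition. `ToyVariable`, `is_pseudo_input`, and the variable names are hypothetical and only illustrate the rule described above; they are not the PolicyEngine classes or the helpers in policyengine-us-data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToyVariable:
    """Hypothetical stand-in for a variable definition (not the real class)."""
    name: str
    formula: Optional[Callable] = None
    adds: List[str] = field(default_factory=list)
    subtracts: List[str] = field(default_factory=list)


def is_computed(var: ToyVariable) -> bool:
    # A variable is computed if it has a formula or aggregates other variables.
    return var.formula is not None or bool(var.adds) or bool(var.subtracts)


def is_pseudo_input(var: ToyVariable, registry: Dict[str, ToyVariable]) -> bool:
    # Pseudo-input: no formula of its own, but at least one aggregated
    # component is itself computed, so a stored value would go stale.
    if var.formula is not None:
        return False
    components = var.adds + var.subtracts
    return any(is_computed(registry[c]) for c in components if c in registry)


registry = {
    v.name: v
    for v in [
        ToyVariable("wages"),                               # plain input
        ToyVariable("benefit_a", formula=lambda: 0),        # formula-based
        ToyVariable("total_benefits", adds=["benefit_a"]),  # pseudo-input
        ToyVariable("total_wages", adds=["wages"]),         # sums inputs only
    ]
}

pseudo_inputs = sorted(
    name for name, var in registry.items() if is_pseudo_input(var, registry)
)
print(pseudo_inputs)
```

In this toy registry only `total_benefits` is flagged, because it is the only formula-less variable that aggregates a formula-based component.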

2. Stale calculated variables

If you change an input (like state_fips for geographic relocation) but don't manually clear the cache with sim.delete_arrays(), calculated variables retain their old values, and those old values are what get saved.
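
The stale-cache behavior can be reproduced with a toy caching class. This is only a sketch: `ToySimulation`, `set_input`, and the rate table are hypothetical (the rates are illustrative numbers, not real tax rates), and the `delete_arrays` method here merely mimics the role that sim.delete_arrays() plays in the description above.

```python
class ToySimulation:
    """Hypothetical miniature of a caching simulation, not PolicyEngine's class."""

    # Formula-based variables, each computed from other variables on demand.
    _formulas = {
        "state_tax_rate": lambda sim: {6: 0.09, 48: 0.0}[sim.calculate("state_fips")],
    }

    def __init__(self, inputs):
        self.inputs = dict(inputs)
        self._cache = {}

    def calculate(self, name):
        if name in self.inputs:
            return self.inputs[name]
        if name not in self._cache:
            self._cache[name] = self._formulas[name](self)
        return self._cache[name]

    def set_input(self, name, value):
        # Deliberately does NOT invalidate dependents, mirroring the pitfall.
        self.inputs[name] = value

    def delete_arrays(self, name):
        # Drop the cached value so the next calculate() recomputes it.
        self._cache.pop(name, None)


sim = ToySimulation({"state_fips": 6})
before = sim.calculate("state_tax_rate")  # computed and cached

sim.set_input("state_fips", 48)           # "relocate" the household
stale = sim.calculate("state_tax_rate")   # still the cached value

sim.delete_arrays("state_tax_rate")
fresh = sim.calculate("state_tax_rate")   # recomputed from the new input

print(before, stale, fresh)
```

Until the cache is cleared, `stale` equals `before` even though the input changed; exporting at that point would write the old value.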

3. No built-in identification of "true inputs"

Users must reimplement logic to identify variables with formulas/adds/subtracts. The Variable.is_input_variable() method exists but isn't exposed in a way that helps with safe exports.

4. Entity ID sensitivity

PolicyEngine's random() uses entity IDs as seeds, so variables that depend on random draws are sensitive to any change in those IDs. Users need to know which variables to preserve vs. regenerate.
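
A small sketch of why ID-seeded randomness is sensitive to renumbering. The seeding scheme below (`entity_id * 1000 + salt`) and the function name are assumptions made for illustration; this is not how PolicyEngine's random() is actually implemented.

```python
import random


def draw_for_entity(entity_id: int, salt: int = 0) -> float:
    # Hypothetical scheme: derive a deterministic seed from the entity ID.
    rng = random.Random(entity_id * 1000 + salt)
    return rng.random()


original_draw = draw_for_entity(101)
repeated_draw = draw_for_entity(101)    # same ID, same draw
renumbered_draw = draw_for_entity(202)  # same household under a new ID

print(original_draw == repeated_draw)
print(original_draw == renumbered_draw)
```

Under this assumption, keeping IDs fixed reproduces the draw exactly, while renumbering an entity gives it a different draw, so any saved variable derived from the old draw no longer matches what a reload would compute.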

Current Workarounds

In policyengine-us-data, we've had to implement:

  • get_calculated_variables(sim) - identifies variables with formulas/adds/subtracts
  • get_pseudo_input_variables(sim) - identifies pseudo-inputs that shouldn't be saved
  • Manual cache invalidation after changing geographic variables
  • Manual filtering of input_variables before saving

Proposed Solution

Add a safe H5 export API to Simulation that:

  1. Identifies true inputs: Uses Variable.is_input_variable() plus pseudo-input detection
  2. Warns about state changes: If geographic variables changed since load, warn that calculated variables may be stale
  3. Provides a "clean export" mode: Only exports variables safe to reload without corruption
  4. Documents the pitfalls: Clear documentation about what can go wrong when manipulating simulations before saving

This could be a new method like to_safe_h5() or improvements to the existing export functionality.
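
As a rough sketch of what the "clean export" and warning behavior could look like, the snippet below filters a dictionary of values down to true inputs and warns when geographic inputs differ from their values at load time. Everything here is hypothetical: `ToyVariable`, `clean_export`, and its parameters are illustrative names, the function returns a plain dict rather than writing an H5 file, and it makes the conservative assumption that any variable with a formula or adds/subtracts is skipped and recomputed on reload.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToyVariable:
    """Hypothetical stand-in for a variable definition (not the real class)."""
    name: str
    formula: Optional[Callable] = None
    adds: List[str] = field(default_factory=list)
    subtracts: List[str] = field(default_factory=list)

    def is_true_input(self) -> bool:
        # Conservative rule (an assumption): no formula and no aggregation.
        return self.formula is None and not self.adds and not self.subtracts


def clean_export(
    variables: Dict[str, ToyVariable],
    values: Dict[str, Any],
    geography_at_load: Dict[str, Any],
    geography_now: Dict[str, Any],
) -> Dict[str, Any]:
    # Warn if any geographic input differs from its value at load time.
    changed = sorted(
        key for key, value in geography_now.items()
        if value != geography_at_load.get(key)
    )
    if changed:
        warnings.warn(
            f"Geographic inputs changed since load: {changed}. "
            "Calculated variables may be stale and are not exported."
        )
    # Keep only true inputs that actually have stored values.
    return {
        name: values[name]
        for name, var in variables.items()
        if var.is_true_input() and name in values
    }


variables = {
    v.name: v
    for v in [
        ToyVariable("state_fips"),
        ToyVariable("employment_income"),
        ToyVariable("income_tax", formula=lambda: 0),
        ToyVariable("total_income", adds=["employment_income"]),
    ]
}
values = {
    "state_fips": [48, 48],
    "employment_income": [50_000, 30_000],
    "income_tax": [4_000, 2_000],       # cached result, possibly stale
    "total_income": [50_000, 30_000],   # aggregate, recomputable on reload
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    exported = clean_export(
        variables,
        values,
        geography_at_load={"state_fips": [6, 6]},
        geography_now={"state_fips": [48, 48]},
    )

print(sorted(exported))
print(len(caught))
```

A real implementation would need the pseudo-input detection from #417 rather than this blanket exclusion, and would have to decide whether the warning should instead trigger automatic cache invalidation.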
