Description
Problem
When creating H5 files from manipulated simulations (e.g., state-swapping households for geographic calibration), users can inadvertently save variables that corrupt calculations on reload. The current to_input_dataframe() method and sim.input_variables property don't protect against several pitfalls we've discovered:
1. Pseudo-input variables (see #417)
Variables with adds/subtracts that aggregate formula-based components appear in sim.input_variables but contain stale pre-computed values. When saved and reloaded, these override the formula calculations.
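As a rough illustration of the detection logic, the sketch below flags variables that have no formula of their own but aggregate formula-based components. It does not use the real Variable class: VarSpec is a hypothetical stand-in, is_calculated and is_pseudo_input are hypothetical helpers, the variable names are made up, and adds/subtracts are assumed to be plain lists of variable names with no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class VarSpec:
    """Hypothetical stand-in for a variable definition (not the real Variable class)."""
    name: str
    has_formula: bool = False
    adds: list = field(default_factory=list)       # assumed: list of variable names
    subtracts: list = field(default_factory=list)  # assumed: list of variable names


def is_calculated(var, registry):
    """True if the variable has a formula or aggregates any calculated component."""
    if var.has_formula:
        return True
    components = list(var.adds) + list(var.subtracts)
    # Assumes the adds/subtracts graph is acyclic; unknown names are skipped.
    return any(
        name in registry and is_calculated(registry[name], registry)
        for name in components
    )


def is_pseudo_input(var, registry):
    """True if the variable has no formula but depends on calculated components."""
    return not var.has_formula and is_calculated(var, registry)


# Made-up variables for illustration only.
specs = [
    VarSpec("employment_income"),
    VarSpec("income_tax", has_formula=True),
    VarSpec("benefits", has_formula=True),
    VarSpec("net_income", adds=["employment_income", "benefits"],
            subtracts=["income_tax"]),
]
registry = {spec.name: spec for spec in specs}
pseudo_inputs = [name for name, spec in registry.items()
                 if is_pseudo_input(spec, registry)]
```

Here only net_income is flagged: it has no formula, yet two of its components are calculated, so a saved value for it would go stale whenever those components change.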
2. Stale calculated variables
If you change an input (like state_fips for geographic relocation) but don't manually clear the cache with sim.delete_arrays(), calculated variables retain old values.
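The staleness problem can be reproduced with a toy cache. The sketch below is not PolicyEngine code: ToySim is a hypothetical stand-in whose delete_arrays only mirrors the name of the method mentioned above (the real signature may differ), set_input_and_invalidate is a hypothetical helper, and the tax rates are invented.

```python
class ToySim:
    """Hypothetical stand-in for a simulation with a per-variable cache."""

    def __init__(self, inputs):
        self.inputs = dict(inputs)
        self.cache = {}

    def calculate(self, name):
        if name in self.inputs:
            return self.inputs[name]
        if name not in self.cache:
            # Toy formula with invented rates: depends on state_fips only.
            self.cache[name] = [0.05 if s == 6 else 0.0
                                for s in self.inputs["state_fips"]]
        return self.cache[name]

    def set_input(self, name, values):
        # Deliberately does NOT touch the cache, to show the pitfall.
        self.inputs[name] = list(values)

    def delete_arrays(self, name):
        self.cache.pop(name, None)


def set_input_and_invalidate(sim, name, values, calculated_names):
    """Change an input, then drop every cached calculated variable."""
    sim.set_input(name, values)
    for calc in calculated_names:
        sim.delete_arrays(calc)


sim = ToySim({"state_fips": [6, 48]})
sim.calculate("state_tax")                 # cached under the original states
sim.set_input("state_fips", [48, 48])
stale = sim.calculate("state_tax")         # still the old cached values
set_input_and_invalidate(sim, "state_fips", [48, 48], ["state_tax"])
fresh = sim.calculate("state_tax")         # recomputed from the new states
```

The value in stale still reflects the first household's original state, which is exactly what would be written to the H5 file if the cache were not cleared first.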
3. No built-in identification of "true inputs"
Users must reimplement logic to identify variables with formulas/adds/subtracts. The Variable.is_input_variable() method exists but isn't exposed in a way that helps with safe exports.
4. Entity ID sensitivity
PolicyEngine's random() uses entity IDs as seeds, so renumbering or regenerating IDs before export can change the random draws on reload. Users need to know which variables to preserve vs. regenerate.
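To show why ID-based seeding makes exports sensitive to entity IDs, here is a small illustration using Python's standard random module. It is not PolicyEngine's random() implementation; seeded_draw is a hypothetical helper that only assumes each draw is a deterministic function of the entity ID.

```python
import random


def seeded_draw(entity_id):
    """Illustrative only: one uniform draw determined solely by the entity ID."""
    return random.Random(entity_id).random()


original_ids = [101, 102, 103]
renumbered_ids = [0, 1, 2]  # same households, IDs reassigned during export

draws_original = [seeded_draw(i) for i in original_ids]
draws_reloaded = [seeded_draw(i) for i in original_ids]
draws_renumbered = [seeded_draw(i) for i in renumbered_ids]
```

With IDs preserved, the draws are reproduced exactly; with IDs renumbered, the same households get different draws, so any variable derived from them changes on reload.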
Current Workarounds
In policyengine-us-data, we've had to implement:
- get_calculated_variables(sim) - identifies variables with formulas/adds/subtracts
- get_pseudo_input_variables(sim) - identifies pseudo-inputs that shouldn't be saved
- Manual cache invalidation after changing geographic variables
- Manual filtering of input_variables before saving
Proposed Solution
Add a safe H5 export API to Simulation that:
- Identifies true inputs: Uses Variable.is_input_variable() plus pseudo-input detection
- Warns about state changes: If geographic variables changed since load, warn that calculated variables may be stale
- Provides a "clean export" mode: Only exports variables safe to reload without corruption
- Documents the pitfalls: Clear documentation about what can go wrong when manipulating simulations before saving
This could be a new method like to_safe_h5() or improvements to the existing export functionality.
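One possible shape for the selection step behind such a method is sketched below. Everything here is hypothetical: plan_safe_export, its parameters, and the default geographic_names are illustrative, values are assumed to be plain Python lists (array types would need an element-wise comparison), and the actual H5 writing is left out.

```python
import warnings


def plan_safe_export(input_names, calculated_names, pseudo_input_names,
                     loaded_values, current_values,
                     geographic_names=("state_fips",)):
    """Return the variable names that look safe to write, warning on geographic changes."""
    # Compare geographic inputs as loaded against their current values.
    changed = [
        name for name in geographic_names
        if loaded_values.get(name) != current_values.get(name)
    ]
    if changed:
        warnings.warn(
            f"Geographic inputs changed since load: {changed}; "
            "cached calculated variables may be stale.",
            stacklevel=2,
        )
    # Keep only inputs that are neither calculated nor pseudo-inputs.
    unsafe = set(calculated_names) | set(pseudo_input_names)
    return [name for name in input_names if name not in unsafe]


# Usage with made-up variable names.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    safe = plan_safe_export(
        input_names=["age", "state_fips", "net_income"],
        calculated_names=["income_tax"],
        pseudo_input_names=["net_income"],
        loaded_values={"state_fips": [6, 48]},
        current_values={"state_fips": [48, 48]},
    )
```

In this example the pseudo-input net_income is dropped from the export list and one warning is raised because state_fips differs from its loaded value.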
Related
- PolicyEngine has two different concepts of "input variable" that can diverge (#417): pseudo-input variables with adds/subtracts can corrupt H5 exports