Skip to content

Conversation

@baogorek
Copy link
Collaborator

@baogorek baogorek commented Jan 7, 2026

Summary

  • Add Modal serverless compute integration to replace self-hosted GCP runners
  • Create modal_app/data_build.py for dataset builds (32GB, 8 CPU)
  • Create modal_app/local_area.py for local area publishing (8GB, 4 CPU)
  • Update GitHub workflows to trigger Modal instead of running on self-hosted

Closes #464

Test plan

  • Tested modal run modal_app/data_build.py --test-lite --no-upload successfully
  • Verify GitHub Actions can trigger Modal builds
  • Run full data build with upload on Modal
  • Run local area publishing on Modal

🤖 Generated with Claude Code

baogorek and others added 8 commits January 7, 2026 14:19
Replace self-hosted GCP runners with Modal serverless compute for
data builds and local area publishing workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The --upgrade flag causes platform-specific wheel differences between
local and CI environments. Using --locked just validates consistency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Modal CLI uses --flag/--no-flag syntax, not --flag=value.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Use github.head_ref for PRs (falls back to github.ref_name for pushes).
github.ref_name returns '465/merge' for PRs which isn't a valid branch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Write GOOGLE_APPLICATION_CREDENTIALS_JSON to temp file for google.auth.default()
- Increase data build timeout from 2h to 4h

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add local area calibration dataset builds to Modal
- Add local area calibration tests to Modal
- Add main pytest suite to Modal
- Only run basic tests on GitHub runner when full_suite=false
- Install dev dependencies for pytest

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The local area calibration scripts should run WITHOUT TEST_LITE,
matching the original workflow where only the main data build
had the TEST_LITE env var set. This fixes the size mismatch error
where main CPS was built with TEST_LITE (28k tax units) but local
area expected full size (76k tax units).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek baogorek requested a review from MaxGhenis January 7, 2026 22:31
@MaxGhenis
Copy link
Contributor

PR Review

✅ Overall Assessment: APPROVE

This is a solid infrastructure PR migrating from self-hosted GCP runners to Modal serverless compute. CI passes and the core implementation looks correct.

🟡 Should Address

  1. Duplicate dependency lists - modal_app/data_build.py and modal_app/local_area.py both have nearly identical pip_install() lists (lines 17-38 in both files). Consider extracting to a shared constant or generating from pyproject.toml to prevent drift.

  2. Unused volume - data_volume is mounted at /data but all work happens in /root/policyengine-us-data. Either use the volume for caching/artifacts or remove it.

  3. Hardcoded dependency versions - The Modal image pins versions like policyengine-us>=1.353.0 that could drift from the actual project dependencies.

🟢 Suggestions

  1. Long timeout on local_area.py - 24-hour timeout (86400s) is very long. If this is intentional (checkpointing across many areas), document why; otherwise consider reducing.

  2. Error handling - subprocess.run(..., check=True) will raise on failure but the error context may be limited. Consider wrapping with try/except for better error messages.

💰 Cost Validation

Based on current Modal pricing:

  • data_build.py (32GB, 8 CPU): ~$0.63/hr → ~$2.50 for 4hr max runtime
  • local_area.py (8GB, 4 CPU): ~$0.25/hr

This is likely cheaper than self-hosted GCP runners with the benefit of no infrastructure maintenance.


Validation Summary

Check Result
CI Status ✅ All passing
Code Patterns Minor duplication
Security ✅ Using Modal secrets properly
Resource Config ✅ Reasonable specs

🤖 Generated with Claude Code

- Remove data_volume which was mounted at /data but never used
- Document 24h timeout in local_area.py (processes all states/districts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek
Copy link
Collaborator Author

baogorek commented Jan 8, 2026

Here are the decisions behind what's running now:

● What we did

  | Change                    | Risk | Rationale                                                                                                       |
  |---------------------------|------|-----------------------------------------------------------------------------------------------------------------|
  | Remove unused data_volume | Low  | Mounted at /data but all work happens in /root/policyengine-us-data. Dead code.                                 |
  | Add timeout comment       | None | 24-hour timeout looks like a mistake without context. Now clear it's intentional for 50 states + 435 districts. |

  What we skipped

  | Suggestion                    | Why skip                                                                                                                                                                                                           |
  |-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | Dependency list deduplication | The explicit pip_install() list is just for Modal image caching. The real dependencies come from pip install -e . which reads pyproject.toml. Drift in the cache list causes slower builds, not functional issues. |
  | Hardcoded version alignment   | Already in sync with pyproject.toml. And again, runtime uses the -e . install as source of truth.                                                                                                                  |
  | Error handling wrapper        | subprocess.run(..., check=True) already raises CalledProcessError with command and returncode. Modal captures stdout/stderr. Extra try/except adds boilerplate without meaningful benefit.                         |

  Guiding principle: Fix things that are clearly wrong (dead code, missing documentation) but don't add complexity to solve problems that don't exist in practice.

@MaxGhenis MaxGhenis merged commit af124b7 into main Jan 9, 2026
7 checks passed
@baogorek baogorek deleted the modal branch January 9, 2026 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate CI/CD from GCP self-hosted runners to Modal

3 participants