
Conversation

@huddlej (Contributor) commented Dec 13, 2025

Description of proposed changes

The main goal of this PR is to automatically fit the MLR to more granular amino acid haplotypes based on current clade annotation and all HA1 substitutions from that parent clade. This finer granularity will allow us to detect new haplotypes that we should be tracking.

To accomplish this goal, I've reorganized the workflow to support fitting the MLR with different variant classifications as in the forecasts-ncov workflow.
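
As a rough illustration of the derivation described above, here is a minimal Python sketch that builds a haplotype label from a clade name plus the HA1 substitutions accumulated since that clade's founder. The column names, label format, and example data below are assumptions for illustration only, not the exact logic of the workflow's scripts.

import pandas as pd

def derive_aa_haplotype(clade, ha1_substitutions):
    # Build a haplotype label from a clade plus HA1 substitutions relative to
    # that clade's founder. Records with no substitutions keep the clade name.
    if not ha1_substitutions:
        return clade
    # Sort substitutions by position so equivalent sets map to the same label.
    subs = sorted(ha1_substitutions, key=lambda sub: int(sub[1:-1]))
    return f"{clade}:{'-'.join(subs)}"

# Hypothetical records mimicking a Nextclade-style annotation table.
records = pd.DataFrame({
    "clade": ["J.2", "J.2", "C.1.9"],
    "ha1_substitutions": [["T160I", "K145N"], [], ["S84N"]],
})
records["aa_haplotype"] = [
    derive_aa_haplotype(clade, subs)
    for clade, subs in zip(records["clade"], records["ha1_substitutions"])
]
print(records["aa_haplotype"].tolist())  # ['J.2:K145N-T160I', 'J.2', 'C.1.9:S84N']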

Changes include:

  • Config YAML and workflow allow definition of one or more variant classifications (including the current emerging_haplotype and the new aa_haplotype)
  • Config YAML and workflow allow definition of one or more data provenances (gisaid)
  • Local MLR JSONs live in a new directory structure (see the path sketch after this list): results/{data_provenance}/{variant_classification}/{lineage}/{geo_resolution}/mlr/MLR_results.json
  • Remote MLR JSONs live in a new structure:
    • Trial path for this PR: s3://nextstrain-data/files/workflows/forecasts-flu/trial/forecast-aa-haplotypes/gisaid/emerging_haplotype/h3n2/region/mlr/MLR_results.json
    • Production path: s3://nextstrain-data/files/workflows/forecasts-flu/gisaid/emerging_haplotype/h3n2/region/mlr/MLR_results.json
  • Forecasts viz displays frequencies and GAs for both emerging haplotypes and amino acid haplotypes for the selected subtype and geographic resolution
  • Reduced the minimum number of sequences required per geographic location from 150 to 100, allowing us to catch new haplotypes in locations with sparser sequencing
  • Moved min_date and max_date to top-level config, since we want the same time periods for all models in a given run
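
As referenced above, here is a minimal sketch of how the per-model output paths compose from these config dimensions. The config dict shape, key names, and values are assumptions for illustration; the path templates come from this PR's description.

from itertools import product

# Assumed config shape; the real config is YAML and may use different keys.
config = {
    "data_provenances": ["gisaid"],
    "variant_classifications": ["emerging_haplotype", "aa_haplotype"],
    "lineages": ["h1n1pdm", "h3n2"],
    "geo_resolutions": ["region"],
}

LOCAL_TEMPLATE = (
    "results/{data_provenance}/{variant_classification}/"
    "{lineage}/{geo_resolution}/mlr/MLR_results.json"
)
S3_TEMPLATE = (
    "s3://nextstrain-data/files/workflows/forecasts-flu/"
    "{data_provenance}/{variant_classification}/{lineage}/{geo_resolution}/mlr/MLR_results.json"
)

# Enumerate local and remote output paths for every configured model run.
for provenance, classification, lineage, geo in product(
    config["data_provenances"],
    config["variant_classifications"],
    config["lineages"],
    config["geo_resolutions"],
):
    wildcards = {
        "data_provenance": provenance,
        "variant_classification": classification,
        "lineage": lineage,
        "geo_resolution": geo,
    }
    print(LOCAL_TEMPLATE.format(**wildcards))
    print(S3_TEMPLATE.format(**wildcards))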

Testing locally

To test the new visualization interface locally, run the following commands from inside this repo's directory and this branch:

cd viz

# Install dependencies.
npm ci

# Run local viz server.
npm run start

Open http://127.0.0.1:8000/

The following screenshot shows H1N1pdm regional results with log transform and raw data turned on for both variant classifications:


Outstanding issues

  • The current workflow assumes that we can use the same pivot for both emerging haplotypes and AA haplotypes. This assumption is reasonable as long as the pivot is a high-frequency variant that is likely to appear in both analyses. However, the AA haplotype analysis could partition a high-frequency variant into more granular variants such that the ancestral variant no longer has enough counts of its own to appear in the analysis. If that happens, we would need to either manually update the pivot or accept the default fallback behavior of setting the pivot to the last variant (see the sketch after this list).
  • Forecasts viz hardcodes the data provenance and does not provide an interface to select an alternate provenance.
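
A minimal sketch of the pivot fallback described in the first item above; the function name is hypothetical, though the fall-back-to-last-variant behavior matches the default described there.

def choose_pivot(variants, preferred_pivot):
    # Use the preferred pivot if it survived filtering into the variant list;
    # otherwise fall back to the last variant in the list.
    if preferred_pivot in variants:
        return preferred_pivot
    return variants[-1]

# Example: the ancestral variant "J.2" got partitioned into finer haplotypes
# and dropped out of the AA haplotype analysis, so the fallback applies.
print(choose_pivot(["C.1.9", "J.2:K145N-T160I", "J.2:S145N"], preferred_pivot="J.2"))
# -> "J.2:S145N"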

Checklist

  • Update changelog
  • Drop trial S3 path commit (159027e)
  • Deploy MLR JSONs to new S3 paths with run model action on this branch
  • Deploy the new viz site by merging this PR

@huddlej force-pushed the forecast-aa-haplotypes branch from cd111d6 to 1b72d97 on December 16, 2025 at 17:09
Reorganizes workflow to support fitting the MLR with different variant
classifications as in the forecasts-ncov workflow. In addition to the
original "emerging_haplotype" variant classification, I've added an
"aa_haplotype" classification which uses more granular, automated
haplotype assignments based on current clade annotation and all HA1
substitutions from that parent clade. We will likely need to tune the
minimum number of "clade" sequences per AA haplotype to allow rarer
haplotypes to appear in the analysis. For now, I've kept the same
thresholds for both variant classifications, though.

As part of this reorganization, I've also added support for different
data provenances.

I also realized that it did not make sense to implement different date
thresholds for each potential model output; all of the analyses we run
should represent the same time span. For this reason, I've moved
"min_date" and "max_date" to the top-level config.

More regions or countries are close to the original 150-sequence
threshold but don't get included. Lowering the threshold allows more
regions to be included while keeping the same minimum for clade
inclusion.
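
To illustrate the threshold tuning mentioned in the commit message above, here is a minimal sketch of collapsing rare AA haplotypes back to their parent clade. The threshold value and the collapse-to-parent behavior are assumptions for illustration, not necessarily how the workflow handles rare haplotypes.

from collections import Counter

def collapse_rare_haplotypes(haplotypes, min_sequences=10):
    # Count how many sequences carry each haplotype and collapse haplotypes
    # below the threshold back to their parent clade (the label before ":").
    counts = Counter(haplotypes)
    return [
        haplotype if counts[haplotype] >= min_sequences else haplotype.split(":")[0]
        for haplotype in haplotypes
    ]

# Example with a threshold of 2: the singleton haplotypes collapse to their clades.
print(collapse_rare_haplotypes(
    ["J.2:K145N", "J.2:K145N", "J.2:T160I", "C.1.9"],
    min_sequences=2,
))
# -> ['J.2:K145N', 'J.2:K145N', 'J.2', 'C.1.9']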
@huddlej force-pushed the forecast-aa-haplotypes branch from 7261427 to 1c17e4b on December 23, 2025 at 17:52
@huddlej marked this pull request as ready for review on December 23, 2025 at 18:19
@jameshadfield (Member) left a comment

The viz part is working well. A nice way to finish 2025; hopefully we can make 2026 the year we move this forward.

In terms of the modelling, I was surprised at the ~1y forecast window given (only) ~6m of fitted data. The CIs also look a little odd, e.g. H3N2 / Spain / K:88I is unexpectedly jagged, and many CIs rise before abruptly ending (maybe because the mean has reached zero?)

@joverlee521 left a comment

Can't comment on the actual results, but the workflow + viz changes look reasonable to me! Left a question about whether we should centralize where emerging/aa haplotypes get derived.

@joverlee521 left a review comment

This script looked familiar to me and I realized it's almost exactly the same as add_derived_haplotypes.py in seasonal-flu. Should we be running this (and assign_haplotypes.py) as part of run-nextclade.smk in seasonal-flu so that they can just be part of the Nextclade TSV that is downloaded here?

@huddlej (Contributor, Author) replied

Good point, @joverlee521. That's probably the right place to end up. For now, I like having the flexibility to reannotate the original Nextclade file in each workflow, but I made an issue in the seasonal-flu repo for this proposed change.

@huddlej (Contributor, Author) commented Dec 23, 2025

Thanks, @jameshadfield!

In terms of the modelling, I was surprised at the ~1y forecast window given (only) ~6m of fitted data.

The 1-year horizon is there to match our original eLife model's horizon. We will most likely change this in the near future (ha!), since there is so much uncertainty that it's often not informative to have that long of a horizon.

The CIs also look a little odd, e.g. H3N2 / Spain / K:88I is unexpectedly jagged, and many CIs rise before abruptly ending (maybe because the mean has reached zero?)

I'll look into this more, but I think these are separate issues. I only notice the last one you mention when the HPDIs converge to the same value as the median as a variant is predicted to fix. There is some jaggedness in the log transform that you don't see as much in the standard view, which probably reflects some numerical rounding issues.

huddlej and others added 2 commits December 23, 2025 13:26
Update forecasts viz interface to display MLR results for multiple
variant classifications on a single page per subtype and geographic
resolution, following the pattern from forecasts-ncov [1].

Since the `useModelData` function lives outside of this repo and
expects a single object with a `modelUrl` attribute, this commit creates
two separate config objects, one per variant classification. We call
the model URL function once per config after the initial config object
is copied for each variant classification. This function now accepts
arguments for the variant classification (to get the correct S3 URL) and
for model date (so we only need the date-based update logic in one
place).

As I implemented this expanded version of the display, I found it
simpler to write the panel titles (the `h2` tags) and descriptions right
in the HTML instead of maintaining a separate title per variant
classification, subtype, and geographic resolution. This new
implementation mimics the forecasts-ncov approach at the expense of some
flexibility in defining different titles and descriptions per
subtype (which we have never needed to do).

[1] https://github.com/nextstrain/forecasts-ncov/blob/940791b/viz/src/App.jsx

Co-authored-by: james hadfield <hadfield.james@gmail.com>
@huddlej force-pushed the forecast-aa-haplotypes branch from 25b85fa to b25557b on December 23, 2025 at 21:26
@huddlej merged commit 2ca14c2 into main on Dec 23, 2025
7 checks passed
@huddlej deleted the forecast-aa-haplotypes branch on December 23, 2025 at 22:03