
Conversation

@huddlej (Contributor) commented Dec 13, 2025

Description of proposed changes

The main goal of this PR is to automatically fit the MLR to more granular amino acid haplotypes based on current clade annotation and all HA1 substitutions from that parent clade. This finer granularity will allow us to detect new haplotypes that we should be tracking.

To accomplish this goal, I've reorganized the workflow to support fitting the MLR with different variant classifications as in the forecasts-ncov workflow.
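
As a rough illustration of the derivation described above, here is a minimal Python sketch that builds a haplotype label from a clade name plus the HA1 substitutions accumulated since that clade's founder. The column names, label format, and example data below are assumptions for illustration only, not the exact logic of the workflow's scripts.

import pandas as pd

def derive_aa_haplotype(clade, ha1_substitutions):
    # Build a haplotype label from a clade plus HA1 substitutions relative to
    # that clade's founder. Records with no substitutions keep the clade name.
    if not ha1_substitutions:
        return clade
    # Sort substitutions by position so equivalent sets map to the same label.
    subs = sorted(ha1_substitutions, key=lambda sub: int(sub[1:-1]))
    return f"{clade}:{'-'.join(subs)}"

# Hypothetical records mimicking a Nextclade-style annotation table.
records = pd.DataFrame({
    "clade": ["J.2", "J.2", "C.1.9"],
    "ha1_substitutions": [["T160I", "K145N"], [], ["S84N"]],
})
records["aa_haplotype"] = [
    derive_aa_haplotype(clade, subs)
    for clade, subs in zip(records["clade"], records["ha1_substitutions"])
]
print(records["aa_haplotype"].tolist())  # ['J.2:K145N-T160I', 'J.2', 'C.1.9:S84N']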

Changes include:

  • Config YAML and workflow allow definition of one or more variant classifications (including the current emerging_haplotype and the new aa_haplotype)
  • Config YAML and workflow allow definition of one or more data provenances (gisaid)
  • Local MLR JSONs live in a new directory structure (see the path sketch after this list): results/{data_provenance}/{variant_classification}/{lineage}/{geo_resolution}/mlr/MLR_results.json
  • Remote MLR JSONs live in a new structure:
    • Trial path for this PR: s3://nextstrain-data/files/workflows/forecasts-flu/trial/forecast-aa-haplotypes/gisaid/emerging_haplotype/h3n2/region/mlr/MLR_results.json
    • Production path: s3://nextstrain-data/files/workflows/forecasts-flu/gisaid/emerging_haplotype/h3n2/region/mlr/MLR_results.json
  • Forecasts viz displays frequencies and GAs for both emerging haplotypes and amino acid haplotypes for the selected subtype and geographic resolution
  • Reduced the minimum number of sequences required per geographic location from 150 to 100, allowing us to catch new haplotypes in locations with sparser sequencing
  • Moved min_date and max_date to top-level config, since we want the same time periods for all models in a given run
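
As referenced above, here is a minimal sketch of how the per-model output paths compose from these config dimensions. The config dict shape, key names, and values are assumptions for illustration; the path templates come from this PR's description.

from itertools import product

# Assumed config shape; the real config is YAML and may use different keys.
config = {
    "data_provenances": ["gisaid"],
    "variant_classifications": ["emerging_haplotype", "aa_haplotype"],
    "lineages": ["h1n1pdm", "h3n2"],
    "geo_resolutions": ["region"],
}

LOCAL_TEMPLATE = (
    "results/{data_provenance}/{variant_classification}/"
    "{lineage}/{geo_resolution}/mlr/MLR_results.json"
)
S3_TEMPLATE = (
    "s3://nextstrain-data/files/workflows/forecasts-flu/"
    "{data_provenance}/{variant_classification}/{lineage}/{geo_resolution}/mlr/MLR_results.json"
)

# Enumerate local and remote output paths for every configured model run.
for provenance, classification, lineage, geo in product(
    config["data_provenances"],
    config["variant_classifications"],
    config["lineages"],
    config["geo_resolutions"],
):
    wildcards = {
        "data_provenance": provenance,
        "variant_classification": classification,
        "lineage": lineage,
        "geo_resolution": geo,
    }
    print(LOCAL_TEMPLATE.format(**wildcards))
    print(S3_TEMPLATE.format(**wildcards))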

Testing locally

To test the new visualization interface locally, run the following commands from inside this repo's directory and this branch:

cd viz

# Install dependencies.
npm ci

# Run local viz server.
npm run start

Open http://127.0.0.1:8000/

The following screenshot shows H1N1pdm regional results with log transform and raw data turned on for both variant classifications:


Outstanding issues

  • The current workflow assumes that we can use the same pivot for both emerging haplotypes and AA haplotypes. This assumption is reasonable as long as the pivot is a high-frequency variant that is likely to appear in both analyses. However, the AA haplotype analysis could partition a high-frequency variant into more granular variants such that the ancestral variant no longer has enough counts of its own to appear in the analysis. If that happens, we would need to either manually update the pivot or accept the default fallback behavior of setting the pivot to the last variant (see the sketch after this list).
  • Forecasts viz hardcodes the data provenance and does not provide an interface to select an alternate provenance.
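
A minimal sketch of the pivot fallback described in the first item above; the function name is hypothetical, though the fall-back-to-last-variant behavior matches the default described there.

def choose_pivot(variants, preferred_pivot):
    # Use the preferred pivot if it survived filtering into the variant list;
    # otherwise fall back to the last variant in the list.
    if preferred_pivot in variants:
        return preferred_pivot
    return variants[-1]

# Example: the ancestral variant "J.2" got partitioned into finer haplotypes
# and dropped out of the AA haplotype analysis, so the fallback applies.
print(choose_pivot(["C.1.9", "J.2:K145N-T160I", "J.2:S145N"], preferred_pivot="J.2"))
# -> "J.2:S145N"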

Checklist

  • Update changelog
  • Drop trial S3 path commit (159027e)
  • Deploy MLR JSONs to new S3 paths with run model action on this branch
  • Deploy the new viz site by merging this PR

@huddlej force-pushed the forecast-aa-haplotypes branch from cd111d6 to 1b72d97 on December 16, 2025 at 17:09
Reorganizes workflow to support fitting the MLR with different variant
classifications as in the forecasts-ncov workflow. In addition to the
original "emerging_haplotype" variant classification, I've added an
"aa_haplotype" classification which uses more granular, automated
haplotype assignments based on current clade annotation and all HA1
substitutions from that parent clade. We will likely need to tune the
minimum number of "clade" sequences per AA haplotype to allow rarer
haplotypes to appear in the analysis. For now, I've kept the same
thresholds for both variant classifications, though.

As part of this reorganization, I've also added support for different
data provenances.

I also realized that it did not make sense to implement different date
thresholds for each potential model output; all of the analyses we run
should represent the same time span. For this reason, I've moved
"min_date" and "max_date" to the top-level config.

More regions or countries are close to the original 150-sequence
threshold but don't get included. Lowering the threshold allows more
regions to be included while keeping the same minimum for clade
inclusion.
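
To illustrate the threshold tuning mentioned in the commit message above, here is a minimal sketch of collapsing rare AA haplotypes back to their parent clade. The threshold value and the collapse-to-parent behavior are assumptions for illustration, not necessarily how the workflow handles rare haplotypes.

from collections import Counter

def collapse_rare_haplotypes(haplotypes, min_sequences=10):
    # Count how many sequences carry each haplotype and collapse haplotypes
    # below the threshold back to their parent clade (the label before ":").
    counts = Counter(haplotypes)
    return [
        haplotype if counts[haplotype] >= min_sequences else haplotype.split(":")[0]
        for haplotype in haplotypes
    ]

# Example with a threshold of 2: the singleton haplotypes collapse to their clades.
print(collapse_rare_haplotypes(
    ["J.2:K145N", "J.2:K145N", "J.2:T160I", "C.1.9"],
    min_sequences=2,
))
# -> ['J.2:K145N', 'J.2:K145N', 'J.2', 'C.1.9']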
@huddlej force-pushed the forecast-aa-haplotypes branch from 7261427 to 1c17e4b on December 23, 2025 at 17:52
@huddlej marked this pull request as ready for review on December 23, 2025 at 18:19
@jameshadfield (Member) left a comment

The viz part is working well. A nice way to finish 2025; hopefully we can make 2026 the year we move this forward.

In terms of the modelling, I was surprised at the ~1y forecast window given (only) ~6m of fitted data. The CIs also look a little odd, e.g. H3N2 / Spain / K:88I is unexpectedly jagged, and many CIs rise before abruptly ending (maybe because the mean has reached zero?)

@joverlee521 left a comment

Can't comment on the actual results, but the workflow + viz changes look reasonable to me! Left a question about whether we should centralize where emerging/aa haplotypes get derived.

@joverlee521 left a review comment

This script looked familiar to me and I realized it's almost exactly the same as add_derived_haplotypes.py in seasonal-flu. Should we be running this (and assign_haplotypes.py) as part of run-nextclade.smk in seasonal-flu so that they can just be part of the Nextclade TSV that is downloaded here?

@huddlej (Contributor, Author) replied

Good point, @joverlee521. That's probably the right place to end up. For now, I like having the flexibility to reannotate the original Nextclade file in each workflow, but I made an issue in the seasonal-flu repo for this proposed change.

@huddlej (Contributor, Author) commented Dec 23, 2025

Thanks, @jameshadfield!

In terms of the modelling, I was surprised at the ~1y forecast window given (only) ~6m of fitted data.

The 1-year horizon is there to match our original eLife model's horizon. We will most likely change this in the near future (ha!), since there is so much uncertainty that it's often not informative to have that long of a horizon.

The CIs also look a little odd, e.g. H3N2 / Spain / K:88I is unexpectedly jagged, and many CIs rise before abruptly ending (maybe because the mean has reached zero?)

I'll look into this more, but I think these are separate issues. I only notice the last one you mention when the HPDIs converge to the same value as the median as a variant is predicted to fix. There is some jaggedness in the log transform that you don't see as much in the standard view, which probably reflects some numerical rounding issues.

huddlej and others added 2 commits December 23, 2025 13:26
Update forecasts viz interface to display MLR results for multiple
variant classifications on a single page per subtype and geographic
resolution, following the pattern from forecasts-ncov [1].

Since the `useModelData` function lives outside of this repo and
expects a single object with a `modelUrl` attribute, this commit creates
two separate config objects, one per variant classification. We call
the model URL function once per config after the initial config object
is copied for each variant classification. This function now accepts
arguments for the variant classification (to get the correct S3 URL) and
for model date (so we only need the date-based update logic in one
place).

As I implemented this expanded version of the display, I found it
simpler to write the panel titles (the `h2` tags) and descriptions right
in the HTML instead of maintaining a separate title per variant
classification, subtype, and geographic resolution. This new
implementation mimics the forecasts-ncov approach at the expense of some
flexibility in defining different titles and descriptions per
subtype (which we have never needed to do).

[1] https://github.com/nextstrain/forecasts-ncov/blob/940791b/viz/src/App.jsx

Co-authored-by: james hadfield <hadfield.james@gmail.com>
@huddlej force-pushed the forecast-aa-haplotypes branch from 25b85fa to b25557b on December 23, 2025 at 21:26
@huddlej merged commit 2ca14c2 into main on Dec 23, 2025
7 checks passed
@huddlej deleted the forecast-aa-haplotypes branch on December 23, 2025 at 22:03