Adding Lassa virus #362

JoiRichi · 2025-10-06T22:09:58Z

The main repository is here https://github.com/nextstrain/lassa

rneher · 2025-12-10T08:21:08Z

Thank you @JoiRichi! This looks very good! I have a few questions and suggestions:

Why do you include a separate GPC dataset? GPC is encoded on the S segment, right? Is there an additional benefit to a separate GPC dataset? (Sorry if this is a stupid question, I don't know much about Lassa.)
I think you need to use higher gap opening penalties. The defaults were chosen for things like SARS-CoV-2, whereas Lassa is very diverse, so gaps need to be penalized more heavily relative to mismatches. The current parameters result in quite a few frameshifts that are probably not real.
I noticed that there are many private mutations in the UTRs of S. That makes me think that the alignment used for the dataset tree and the one used by Nextclade might not be exactly compatible. Also, there are very many example sequences for S, which makes testing a bit cumbersome.

Let me know if you have questions, happy to iterate on this a bit!

JoiRichi · 2025-12-15T03:44:56Z

Hi Prof. Neher,

Thank you very much for your comments and suggestions.

One of our main goals is to optimize these builds for clinical usage. That means minimizing errors in lineage calls and phylogenetic placement as much as possible. In other words, if a sample is called lineage II, we want to be confident enough in that call to support downstream decisions—e.g., administering lineage II antibodies, ideally informed by the closest placements on the tree.

Why a specific GPC dataset?

While GPC is encoded on the S segment, we include a GPC-focused dataset primarily to ensure that—especially in clinical settings—the relevant lineage call and nearest-neighbor relationships reflect the glycoprotein.

There is early evidence for intra-segmental recombination (including within the S segment), which could in principle lead to cases where NP derives from a different lineage and might confuse lineage calls when using the full S segment alone:
https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2024.1411537/full

From a clinical perspective (antibody administration), what we care about most is the lineage of the GPC and the closest GPC strain for which we have derived neutralizing antibodies. The logic is that even if other parts of the virus belong to another lineage, as long as the GPC belongs to a specific lineage, it should be reasonable to prioritize antibodies developed against that lineage’s GPC.
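
As an illustration of the concern above, the following minimal sketch compares per-sample lineage calls from an S-segment run and a GPC run and lists the samples where the two disagree. It assumes Nextclade-style TSV output with `seqName` and `clade` columns; the actual lineage column in these datasets may be named differently, and the sample names and calls below are made up.

```python
import csv
import io


def read_calls(tsv_text, name_col="seqName", call_col="clade"):
    """Map sequence name -> lineage call from TSV text.

    The column names are assumptions; adjust them to the dataset's output.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[name_col]: row[call_col] for row in reader}


def discordant_calls(s_calls, gpc_calls):
    """Return {name: (S call, GPC call)} for samples present in both runs
    whose calls differ. Samples seen in only one run are ignored."""
    shared = s_calls.keys() & gpc_calls.keys()
    return {
        name: (s_calls[name], gpc_calls[name])
        for name in sorted(shared)
        if s_calls[name] != gpc_calls[name]
    }


# Hypothetical toy data, not real results.
S_TSV = "seqName\tclade\nsample_A\tII\nsample_B\tIV\nsample_C\tII\n"
GPC_TSV = "seqName\tclade\nsample_A\tII\nsample_B\tII\nsample_D\tIII\n"

flagged = discordant_calls(read_calls(S_TSV), read_calls(GPC_TSV))
for name, (s_call, gpc_call) in flagged.items():
    print(f"{name}: S segment = {s_call}, GPC = {gpc_call}")
```

A discordant sample is not proof of recombination, only a prompt to inspect that sequence more closely before relying on the full-segment call.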

Gap penalties / frameshifts

Agreed on the gap penalties. At the moment, our focus has been slightly different: we prioritized robust lineage calling and GPC mutation tracking, rather than a full optimization of all Nextclade alignment parameters. Concretely, we empirically tuned one parameter—minimum seed cover—to ensure that only LASV sequences (and not other mammarenaviruses) get assigned lineages and processed. As we move into regular maintenance, we will systematically tune the remaining parameters.
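
To make the gap-penalty point concrete, here is a small sketch that scores two candidate alignments of the same toy query against a toy reference under a simple affine gap model. The sequences and penalty values are illustrative assumptions, not Nextclade's actual parameters, and the scoring is much simpler than Nextclade's codon-aware scheme. It only shows how a low gap-open penalty can favor a pair of compensating indels (two frameshifts) over a short run of mismatches in a divergent region.

```python
def alignment_score(ref, qry, score_match=3, penalty_mismatch=1,
                    penalty_gap_open=6, penalty_gap_extend=0):
    """Score a fixed pairwise alignment ('-' marks a gap).

    Convention used here: the first column of a gap run costs
    penalty_gap_open, each further column costs penalty_gap_extend.
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None  # row holding the current gap run: "ref", "qry" or None
    for r, q in zip(ref, qry):
        if r == "-" or q == "-":
            row = "ref" if r == "-" else "qry"
            score -= penalty_gap_extend if gap_row == row else penalty_gap_open
            gap_row = row
        else:
            gap_row = None
            score += score_match if r == q else -penalty_mismatch
    return score


# Same toy query, two candidate alignments (illustrative sequences).
UNGAPPED = ("ACGTTGCAAGCT",
            "ACGTGCATAGCT")    # 8 matches, 4 mismatches, no gaps
GAPPED = ("ACGTTGCA-AGCT",
          "ACGT-GCATAGCT")     # 11 matches, one deletion plus one insertion

for gap_open in (6, 12):
    ungapped = alignment_score(*UNGAPPED, penalty_gap_open=gap_open)
    gapped = alignment_score(*GAPPED, penalty_gap_open=gap_open)
    winner = "gapped" if gapped > ungapped else "ungapped"
    print(f"gap open {gap_open}: ungapped={ungapped}, gapped={gapped} -> {winner}")
```

With a gap-open penalty of 6 the gapped alignment narrowly wins (21 vs 20); doubling the penalty to 12 makes the ungapped alignment the clear winner (20 vs 9), which is the behavior we want for a virus as divergent as LASV.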

UTRs and alignment challenges

LASV is extremely divergent across lineages, which makes a high-quality alignment difficult. In some analyses, UTRs (and intergenic regions) are removed and only coding regions are used (e.g., https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0006971). We suspect the UTRs and intergenic regions need population-scale study (e.g., of length variation and indels across lineages) before we can handle them optimally. We have started comprehensive work in this direction, and the results should inform better masking/trimming and alignment strategies for UTRs. In the meantime, I will also check specifically for any incompatibility between the dataset-tree alignment and what Nextclade uses.
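
As a rough sketch of the coding-regions-only approach mentioned above, the helper below masks every position outside a set of coding ranges in a reference-aligned sequence. The function name and the coordinates in the example are hypothetical placeholders, not the real LASV annotation, and the ranges are assumed to be 0-based and end-exclusive.

```python
def mask_noncoding(aligned_seq, coding_ranges, mask_char="N"):
    """Replace every position outside the given coding ranges with mask_char.

    coding_ranges: iterable of (start, end) pairs in reference coordinates,
    0-based and end-exclusive (an assumption of this sketch). aligned_seq
    must already be in reference coordinates (insertions stripped).
    """
    keep = [False] * len(aligned_seq)
    for start, end in coding_ranges:
        for i in range(max(start, 0), min(end, len(aligned_seq))):
            keep[i] = True
    return "".join(
        base if kept else mask_char for base, kept in zip(aligned_seq, keep)
    )


# Toy example with two made-up coding ranges separated by an intergenic gap.
masked = mask_noncoding("AAACCCGGGTTT", [(3, 6), (9, 12)])
print(masked)  # NNNCCCNNNTTT
```

Masking keeps sequence lengths and coordinates unchanged, so downstream tools see the same positions for the coding regions, whereas trimming would shift them.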

Example sequences

If I understood your comment correctly: we included many example sequences to improve resolution and coverage. In addition, to further test the S segment build, we ran it on 300+ extra sequences taken from the GPC Nextstrain build that were not included in the S Nextclade dataset.

I would be very happy to continue iterating on this with you, and I’d greatly appreciate any further feedback.

added Lassa

d90e6bd
