@JoiRichi JoiRichi commented Oct 6, 2025

The main repository is here: https://github.com/nextstrain/lassa

Member

rneher commented Dec 10, 2025

Thank you @JoiRichi ! This looks very good! I have a few questions and suggestions:

  • Why do you include a separate GPC dataset? GPC is encoded on the S segment, right? Is there an additional benefit from including a separate GPC dataset (sorry if this is a stupid question, I don't know much about Lassa)?
  • I think you need to use higher gap opening penalties. The defaults were chosen for things like SARS-CoV-2, but Lassa is very diverse, and one needs to penalize gaps more relative to mismatches. The current parameters result in quite a few frameshifts that are probably not real.
  • I noticed that in the UTRs of S, there are many private mutations. That makes me think that the alignment used for the dataset tree and the one used by Nextclade might not be exactly compatible. Also, there are very many example sequences for S, which makes testing a bit cumbersome.

Let me know if you have questions, happy to iterate on this a bit!

Author

JoiRichi commented Dec 15, 2025

Hi Prof. Neher,

Thank you very much for your comments and suggestions.

One of our main goals is to optimize these builds for clinical usage. That means minimizing errors in lineage calls and phylogenetic placement as much as possible. In other words, if a sample is called lineage II, we want to be confident enough in that call to support downstream decisions—e.g., administering lineage II antibodies, ideally informed by the closest placements on the tree.

Why a specific GPC dataset?

While GPC is encoded on the S segment, we include a GPC-focused dataset primarily to ensure that—especially in clinical settings—the relevant lineage call and nearest-neighbor relationships reflect the glycoprotein.

There is early evidence for intra-segmental recombination (including within the S segment), which could in principle lead to cases where NP derives from a different lineage and might confuse lineage calls when using the full S segment alone:
https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2024.1411537/full

From a clinical perspective (antibody administration), what we care about most is the lineage of the GPC and the closest GPC strain for which we have derived neutralizing antibodies. The logic is that even if other parts of the virus belong to another lineage, as long as the GPC belongs to a specific lineage, it should be reasonable to prioritize antibodies developed against that lineage’s GPC.

Gap penalties / frameshifts

Agreed on the gap penalties. At the moment, our focus has been slightly different: we prioritized robust lineage calling and GPC mutation tracking, rather than a full optimization of all Nextclade alignment parameters. Concretely, we empirically tuned one parameter—minimum seed cover—to ensure that only LASV sequences (and not other mammarenaviruses) get assigned lineages and processed. As we move into regular maintenance, we will systematically tune the remaining parameters.
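
As a rough illustration of what that tuning could look like, here is a minimal Python sketch that overrides the alignment parameters in a `pathogen.json` dictionary. The key names (`alignmentParams`, `penaltyGapOpen`, `penaltyGapOpenInFrame`, `penaltyGapOpenOutOfFrame`, `minSeedCover`) reflect my understanding of the Nextclade v3 dataset schema and should be checked against the Nextclade documentation. The numeric values and the helper name `with_alignment_params` are hypothetical placeholders, not tuned settings for LASV.

```python
import json

# Hypothetical, untuned values: chosen only to show gap opening being
# penalized more heavily relative to mismatches. Not validated for Lassa.
HYPOTHETICAL_OVERRIDES = {
    "penaltyGapOpen": 12,
    "penaltyGapOpenInFrame": 14,
    "penaltyGapOpenOutOfFrame": 16,
}


def with_alignment_params(pathogen: dict, overrides: dict) -> dict:
    """Return a copy of a pathogen.json dict with alignmentParams overridden.

    Existing alignment parameters (e.g. minSeedCover) are preserved unless
    they are explicitly overridden. The input dict is not modified.
    """
    updated = dict(pathogen)
    params = dict(pathogen.get("alignmentParams", {}))
    params.update(overrides)
    updated["alignmentParams"] = params
    return updated


if __name__ == "__main__":
    # Assumed starting point: only the seed-cover parameter has been tuned.
    pathogen = {"alignmentParams": {"minSeedCover": 0.1}}
    print(json.dumps(with_alignment_params(pathogen, HYPOTHETICAL_OVERRIDES), indent=2))
```

Keeping the overrides in one dictionary should make it straightforward to scan a few candidate settings and compare the number of reported frameshifts for each.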

UTRs and alignment challenges

LASV is extremely divergent across lineages, making a high-quality alignment difficult. In some analyses, UTRs (and intergenic regions) are removed and only coding regions are used (e.g., https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0006971). We suspect the UTR and intergenic regions need population-scale study (e.g., length variation across lineages, indels) before we can handle them optimally. We have started comprehensive work in this direction, and the results should inform better masking/trimming and alignment strategies for UTRs. In the meantime, I will also check specifically for any incompatibility between the dataset-tree alignment and what Nextclade uses.
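
One simple way to quantify the UTR issue is to split the private mutation positions of a sequence into coding and non-coding counts. The sketch below assumes the positions have already been extracted as 1-based reference coordinates. The function name `count_by_region` and the coordinates in the example are hypothetical and do not correspond to the real S segment annotation.

```python
def count_by_region(positions, cds_ranges):
    """Count how many positions fall inside vs. outside coding regions.

    positions:  iterable of 1-based reference positions of private mutations.
    cds_ranges: list of (start, end) tuples, 1-based and inclusive.
    """
    positions = list(positions)
    coding = sum(
        any(start <= pos <= end for start, end in cds_ranges)
        for pos in positions
    )
    return {"coding": coding, "noncoding": len(positions) - coding}


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    example_cds = [(10, 100), (150, 250)]
    example_positions = [5, 50, 120, 300]
    print(count_by_region(example_positions, example_cds))
```

A consistent excess of non-coding private mutations across many sequences would point toward an alignment mismatch in the UTRs and intergenic region rather than genuine diversity.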

Example sequences

If I understood your comment correctly: we included many example sequences to improve resolution and coverage. In addition, to further test the S segment build, we ran it on 300+ extra sequences taken from the GPC Nextstrain build that were not included in the S Nextclade dataset.

I would be very happy to continue iterating on this with you, and I’d greatly appreciate any further feedback.
