Conversation

@theosanderson
Member

@theosanderson theosanderson commented Dec 15, 2025

resolves #

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://restructure.loculus.org

@theosanderson theosanderson added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Dec 15, 2025
@theosanderson theosanderson added the preview Triggers a deployment to argocd label Dec 15, 2025
@loculus-project loculus-project deleted a comment from claude bot Dec 15, 2025
@theosanderson
Member Author

@claude Write up a summary of changes in this PR

@claude
Contributor

claude bot commented Dec 15, 2025

Claude finished @theosanderson's task in 1m 55s


Summary of PR Changes

This PR restructures how reference genomes are organized and managed across the Loculus platform, moving from a "suborganism-first" to a "segment-first" data model. The refactoring touches both the infrastructure configuration and the website frontend.

Core Architectural Change: Segment-First Reference Genome Structure

Previous Model (Suborganism-First)

The old structure organized reference genomes by "suborganism" (essentially a reference name), with segments nested underneath:

referenceGenomes[suborganism][segment] = { sequence, genes }

New Model (Segment-First)

The new structure organizes by segment first, with multiple references available per segment:

referenceGenomes[segment][reference] = { sequence, insdcAccessionFull, genes }

This model is more flexible and better represents biological reality: different segments of multi-segmented organisms can use different reference strains.
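
To make the two shapes concrete, here is a minimal TypeScript sketch of both models. The type names, the shape of `genes` (gene name mapped to a sequence string), and the example names `refA`, `refB`, `L` and `S` are assumptions for illustration only; they are not taken from the PR's code.

```typescript
// Hypothetical types: only the nesting order and the field names
// (sequence, insdcAccessionFull, genes) come from the summary above.
type GenesSketch = Record<string, string>; // assumed: gene name -> sequence

// Old model: referenceGenomes[suborganism][segment]
type SuborganismFirstSketch = Record<
    string,
    Record<string, { sequence: string; genes: GenesSketch }>
>;

// New model: referenceGenomes[segment][reference]
type SegmentFirstSketch = Record<
    string,
    Record<string, { sequence: string; insdcAccessionFull?: string; genes: GenesSketch }>
>;

const oldShape: SuborganismFirstSketch = {
    refA: {
        L: { sequence: "ACGT", genes: {} },
        S: { sequence: "TTGA", genes: {} },
    },
};

const newShape: SegmentFirstSketch = {
    L: {
        refA: { sequence: "ACGT", genes: {} },
        refB: { sequence: "ACGA", genes: {} },
    },
    S: {
        refA: { sequence: "TTGA", genes: {} },
    },
};

// In the new shape each segment carries its own list of references.
const referencesPerSegment: Record<string, string[]> = {};
for (const segment of Object.keys(newShape)) {
    referencesPerSegment[segment] = Object.keys(newShape[segment]);
}
```

In this example segment `L` offers two references while `S` offers one, which the suborganism-first nesting could only express by repeating whole suborganism entries.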

Key Changes by Component

Kubernetes Configuration (kubernetes/loculus/)

values.yaml (+245/-220 lines)

  • Restructured the entire reference genome configuration to use the new segment-first format
  • Updated organism configurations to reflect the new data model

values.schema.json (+47/-57 lines)

  • Updated JSON schema to validate the new segment-first structure
  • Added new field onlyForReferenceName to metadata to replace onlyForSuborganism

_merged-reference-genomes.tpl (+65/-37 lines)

  • Complete rewrite of the Helm template that transforms reference genomes into LAPIS configuration
  • Now handles two modes:
    • Single-reference mode: When all segments have only one reference, no prefixing is used
    • Multi-reference mode: When segments have multiple references, LAPIS names are prefixed with reference name (e.g., CV-A16-main, CV-A16-VP4)
  • This template is critical for generating the proper nucleotide sequences and genes for LAPIS
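
The naming rule described above can be sketched as follows. The real logic lives in a Helm template, so this TypeScript version only illustrates the rule: the function name, the input shape, and the `-` separator (inferred from the `CV-A16-main` example) are assumptions, and `otherRef` is a made-up reference name.

```typescript
// Hypothetical input: segment -> reference -> list of gene names.
type NamingInputSketch = Record<string, Record<string, { genes: string[] }>>;

function lapisNamesSketch(genomes: NamingInputSketch): {
    nucleotideSequences: string[];
    genes: string[];
} {
    const segments = Object.keys(genomes);
    // Single-reference mode: every segment has exactly one reference.
    const singleReferenceMode = segments.every(
        (segment) => Object.keys(genomes[segment]).length === 1,
    );
    const nucleotideSequences: string[] = [];
    const genes: string[] = [];
    for (const segment of segments) {
        for (const reference of Object.keys(genomes[segment])) {
            // Assumed: in multi-reference mode every name gets the prefix.
            const prefix = singleReferenceMode ? "" : `${reference}-`;
            nucleotideSequences.push(prefix + segment);
            for (const gene of genomes[segment][reference].genes) {
                genes.push(prefix + gene);
            }
        }
    }
    return { nucleotideSequences, genes };
}

// Multi-reference example: names are prefixed with the reference name.
const multiReferenceNames = lapisNamesSketch({
    main: { "CV-A16": { genes: ["VP4"] }, otherRef: { genes: ["VP4"] } },
});
// Single-reference example: names are left unprefixed.
const singleReferenceNames = lapisNamesSketch({
    main: { onlyRef: { genes: ["VP4"] } },
});
```

How the actual template treats a configuration where only some segments have several references is not stated in the summary; the sketch assumes that any such segment switches the whole organism to prefixed names.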

Website Frontend Changes

New Type Definitions (website/src/types/referencesGenomes.ts)

  • Introduced new types: SegmentFirstReferenceGenomes, ReferenceGenomesLightweightSchema
  • Defined clear type structure for segment-first organization
  • Added helper types for segment names, reference names, and gene names

New Components & Utilities

SegmentReferenceSelector.tsx (NEW, +155 lines)

  • Brand new UI component for selecting references on a per-segment basis
  • Features two display modes:
    • Single segment: Simple dropdown selector
    • Multi-segment: Tabbed interface where each tab represents a segment
  • Visual indicators show which segments have references selected
  • Includes clear button to deselect references

getSegmentAndGeneInfo.tsx (NEW, +57 lines)

  • New utility function to extract segment and gene information from the segment-first schema
  • Handles both single-reference and multi-reference modes
  • Returns structured data about available segments, genes, and whether organism is multi-segmented

isActiveForSelectedReferenceName.tsx (NEW, +17 lines)

  • Helper to determine if a metadata field should be active based on selected reference
  • Replaces previous isActiveForSelectedSuborganism logic
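
A minimal sketch of such a check, assuming a metadata field carries an optional `onlyForReferenceName` and the caller passes the reference names currently selected across segments. The function name and signature are hypothetical; the PR's helper may take different arguments.

```typescript
// Hypothetical field shape: only onlyForReferenceName comes from the summary.
interface MetadataFieldSketch {
    name: string;
    onlyForReferenceName?: string;
}

function isActiveForReferenceSketch(
    field: MetadataFieldSketch,
    selectedReferenceNames: string[],
): boolean {
    // Fields without a restriction are always active.
    if (field.onlyForReferenceName === undefined) {
        return true;
    }
    // Restricted fields are active only when their reference is selected.
    return selectedReferenceNames.indexOf(field.onlyForReferenceName) !== -1;
}
```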

stillRequiresReferenceNameSelection.tsx (NEW, +12 lines)

  • Helper to check if reference selection is still needed for metadata fields
  • Part of the new reference selection workflow

Updated Components

Search Page Components (Multiple files, ~200 lines changed)

  • SearchForm.tsx, SearchFullUI.tsx: Integrated new SegmentReferenceSelector
  • DownloadDialog.tsx, DownloadForm.tsx: Updated to work with segment-first model
  • SuborganismSelector.tsx: Renamed internal references but maintains backward compatibility
  • MutationField.tsx: Updated to use new reference selection helpers

Sequence Details Page (Multiple files, ~150 lines changed)

  • getTableData.ts: Major refactoring to generate table data from segment-first structure
  • DataTable.tsx, SequenceDataUI.tsx: Updated to display data in new format
  • SequencesContainer.tsx: Updated to handle segment-first gene and segment info

Configuration & State Management

  • config.ts (+42/-24 lines): Added getReferenceGenomeLightweightSchema() function to transform heavy reference genome data into lightweight schema for UI
  • useSearchPageState.ts (+19/-1 lines): Added SegmentReferenceSelections type and state management
  • useUrlParamState.ts (+20/-3 lines): Enhanced to support new reference selection in URL parameters
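
One way per-segment selections could round-trip through a query string is sketched below. The summary does not give the actual parameter names or the shape of `SegmentReferenceSelections`, so the `reference_` prefix, the type, and both function names are assumptions.

```typescript
// Hypothetical: segment name -> selected reference, or null when none is selected.
type SegmentReferenceChoicesSketch = Record<string, string | null>;

const REFERENCE_PARAM_PREFIX = "reference_"; // assumed parameter naming

function choicesToQuerySketch(choices: SegmentReferenceChoicesSketch): string {
    return Object.keys(choices)
        .filter((segment) => choices[segment] !== null)
        .sort()
        .map(
            (segment) =>
                encodeURIComponent(REFERENCE_PARAM_PREFIX + segment) +
                "=" +
                encodeURIComponent(choices[segment] as string),
        )
        .join("&");
}

function queryToChoicesSketch(
    query: string,
    segments: string[],
): SegmentReferenceChoicesSketch {
    // Parse "key=value" pairs; pairs without "=" are ignored.
    const found: Record<string, string> = Object.create(null);
    for (const pair of query.split("&")) {
        const separatorIndex = pair.indexOf("=");
        if (separatorIndex === -1) continue;
        const key = decodeURIComponent(pair.slice(0, separatorIndex));
        found[key] = decodeURIComponent(pair.slice(separatorIndex + 1));
    }
    // Segments without a parameter are reported as unselected.
    const choices: SegmentReferenceChoicesSketch = {};
    for (const segment of segments) {
        const value = found[REFERENCE_PARAM_PREFIX + segment];
        choices[segment] = value === undefined ? null : value;
    }
    return choices;
}
```

Unselected segments are simply left out of the query string, so a URL with no reference parameters maps back to "nothing selected".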

Test Updates

  • All affected test files updated to use segment-first data structures
  • Test data migrated from suborganism-first to segment-first format
  • ~400 lines of test updates across multiple test files

Renamed/Removed Files

  • Removed: getSuborganismSegmentAndGeneInfo.tsx and its tests (replaced by getSegmentAndGeneInfo.tsx)
  • These utilities were specific to the old suborganism-first model

Breaking Changes

This is a breaking change that requires:

  1. Configuration migration: All values.yaml files must be restructured to use segment-first format
  2. API/Schema changes: The metadata field onlyForSuborganism is replaced with onlyForReferenceName
  3. State management: URL parameters and internal state now track references differently

Technical Benefits

  1. More flexible reference management: Each segment can independently use different references
  2. Clearer separation of concerns: Segment structure is separated from reference selection
  3. Better UI/UX: Users can now select references per segment with visual feedback
  4. More accurate biological representation: Multi-segmented organisms can have segment-specific references
  5. Simplified single-reference mode: When only one reference exists, no prefixing is needed, which keeps names cleaner

Migration Notes

The PR maintains some backward compatibility through the selectedSuborganism concept in state management, but the underlying data structures have fundamentally changed. Existing deployments will need to:

  • Update their values.yaml reference genome configurations
  • Update any metadata fields using onlyForSuborganism to use onlyForReferenceName
  • Test the new segment reference selection UI
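
The first two migration steps amount to flipping the nesting order and renaming one metadata key. The sketch below shows both on plain objects, assuming that old suborganism names carry over unchanged as reference names; the function names are hypothetical, and the new `insdcAccessionFull` field would still need to be filled in separately. The actual migration happens in `values.yaml`, not in TypeScript.

```typescript
// Flip old[suborganism][segment] into flipped[segment][reference].
// Assumes each suborganism name becomes the reference name.
function flipToSegmentFirstSketch<T>(
    old: Record<string, Record<string, T>>,
): Record<string, Record<string, T>> {
    const flipped: Record<string, Record<string, T>> = Object.create(null);
    for (const suborganism of Object.keys(old)) {
        for (const segment of Object.keys(old[suborganism])) {
            if (flipped[segment] === undefined) {
                flipped[segment] = Object.create(null);
            }
            flipped[segment][suborganism] = old[suborganism][segment];
        }
    }
    return flipped;
}

// Rename onlyForSuborganism to onlyForReferenceName, keeping all other keys.
function renameRestrictionKeySketch(
    field: Record<string, unknown>,
): Record<string, unknown> {
    const renamed: Record<string, unknown> = {};
    for (const key of Object.keys(field)) {
        const newKey = key === "onlyForSuborganism" ? "onlyForReferenceName" : key;
        renamed[newKey] = field[key];
    }
    return renamed;
}
```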

Files Changed Summary

  • 55 files modified: +1,538 insertions, -972 deletions
  • 4 new files added for segment-first functionality
  • 2 files removed (old suborganism-specific utilities)
  • Heavy concentration of changes in:
    • Configuration layer (Kubernetes templates and schemas)
    • Type definitions and utilities
    • Search and sequence details UI components
    • Test files to match new data structures

@theosanderson theosanderson added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Dec 15, 2025
@theosanderson theosanderson changed the title wip demo: flip config to segment-then-reference Dec 15, 2025
@theosanderson theosanderson marked this pull request as draft December 15, 2025 17:08
@anna-parker
Contributor

Notes: we will need to add a check in prepro (and probably also change the config for this) to ensure that a sample cannot have multiple references for the same segment, as this is a segment duplication and we don't want to allow that per sample. (At the moment we would only error if there are multiple sequences with the same reference.)
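
The rule in this note can be sketched as below. This is only an illustration in TypeScript, not prepro code: the record shape, the function name, and the idea that each sequence has already been assigned a segment and a reference are all assumptions.

```typescript
// Hypothetical: one entry per sequence in a sample, after assignment.
interface AssignedSequenceSketch {
    sequenceId: string;
    segment: string;
    reference: string;
}

// Returns the segments that were matched to more than one distinct reference.
function findDuplicatedSegmentsSketch(
    assigned: AssignedSequenceSketch[],
): string[] {
    const referencesBySegment: Record<string, string[]> = Object.create(null);
    for (const entry of assigned) {
        const seen = referencesBySegment[entry.segment] || [];
        if (seen.indexOf(entry.reference) === -1) {
            seen.push(entry.reference);
        }
        referencesBySegment[entry.segment] = seen;
    }
    return Object.keys(referencesBySegment)
        .filter((segment) => referencesBySegment[segment].length > 1)
        .sort();
}
```

A sample would be rejected when the returned list is non-empty. The sketch deliberately ignores two sequences that share the same segment and the same reference, since the note says that case already produces an error.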

3 participants