Skip to content

Conversation

@anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Dec 19, 2025

resolves #

Summary

Add a new field to the sequences_entries_view called ena_deposition_status which is an enum of type:

enum class EnaDepositionStatus {
    INGESTED,
    READY,
    DEPOSITED,
}

This column is used as a coarse-grained filter option when calling the get-released-data endpoint.

By adding the param filterForEnaDeposition the endpoints stops streaming sequences that were ingested from the INSDC or have been deposited by us in the INSDC with external metadata updated. Edge cases, such as revised versions, restricted data etc. are still handled by the ena cronjob.

The cronjob now takes 1min and 30seconds to run instead of the 15-20min it currently takes.

14:40:08     INFO (           config.py: 135) - Submitting to ENA dev environment
...
14:41:31     INFO (     call_loculus.py: 160) - Fetching released data from https://backend-staging.pathoplexus.org/west-nile/get-released-data with request id 71af1dab-c6b1-4254-b7ae-5da099d29856

I also confirmed with spot checking that the new status was set correctly for these submissions.

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://deposited-status-filter.loculus.org

@anna-parker anna-parker force-pushed the deposited_status_filter branch from f98fb5d to a972104 Compare December 19, 2025 14:28
@anna-parker anna-parker changed the title feat(backend, deposition): add a deposition status to the backend and… feat(backend, deposition): add a deposition status to the backend and use to filter get-released-data prior to streaming Dec 19, 2025
@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 16:36
@claude
Copy link
Contributor

claude bot commented Dec 19, 2025

Code Review

This PR implements a coarse-grained filter for the get-released-data endpoint to improve performance of the ENA deposition cronjob. The change reduces execution time from 15-20 minutes to ~1.5 minutes by filtering out sequences that don't need ENA submission at the database level.

Overall Assessment

The implementation is well-designed and achieves its performance goal effectively. The code quality is good with clear separation of concerns between database logic, API layer, and client code.


Positive Aspects

Performance improvement: Dramatic reduction in execution time (15-20min → 1.5min) by filtering at the database level rather than in application code
Clean architecture: The ena_deposition_status is computed in the database view, keeping business logic centralized
Backward compatibility: The new parameter is optional with a sensible default (false), so existing clients continue to work
Clear logic: The three-state enum (INGESTED, READY, DEPOSITED) clearly represents the lifecycle


Issues & Concerns

🔴 Critical: Missing Test Coverage

The PR adds a new query parameter and significant database logic but doesn't include any tests. This is concerning because:

  1. No test for the new parameter: There are no tests verifying that enaDeposition=true correctly filters sequences
  2. No test for the status computation: The ena_deposition_status CASE logic in the view isn't tested
  3. Edge cases untested: What happens with:
    • Sequences that have external metadata from sources other than ENA?
    • Group ID 1 sequences (INSDC ingest) that also have ENA updater metadata?
    • Revisions of deposited sequences?

Recommendation: Add tests to GetReleasedDataEndpointTest.kt covering:

  • Fetching with enaDeposition=true returns only READY sequences
  • Fetching with enaDeposition=false (default) returns all sequences
  • Verification that INGESTED and DEPOSITED sequences are correctly excluded when filter is enabled

🟡 Potential Logic Issue: Hardcoded Group ID

WHEN (se.group_id = 1) THEN 'INGESTED'::text

The use of hardcoded group_id = 1 for INSDC ingest appears in multiple places:

  • backend/docs/db/schema.sql:474
  • backend/src/main/resources/db/migration/V1.25__add_ena_deposition_status.sql:20
  • preprocessing/nextclade/src/loculus_preprocessing/config.py:91

Concerns:

  1. Magic number without documentation in the SQL
  2. If the INSDC group ID ever changes (database reset, different deployment), this logic breaks silently
  3. No validation that group ID 1 exists and is actually the INSDC ingest group

Recommendations:

  • Add a SQL comment explaining what group_id = 1 represents
  • Consider using a configuration value or named constant that's validated at startup
  • Or query for the group by name rather than hardcoded ID

🟡 SQL Migration Duplicates View Logic

The migration file V1.25__add_ena_deposition_status.sql contains the entire view definition, which duplicates the logic in schema.sql. This is standard for Flyway migrations, but creates maintenance burden.

Observation: If the view logic changes in the future, developers must remember to update both files. Consider adding a comment in schema.sql pointing to the migration file.

🟡 Parameter Naming Inconsistency

The parameter is called enaDeposition (camelCase) in the API:

@RequestParam(name = "enaDeposition", required = false, defaultValue = "false") enaDeposition: Boolean

But in Python it's passed as a query parameter with the same name. Consider if a more descriptive name like filterForEnaDeposition or excludeDepositedSequences would be clearer about what the parameter does.

🟡 Missing API Documentation

The new query parameter isn't documented in the controller:

@RequestParam(
    name = "enaDeposition",
    required = false,
    defaultValue = "false",
) enaDeposition: Boolean,

Recommendation: Add a @Parameter annotation with description, similar to the existing ifNoneMatch parameter:

@Parameter(
    description = "(Optional) Filter to return only sequences ready for ENA deposition (excludes already deposited and ingested sequences)",
)
@RequestParam(name = "enaDeposition", required = false, defaultValue = "false") enaDeposition: Boolean,

Code Quality Observations

Minor: Formatting Issue in SubmissionDatabaseService.kt

The conditional logic around line 721-726 has inconsistent formatting:

(
    if (enaDeposition) {
        (SequenceEntriesView.enDepositionStatusIs(EnaDepositionStatus.READY))
    } else {
        Op.TRUE
    }
)

The extra parentheses around the entire conditional and around READY check are unnecessary. Could be simplified to:

if (enaDeposition) {
    SequenceEntriesView.enDepositionStatusIs(EnaDepositionStatus.READY)
} else {
    Op.TRUE
}

Good: Type Safety

Nice use of the EnaDepositionStatus enum rather than string literals throughout the code.


Security & Performance

SQL Injection: No concerns - using Exposed framework's type-safe query builders
Performance: The filter is applied at the database level with a CASE statement in the view, which should be efficient
Backward compatibility: Optional parameter means no breaking changes


Recommendations Summary

Must address:

  1. Add comprehensive tests for the new filtering behavior
  2. Document the hardcoded group_id = 1 or make it configurable

Should address:
3. Add @Parameter documentation for the new query parameter
4. Consider a more descriptive parameter name

Nice to have:
5. Clean up unnecessary parentheses in the conditional logic
6. Add SQL comment explaining the group_id constant


Questions

  1. What happens if a sequence is both ingested (group_id = 1) AND has external metadata with ENA updater? Which status takes precedence? The current logic makes DEPOSITED take precedence, is this intentional?

  2. Are there integration tests covering the end-to-end ENA submission flow that would catch regressions?

  3. Should there be monitoring/metrics on how many sequences are filtered by this endpoint?

Overall, this is a solid performance optimization with clean implementation. The main concern is the lack of tests for the new behavior. With tests added, this would be ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preview Triggers a deployment to argocd update_db_schema

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants