feat(backend, deposition): add a deposition status to the backend and use to filter get-released-data prior to streaming #5767

anna-parker · 2025-12-19T10:01:55Z

resolves #

Summary

Add a new field to the sequences_entries_view called ena_deposition_status which is an enum of type:

enum class EnaDepositionStatus {
    INGESTED,
    READY,
    DEPOSITED,
}

This column is used as a coarse-grained filter option when calling the get-released-data endpoint.

By adding the param filterForEnaDeposition the endpoints stops streaming sequences that were ingested from the INSDC or have been deposited by us in the INSDC with external metadata updated. Edge cases, such as revised versions, restricted data etc. are still handled by the ena cronjob.

The cronjob now takes 1min and 30seconds to run instead of the 15-20min it currently takes.

14:40:08     INFO (           config.py: 135) - Submitting to ENA dev environment
...
14:41:31     INFO (     call_loculus.py: 160) - Fetching released data from https://backend-staging.pathoplexus.org/west-nile/get-released-data with request id 71af1dab-c6b1-4254-b7ae-5da099d29856

I also confirmed with spot checking that the new status was set correctly for these submissions.

PR Checklist

All necessary documentation has been adapted.
The implemented feature is covered by appropriate, automated tests.
Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://deposited-status-filter.loculus.org

… use to restrict number of sequences that need to be streamed

claude · 2025-12-19T16:38:29Z

Code Review

This PR implements a coarse-grained filter for the get-released-data endpoint to improve performance of the ENA deposition cronjob. The change reduces execution time from 15-20 minutes to ~1.5 minutes by filtering out sequences that don't need ENA submission at the database level.

Overall Assessment

The implementation is well-designed and achieves its performance goal effectively. The code quality is good with clear separation of concerns between database logic, API layer, and client code.

Positive Aspects

✅ Performance improvement: Dramatic reduction in execution time (15-20min → 1.5min) by filtering at the database level rather than in application code
✅ Clean architecture: The ena_deposition_status is computed in the database view, keeping business logic centralized
✅ Backward compatibility: The new parameter is optional with a sensible default (false), so existing clients continue to work
✅ Clear logic: The three-state enum (INGESTED, READY, DEPOSITED) clearly represents the lifecycle

Issues & Concerns

🔴 Critical: Missing Test Coverage

The PR adds a new query parameter and significant database logic but doesn't include any tests. This is concerning because:

No test for the new parameter: There are no tests verifying that enaDeposition=true correctly filters sequences
No test for the status computation: The ena_deposition_status CASE logic in the view isn't tested
Edge cases untested: What happens with:
- Sequences that have external metadata from sources other than ENA?
- Group ID 1 sequences (INSDC ingest) that also have ENA updater metadata?
- Revisions of deposited sequences?

Recommendation: Add tests to GetReleasedDataEndpointTest.kt covering:

Fetching with enaDeposition=true returns only READY sequences
Fetching with enaDeposition=false (default) returns all sequences
Verification that INGESTED and DEPOSITED sequences are correctly excluded when filter is enabled

🟡 Potential Logic Issue: Hardcoded Group ID

WHEN (se.group_id = 1) THEN 'INGESTED'::text

The use of hardcoded group_id = 1 for INSDC ingest appears in multiple places:

backend/docs/db/schema.sql:474
backend/src/main/resources/db/migration/V1.25__add_ena_deposition_status.sql:20
preprocessing/nextclade/src/loculus_preprocessing/config.py:91

Concerns:

Magic number without documentation in the SQL
If the INSDC group ID ever changes (database reset, different deployment), this logic breaks silently
No validation that group ID 1 exists and is actually the INSDC ingest group

Recommendations:

Add a SQL comment explaining what group_id = 1 represents
Consider using a configuration value or named constant that's validated at startup
Or query for the group by name rather than hardcoded ID

🟡 SQL Migration Duplicates View Logic

The migration file V1.25__add_ena_deposition_status.sql contains the entire view definition, which duplicates the logic in schema.sql. This is standard for Flyway migrations, but creates maintenance burden.

Observation: If the view logic changes in the future, developers must remember to update both files. Consider adding a comment in schema.sql pointing to the migration file.

🟡 Parameter Naming Inconsistency

The parameter is called enaDeposition (camelCase) in the API:

@RequestParam(name = "enaDeposition", required = false, defaultValue = "false") enaDeposition: Boolean

But in Python it's passed as a query parameter with the same name. Consider if a more descriptive name like filterForEnaDeposition or excludeDepositedSequences would be clearer about what the parameter does.

🟡 Missing API Documentation

The new query parameter isn't documented in the controller:

@RequestParam(
    name = "enaDeposition",
    required = false,
    defaultValue = "false",
) enaDeposition: Boolean,

Recommendation: Add a @Parameter annotation with description, similar to the existing ifNoneMatch parameter:

@Parameter(
    description = "(Optional) Filter to return only sequences ready for ENA deposition (excludes already deposited and ingested sequences)",
)
@RequestParam(name = "enaDeposition", required = false, defaultValue = "false") enaDeposition: Boolean,

Code Quality Observations

Minor: Formatting Issue in SubmissionDatabaseService.kt

The conditional logic around line 721-726 has inconsistent formatting:

(
    if (enaDeposition) {
        (SequenceEntriesView.enDepositionStatusIs(EnaDepositionStatus.READY))
    } else {
        Op.TRUE
    }
)

The extra parentheses around the entire conditional and around READY check are unnecessary. Could be simplified to:

if (enaDeposition) {
    SequenceEntriesView.enDepositionStatusIs(EnaDepositionStatus.READY)
} else {
    Op.TRUE
}

Good: Type Safety

Nice use of the EnaDepositionStatus enum rather than string literals throughout the code.

Security & Performance

✅ SQL Injection: No concerns - using Exposed framework's type-safe query builders
✅ Performance: The filter is applied at the database level with a CASE statement in the view, which should be efficient
✅ Backward compatibility: Optional parameter means no breaking changes

Recommendations Summary

Must address:

Add comprehensive tests for the new filtering behavior
Document the hardcoded group_id = 1 or make it configurable

Should address:
3. Add @Parameter documentation for the new query parameter
4. Consider a more descriptive parameter name

Nice to have:
5. Clean up unnecessary parentheses in the conditional logic
6. Add SQL comment explaining the group_id constant

Questions

What happens if a sequence is both ingested (group_id = 1) AND has external metadata with ENA updater? Which status takes precedence? The current logic makes DEPOSITED take precedence, is this intentional?
Are there integration tests covering the end-to-end ENA submission flow that would catch regressions?
Should there be monitoring/metrics on how many sequences are filtered by this endpoint?

Overall, this is a solid performance optimization with clean implementation. The main concern is the lack of tests for the new behavior. With tests added, this would be ready to merge.

anna-parker added preview Triggers a deployment to argocd update_db_schema labels Dec 19, 2025

anna-parker mentioned this pull request Dec 19, 2025

feat(deposition): read data with orjson instead of json, add better logging, increase chunk size #5765

Merged

3 tasks

anna-parker and others added 5 commits December 19, 2025 15:28

feat(backend, deposition): add a deposition status to the backend and…

4a41984

… use to restrict number of sequences that need to be streamed

format

286155b

wuppsi

625e782

Update schema documentation based on migration changes

fa410e9

create preview

a972104

anna-parker force-pushed the deposited_status_filter branch from f98fb5d to a972104 Compare December 19, 2025 14:28

more logs

2c13c06

anna-parker changed the title ~~feat(backend, deposition): add a deposition status to the backend and…~~ feat(backend, deposition): add a deposition status to the backend and use to filter get-released-data prior to streaming Dec 19, 2025

anna-parker marked this pull request as ready for review December 19, 2025 16:36

anna-parker added 6 commits December 19, 2025 18:02

add a silly test

a3e7b4c

add docs and better naming

ae0044b

improve

10018fe

fix

1631171

improve testing

15c1555

fix tests

4fc012b

anna-parker requested review from corneliusroemer and theosanderson December 19, 2025 18:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(backend, deposition): add a deposition status to the backend and use to filter get-released-data prior to streaming #5767

feat(backend, deposition): add a deposition status to the backend and use to filter get-released-data prior to streaming #5767

Uh oh!

anna-parker commented Dec 19, 2025 •

edited by loculus-bot

Loading

Uh oh!

claude bot commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat(backend, deposition): add a deposition status to the backend and use to filter get-released-data prior to streaming #5767

Are you sure you want to change the base?

feat(backend, deposition): add a deposition status to the backend and use to filter get-released-data prior to streaming #5767

Uh oh!

Conversation

anna-parker commented Dec 19, 2025 • edited by loculus-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

PR Checklist

Uh oh!

claude bot commented Dec 19, 2025

Code Review

Overall Assessment

Positive Aspects

Issues & Concerns

🔴 Critical: Missing Test Coverage

🟡 Potential Logic Issue: Hardcoded Group ID

🟡 SQL Migration Duplicates View Logic

🟡 Parameter Naming Inconsistency

🟡 Missing API Documentation

Code Quality Observations

Minor: Formatting Issue in SubmissionDatabaseService.kt

Good: Type Safety

Security & Performance

Recommendations Summary

Questions

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anna-parker commented Dec 19, 2025 •

edited by loculus-bot

Loading