
Conversation

@fengelniederhammer (Contributor) commented Dec 11, 2025

resolves #4868

An alternative approach to #4904. Instead of changing the format of the /get-released-data endpoint, we transform the data in the silo-import to the new input format that SILO requires as of 0.8.

The transformation and its test were copied (via an LLM) from https://github.com/GenSpectrum/LAPIS-SILO/blob/bb4bd655864ad41819873f0dd3102e9bfc6ddb61/tools/legacyNdjsonTransformer/main.rs.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)
    • the preview contains ingested sequences

🚀 Preview: https://4868-update-to-silo-08-an.loculus.org

@fengelniederhammer added the preview (Triggers a deployment to argocd) label Dec 11, 2025
@fengelniederhammer force-pushed the 4868-update-to-silo-08-and-adapt-input-data-for-silo branch from 084e057 to 14145cc on December 11, 2025 09:23
@fengelniederhammer changed the base branch from main to format December 11, 2025 09:34
@fengelniederhammer linked an issue Dec 11, 2025 that may be closed by this pull request
@fengelniederhammer marked this pull request as ready for review December 11, 2025 09:35
Copilot AI left a comment

Pull request overview

This PR updates the SILO version to 0.8.6 and introduces a data transformation layer in the silo-import process. The key changes involve transforming downloaded NDJSON records from a nested structure (with separate metadata, sequences, and insertions fields) into a flattened format before passing them to SILO.
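
As a rough illustration, the flattening can be pictured like this (a minimal sketch using the field names discussed in this review; illustrative only, not the PR's actual code):

def flatten_record_sketch(record: dict) -> dict:
    """Hypothetical sketch: lift metadata to the top level and merge each
    segment's sequence with its insertions (nucleotide segments only)."""
    out = dict(record.get("metadata") or {})
    sequences = record.get("alignedNucleotideSequences") or {}
    insertions = record.get("nucleotideInsertions") or {}
    for segment, sequence in sequences.items():
        out[segment] = {
            "sequence": sequence,
            "insertions": insertions.get(segment, []),
        }
    return out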

Key changes:

  • Implements transform_record() function to flatten and combine sequence data with insertions
  • Updates the import pipeline to perform transformation during the download/decompression phase
  • Adjusts test expectations to match the new transformed data format

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • silo-import/src/silo_import/decompressor.py: Adds record transformation logic and integrates it into the analysis pipeline, creating a separate transformed output file
  • silo-import/src/silo_import/download_manager.py: Updates download flow to use the new transformation function and handle transformed data paths
  • silo-import/src/silo_import/constants.py: Adds constants for raw and transformed data filenames
  • silo-import/tests/test_decompressor.py: Adds comprehensive tests for the new transform_record function
  • silo-import/tests/test_runner.py: Updates test data structure to match transformed format
  • silo-import/tests/test_integration.py: Updates integration tests to verify transformed output format
  • silo-import/tests/helpers.py: Improves NDJSON reading to use streaming decompression
  • kubernetes/loculus/values.yaml: Updates SILO image version to 0.8.6


raise RuntimeError(msg) from exc

transformed_json = json.dumps(transformed, separators=(",", ":"))
writer.write(f"{transformed_json}\n".encode())
Copilot AI commented Dec 11, 2025

Encoding each line individually creates multiple small byte objects. Consider encoding the entire string (including newline) in one operation, or accumulate lines in a buffer and write in batches for better performance when processing large files.
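
A minimal sketch of the batching alternative, assuming writer is the zstandard stream writer from the snippet above (the class, method names, and batch size are illustrative):

import json

class BatchedNdjsonWriter:
    """Illustrative: buffer serialized lines and hand them to the compressor in batches."""

    def __init__(self, writer, batch_size: int = 10_000):
        self.writer = writer
        self.batch_size = batch_size
        self.pending: list[str] = []

    def write_record(self, transformed: dict) -> None:
        self.pending.append(json.dumps(transformed, separators=(",", ":")) + "\n")
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Must also be called once after the last record so the tail of the buffer is written.
        if self.pending:
            self.writer.write("".join(self.pending).encode())
            self.pending.clear()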

claude bot (Contributor) commented Dec 11, 2025

PR Review: Update to SILO 0.8.6 with Data Transformation

Summary

This PR successfully updates the LAPIS-SILO version from 0.7.10.2 to 0.8.6 and implements the required data transformation to adapt to SILO 0.8's breaking changes. The implementation introduces a single-pass decompression-transformation-compression pipeline that efficiently transforms NDJSON records into the new format expected by SILO 0.8.


✅ Strengths

  1. Efficient Single-Pass Architecture: The analyze_and_transform_ndjson function cleverly combines decompression, analysis, and transformation in a single pass, minimizing I/O overhead and memory usage.

  2. Comprehensive Test Coverage:

    • New unit tests for transform_record covering both null values and populated data
    • Updated integration tests to verify transformed output format
    • All tests properly adapted to expect the new transformed format
  3. Clean Separation of Concerns: The transform_record function is well-isolated, making it easy to test and understand the transformation logic.

  4. Backward Compatibility: The PR maintains the old DATA_FILENAME constant while introducing new ones (RAW_DATA_FILENAME, TRANSFORMED_DATA_FILENAME), showing careful consideration for the codebase structure.

  5. Error Handling: Transformation errors are caught and wrapped with helpful context, including the line number where the error occurred (decompressor.py:132).


🔍 Issues & Recommendations

Critical Issues

  1. Resource Cleanup Risk in analyze_and_transform_ndjson (decompressor.py:102-107)

    The function opens multiple resources using a grouped context manager:

    with (
        path.open("rb") as compressed_input,
        decompressor.stream_reader(compressed_input) as reader,
        transformed_path.open("wb") as compressed_output,
        compressor.stream_writer(compressed_output) as writer,
    ):

    Issue: If the transformation fails mid-stream, the output file will be left in a partially written state. The except zstandard.ZstdError block on line 137 doesn't clean up the incomplete transformed_path file.

    Recommendation: Add cleanup in the exception handler:

    except zstandard.ZstdError as exc:
        msg = f"Failed to compress/decompress {path}: {exc}"
        if transformed_path.exists():
            transformed_path.unlink()
        raise RuntimeError(msg) from exc

    Also consider adding a similar cleanup for the transformation exception block (lines 129-133).

  2. Broad Exception Handling (decompressor.py:129)

    except Exception as exc:
        msg = f"Failed to transform record at line {record_count}: {exc}"
        raise RuntimeError(msg) from exc

    Issue: Catching bare Exception is too broad. (In Python 3, KeyboardInterrupt and SystemExit derive from BaseException and so are not caught here, but such a wide handler still swallows unrelated bugs and makes them harder to diagnose.)

    Recommendation: Be more specific about what exceptions you expect. Based on transform_record, you'd likely get TypeError, KeyError, or AttributeError. Consider:

    except (TypeError, KeyError, AttributeError, ValueError) as exc:

Medium Priority Issues

  1. Type Safety in transform_record (decompressor.py:27-73)

    The function performs runtime type checks (isinstance(metadata, dict)) but the type hints suggest record: dict is sufficient. The actual structure is more specific - it's a dict with known keys like "metadata", "alignedNucleotideSequences", etc.

    Recommendation: Consider using TypedDict for better type safety:

    from typing import Any, NotRequired, TypedDict
    
    class InputRecord(TypedDict):
        metadata: NotRequired[dict[str, Any]]
        alignedNucleotideSequences: NotRequired[dict[str, str | None]]
        # ... other fields
    
    def transform_record(record: InputRecord) -> dict:

    This would catch issues at development time rather than runtime.

  2. Potential Data Loss in Hash Comparison (download_manager.py:187)

    The code now compares hashes of transformed_path instead of the raw downloaded data. If the transformation logic has a bug that produces different output for the same input across runs, the hash check won't catch duplicate downloads of the same raw data.

    Recommendation: Consider comparing the hash of the raw data file instead of the transformed one, or log both hashes for debugging purposes.
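
    One possible way to log both hashes (a hypothetical helper; function and logger names are illustrative, not from the PR):

    import hashlib
    import logging
    from pathlib import Path

    logger = logging.getLogger(__name__)

    def file_sha256(path: Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def log_download_hashes(raw_path: Path, transformed_path: Path) -> None:
        logger.info(
            "raw data hash=%s, transformed data hash=%s",
            file_sha256(raw_path),
            file_sha256(transformed_path),
        )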

  3. Missing Edge Case Handling in transform_record

    The function doesn't handle cases where:

    • record itself is not a dict (though it checks isinstance for nested fields)
    • Keys have unexpected types (e.g., nucleotideInsertions.get(segment_key, []) assumes a list)

    Example problematic input:

    {
        "metadata": {"key": "id1"},
        "alignedNucleotideSequences": {"segment1": "ATCG"},
        "nucleotideInsertions": {"segment1": "not_a_list"}  # Bug\!
    }

    This would silently produce incorrect output rather than failing loudly.

    Recommendation: Add validation or use .get(segment_key, []) with a type check.
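
    For example, a small guard along these lines would fail loudly instead of silently producing wrong output (a sketch; names follow this review, not the PR's code):

    def get_insertions(insertions_by_segment: dict, segment_key: str) -> list:
        """Return the insertions list for a segment, rejecting unexpected types."""
        values = insertions_by_segment.get(segment_key, [])
        if not isinstance(values, list):
            msg = (
                f"Expected a list of insertions for segment {segment_key!r}, "
                f"got {type(values).__name__}"
            )
            raise TypeError(msg)
        return values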

Minor Issues

  1. Inconsistent Error Message (download_manager.py:138)

    The error message says "Failed to compress/decompress" but the original message in decompressor.py is more specific. Consider being more precise about which operation failed.

  2. Test Helper Modification (tests/helpers.py:23)

    The change from decompressor.decompress(handle.read()) to using stream_reader is good for consistency, but consider documenting why this change was made (probably to handle large files better).
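
    For reference, a streaming read along these lines looks roughly like this (a sketch, not the helper's exact code):

    import io
    import json
    import zstandard

    def read_ndjson(path):
        # Decompress and parse the NDJSON line by line instead of reading the
        # whole compressed file into memory first.
        decompressor = zstandard.ZstdDecompressor()
        with open(path, "rb") as handle, decompressor.stream_reader(handle) as reader:
            text = io.TextIOWrapper(reader, encoding="utf-8")
            for line in text:
                if line.strip():
                    yield json.loads(line)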

  3. Unused Constant (constants.py:9)

    DATA_FILENAME is now unused after the refactoring. Consider removing it in a follow-up PR or documenting why it's kept.


🔒 Security Considerations

Good: No obvious security vulnerabilities introduced.

Notes:

  • The transformation doesn't add sanitization, but this is acceptable since the data comes from an internal backend
  • Compressed data is handled safely through the zstandard library
  • No user-provided paths or code execution

⚡ Performance Considerations

Excellent: The single-pass design is optimal for performance.

Additional observations:

  • Memory efficiency: Streaming approach processes one line at a time rather than loading entire files
  • I/O efficiency: Single read and single write operation
  • CPU trade-off: Simultaneous decompression and compression may use more CPU, but this is acceptable given the I/O savings

Potential optimization: Consider using zstandard.ZstdCompressor(level=...) to tune compression level based on your priorities (speed vs. size).
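
For illustration (level 3 is the zstd default; higher levels trade CPU for smaller output — the file name and payload here are placeholders):

import zstandard

# Hypothetical: pick the compression level explicitly instead of relying on the default.
compressor = zstandard.ZstdCompressor(level=3)
with open("transformed.ndjson.zst", "wb") as compressed_output:
    with compressor.stream_writer(compressed_output) as writer:
        writer.write(b'{"example": true}\n')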


📋 Testing

Coverage: ✅ Good

  • Unit tests for transformation logic
  • Integration tests updated to verify end-to-end behavior
  • Edge cases covered (null values, populated data)

Missing test cases:

  1. Malformed insertion data: What happens if nucleotideInsertions[segment_key] is not a list?
  2. Large records: The streaming approach should handle very large records, but no test verifies this
  3. Partial write failure: Test behavior when disk fills up mid-write
  4. Non-dict metadata: Test with "metadata": null or "metadata": "string"
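
A sketch of how the first missing case could be covered, assuming the validation recommended earlier is added so that a malformed insertions value raises an error (the import path and expected exception are assumptions, not taken from the PR):

import pytest

from silo_import.decompressor import transform_record  # import path assumed

def test_transform_record_rejects_non_list_insertions():
    record = {
        "metadata": {"key": "id1"},
        "alignedNucleotideSequences": {"segment1": "ATCG"},
        "nucleotideInsertions": {"segment1": "not_a_list"},
    }
    # Only holds once the recommended type check is implemented.
    with pytest.raises(TypeError):
        transform_record(record)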

📝 Documentation

Missing:

  1. Module docstring in decompressor.py:1 could mention the SILO 0.8 transformation requirement
  2. transform_record would benefit from examples in the docstring showing the before/after format
  3. No comment explaining why both raw and transformed files are kept (only transformed is used)

Suggestion:

def transform_record(record: dict) -> dict:
    """
    Transform a SILO 0.7 format record to SILO 0.8 format.
    
    SILO 0.8 expects a flattened structure where sequences and their
    insertions are combined into objects with 'sequence' and 'insertions' fields.
    
    Example:
        Input:
            {
                "metadata": {"accession": "SEQ1"},
                "alignedNucleotideSequences": {"main": "ATCG"},
                "nucleotideInsertions": {"main": ["123:A"]}
            }
        Output:
            {
                "accession": "SEQ1",
                "main": {
                    "sequence": "ATCG",
                    "insertions": ["123:A"]
                }
            }
    """

🎯 Alignment with Project Conventions

✅ The PR follows the project's conventions:

  • Proper conventional commit style in title (though could be more specific: "feat(silo-import): update to SILO 0.8.6 with data transformation")
  • Detailed PR description
  • Tests included and updated
  • Manual testing documented

🏁 Verdict

Recommendation: ✅ Approve with minor changes

This is a well-implemented PR that successfully adapts the codebase to SILO 0.8's requirements. The core transformation logic is sound and the single-pass architecture is elegant. The critical issues around resource cleanup and exception handling should be addressed before merging, but they're straightforward fixes.

Priority fixes before merge:

  1. Add cleanup of partial transformed_path file on errors
  2. Narrow exception handling in transformation block
  3. Add validation for edge cases in transform_record

Nice-to-haves for follow-up:

  • Add TypedDict for better type safety
  • Enhance documentation with examples
  • Add test cases for malformed data
  • Consider logging both raw and transformed file hashes

Great work overall! The performance improvement from SILO 0.8 (90 min → 20 min for 9.3M sequences) will be a huge win for users. 🚀

@fengelniederhammer changed the title from "4868 silo 0.8.6" to "chore(kubernetes, silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job" Dec 11, 2025
Base automatically changed from format to main December 11, 2025 10:45
@fengelniederhammer force-pushed the 4868-update-to-silo-08-and-adapt-input-data-for-silo branch from 701330c to 39f4df1 on December 11, 2025 10:46
@theosanderson (Member) left a comment

LGTM - one issue raised by Cornelius

@fengelniederhammer (Author) commented:

I'll merge this after Christmas so that I don't risk breaking Loculus main over the holidays. If you want to merge it earlier, feel free to go ahead.


def transform_record(record: dict) -> dict:
    """
    Transform a single JSON record.
Contributor commented:

We should link to the SILO 0.8 data format requirements/readme here.

@anna-parker (Contributor) commented:

A thought raised by @corneliusroemer: why don't we just call the Rust binary (it is probably faster)?

@fengelniederhammer (Author) replied:

A thought raised by @corneliusroemer: why don't we just call the Rust binary (it is probably faster)?

It's not published anywhere properly. I would leave that as a possible improvement for the future.

@anna-parker (Contributor) replied:

It's not published anywhere properly. I would leave that as a possible improvement for the future.

I've added it here: #5733

@corneliusroemer (Contributor) commented Dec 17, 2025

It's not published anywhere properly. I would leave that as a possible improvement for the future.

Why don't you publish it? What's the point of having it if it's not usable?

We can also just compile it in the Dockerfile... Cargo is pretty easy to use.

SPECIAL_ETAG_NONE = "0"

# File names for downloaded data
DATA_FILENAME = "data.ndjson.zst"
Contributor commented:

I think there might be some confusion between DATA_FILENAME and RAW_DATA_FILENAME? I would think DATA_FILENAME would be replaced by RAW_DATA_FILENAME?

@fengelniederhammer (Author) replied:

What confusion? Why would it be replaced?

Contributor replied:

I mean this script should just transform DATA_FILENAME into TRANSFORMED_DATA_FILENAME; I don't think we need an intermediate file?

@fengelniederhammer (Author) replied Dec 17, 2025:

Previously there were two files:

  • /<dataVersion>/data.ndjson.zst
  • data.ndjson.zst (on top level)

Now there are three files:

  • in the data version dir: the raw downloaded file
  • in the data version dir: the transformed file
  • on top level: the transformed file that has been copied here

I found it too confusing that the original (untransformed) file has the same name as the output file because it hides that there has been a transformation in between. That's why I named the three files accordingly.

@fengelniederhammer (Author) commented:

It's not published anywhere properly. I would leave that as a possible improvement for the future.

Why don't you publish it? What's the point of having it if it's not usable?

We used it to migrate our test data. And we didn't want to throw it away because it might be useful but we didn't have the time to properly release it.

@anna-parker (Contributor) commented:

I am concerned that this is using Python to do extensive NDJSON parsing and processing - I currently do this in the ENA deposition, and due to the increased data volumes on production the cronjob keeps timing out (it has a full hour to process data): #5755.

We should test that this doesn't increase the time for SILO to update even further than it already has been increased; see #5758

Things to test:

  1. Switch to orjsonl for all NDJSON parsing, as this is faster: feat(silo-import): use orjsonl to increase speed of parsing ndjson an… #5764
  2. Test this on staging if we really want to use this instead of the Rust binary: chore(silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job #5733

@fengelniederhammer (Author) replied:

FWIW, the current state already reads and parses the full NDJSON; this PR only adds a transformation (which should be cheap) and writes the data back to a file (potentially expensive, but it shouldn't add much more than the reading, which we already do).

But I agree that it would be good to test this. 👍


Labels

preview Triggers a deployment to argocd


Development

Successfully merging this pull request may close these issues.

Update to SILO 0.8 and adapt input data for SILO

5 participants