chore(kubernetes, silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job #5683
Conversation
Pull request overview
This PR updates the SILO version to 0.8.6 and introduces a data transformation layer in the silo-import process. The key changes involve transforming downloaded NDJSON records from a nested structure (with separate metadata, sequences, and insertions fields) into a flattened format before passing them to SILO.
Key changes:

- Implements a `transform_record()` function to flatten and combine sequence data with insertions
- Updates the import pipeline to perform the transformation during the download/decompression phase (see the sketch after this list)
- Adjusts test expectations to match the new transformed data format
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| silo-import/src/silo_import/decompressor.py | Adds record transformation logic and integrates it into the analysis pipeline, creating a separate transformed output file |
| silo-import/src/silo_import/download_manager.py | Updates download flow to use the new transformation function and handle transformed data paths |
| silo-import/src/silo_import/constants.py | Adds constants for raw and transformed data filenames |
| silo-import/tests/test_decompressor.py | Adds comprehensive tests for the new transform_record function |
| silo-import/tests/test_runner.py | Updates test data structure to match transformed format |
| silo-import/tests/test_integration.py | Updates integration tests to verify transformed output format |
| silo-import/tests/helpers.py | Improves NDJSON reading to use streaming decompression |
| kubernetes/loculus/values.yaml | Updates SILO image version to 0.8.6 |
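For the helpers.py change listed above, streaming NDJSON reading could look roughly like this (a sketch assuming the `zstandard` package; the actual helper may differ):

```python
import io
import json

import zstandard


def read_ndjson_zst(path: str) -> list[dict]:
    """Stream-decompress a .ndjson.zst file and parse it line by line,
    without materializing the whole decompressed payload in memory."""
    records = []
    with open(path, "rb") as fh:
        with zstandard.ZstdDecompressor().stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return records
```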
```python
raise RuntimeError(msg) from exc

transformed_json = json.dumps(transformed, separators=(",", ":"))
writer.write(f"{transformed_json}\n".encode())
```
Copilot AI commented on Dec 11, 2025:
Encoding each line individually creates multiple small byte objects. Consider encoding the entire string (including newline) in one operation, or accumulate lines in a buffer and write in batches for better performance when processing large files.
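A sketch of the buffering variant Copilot suggests (illustrative only; the batch size, the `records` iterable, and the `writer` object are assumptions, not the PR's code):

```python
import json
from typing import IO, Iterable

from silo_import.decompressor import transform_record  # the PR's function


def write_transformed(records: Iterable[dict], writer: IO[bytes],
                      batch_size: int = 10_000) -> None:
    """Serialize transformed records and write them in batches,
    avoiding one small bytes object per line."""
    buffer: list[str] = []
    for record in records:
        buffer.append(json.dumps(transform_record(record), separators=(",", ":")))
        if len(buffer) >= batch_size:
            writer.write(("\n".join(buffer) + "\n").encode())
            buffer.clear()
    if buffer:  # flush the remainder
        writer.write(("\n".join(buffer) + "\n").encode())
```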
PR Review: Update to SILO 0.8.6 with Data Transformation

Summary

This PR successfully updates the LAPIS-SILO version from 0.7.10.2 to 0.8.6 and implements the required data transformation to adapt to SILO 0.8's breaking changes. The implementation introduces a single-pass decompression-transformation-compression pipeline that efficiently transforms NDJSON records into the new format expected by SILO 0.8.

✅ Strengths
🔍 Issues & Recommendations

Critical Issues
Medium Priority Issues
Minor Issues
🔒 Security Considerations

✅ Good: No obvious security vulnerabilities introduced.

Notes:
⚡ Performance Considerations

✅ Excellent: The single-pass design is optimal for performance.

Additional observations:
Potential optimization: Consider using …

📋 Testing

Coverage: ✅ Good
Missing test cases:
📝 Documentation

Missing:
Suggestion:

```python
def transform_record(record: dict) -> dict:
    """
    Transform a SILO 0.7 format record to SILO 0.8 format.

    SILO 0.8 expects a flattened structure where sequences and their
    insertions are combined into objects with 'sequence' and 'insertions' fields.

    Example:
        Input:
        {
            "metadata": {"accession": "SEQ1"},
            "alignedNucleotideSequences": {"main": "ATCG"},
            "nucleotideInsertions": {"main": ["123:A"]}
        }
        Output:
        {
            "accession": "SEQ1",
            "main": {
                "sequence": "ATCG",
                "insertions": ["123:A"]
            }
        }
    """
```
🎯 Alignment with Project Conventions

✅ The PR follows the project's conventions:

🏁 Verdict

Recommendation: ✅ Approve with minor changes

This is a well-implemented PR that successfully adapts the codebase to SILO 0.8's requirements. The core transformation logic is sound and the single-pass architecture is elegant. The critical issues around resource cleanup and exception handling should be addressed before merging, but they're straightforward fixes.

Priority fixes before merge:
Nice-to-haves for follow-up:
Great work overall! The performance improvement from SILO 0.8 (90 min → 20 min for 9.3M sequences) will be a huge win for users. 🚀
theosanderson left a comment:
LGTM - one issue raised by Cornelius
I'll merge this after Christmas so that I don't risk breaking Loculus main over the holidays. If you want to merge it earlier, feel free to go ahead.
```python
def transform_record(record: dict) -> dict:
    """
    Transform a single JSON record.
```
We should link to the SILO 0.8 data format requirements/readme here.
Thought raised by @corneliusroemer: why don't we just call the Rust binary (it is probably faster)?
It's not published anywhere properly. I would leave that as a possible improvement for the future.
I've added it here: #5733
Why don't you publish it? What's the point of having it if it's not usable? We can also just compile it in the Dockerfile... Cargo is pretty easy to use.
```python
SPECIAL_ETAG_NONE = "0"

# File names for downloaded data
DATA_FILENAME = "data.ndjson.zst"
```
I think there might be some confusion between DATA_FILENAME and RAW_DATA_FILENAME? I would think DATA_FILENAME would be replaced by RAW_DATA_FILENAME?
What confusion? Why would it be replaced?
I mean this script should just transform DATA_FILENAME into TRANSFORMED_DATA_FILENAME; I don't think we need an intermediate file?
Previously there were two files:

- /<dataVersion>/data.ndjson.zst
- data.ndjson.zst (on top level)

Now there are three files:

- in the data version dir: the raw downloaded file
- in the data version dir: the transformed file
- on top level: the transformed file that has been copied there

I found it too confusing that the original (untransformed) file has the same name as the output file, because it hides that there has been a transformation in between. That's why I named the three files accordingly.
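In code terms, the layout described above might map to paths like these (constant names are taken from constants.py and this thread; the base directory, data version, and the raw/transformed filename values are illustrative assumptions):

```python
from pathlib import Path

# Names from constants.py; the values below are assumptions for illustration.
DATA_FILENAME = "data.ndjson.zst"
RAW_DATA_FILENAME = "raw_data.ndjson.zst"
TRANSFORMED_DATA_FILENAME = "transformed_data.ndjson.zst"

base_dir = Path("/data")     # illustrative import working directory
data_version = "1700000000"  # illustrative data version

raw_file = base_dir / data_version / RAW_DATA_FILENAME                   # raw download
transformed_file = base_dir / data_version / TRANSFORMED_DATA_FILENAME  # transformed output
top_level_file = base_dir / TRANSFORMED_DATA_FILENAME                   # top-level copy
```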
We used it to migrate our test data. And we didn't want to throw it away because it might be useful, but we didn't have the time to properly release it.
I am concerned that this is using Python to do extensive NDJSON parsing and processing. I currently do this in the ENA deposition, and due to the increased data volumes on production the cronjob keeps timing out (it has a full hour to process data): #5755. We should test that this doesn't increase the time for SILO to update even further than it already has been, see #5758. Things to test:
FWIW, the current state already reads and parses the full NDJSON; this PR only adds a transformation (which should be cheap) and writing it back to a file (potentially expensive, but it shouldn't add much more than the reading we already do). But I agree that it would be good to test this. 👍
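One way to sanity-check that before merging (a hypothetical micro-benchmark, not part of the PR; it assumes `transform_record` is importable from the PR's module):

```python
import json
import time

from silo_import.decompressor import transform_record  # the PR's function


def benchmark(lines: list[str]) -> None:
    """Compare parse-only time against parse + transform + re-serialize
    on a sample of NDJSON lines."""
    start = time.perf_counter()
    records = [json.loads(line) for line in lines]
    parse_s = time.perf_counter() - start

    start = time.perf_counter()
    for record in records:
        json.dumps(transform_record(record), separators=(",", ":"))
    transform_s = time.perf_counter() - start

    print(f"parse only: {parse_s:.2f}s, transform + serialize: {transform_s:.2f}s")
```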
resolves #4868
An alternative approach to #4904. Instead of changing the format of the /get-released-data endpoint, we transform the data in the silo-import to the new input format that SILO requires as of 0.8. The transformation and its test were copied (via an LLM) from https://github.com/GenSpectrum/LAPIS-SILO/blob/bb4bd655864ad41819873f0dd3102e9bfc6ddb61/tools/legacyNdjsonTransformer/main.rs.
Screenshot
PR Checklist
🚀 Preview: https://4868-update-to-silo-08-an.loculus.org