
Conversation

@fengelniederhammer (Contributor) commented Dec 11, 2025

resolves #4868

An alternative approach to #4904. Instead of changing the format of the /get-released-data endpoint, we transform the data in the silo-import to the new input format that SILO requires as of 0.8.

The transformation and its test were copied (via an LLM) from https://github.com/GenSpectrum/LAPIS-SILO/blob/bb4bd655864ad41819873f0dd3102e9bfc6ddb61/tools/legacyNdjsonTransformer/main.rs.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)
    • the preview contains ingested sequences

🚀 Preview: https://4868-update-to-silo-08-an.loculus.org

@fengelniederhammer added the preview (Triggers a deployment to argocd) label Dec 11, 2025
@fengelniederhammer force-pushed the 4868-update-to-silo-08-and-adapt-input-data-for-silo branch from 084e057 to 14145cc on December 11, 2025 09:23
@fengelniederhammer changed the base branch from main to format December 11, 2025 09:34
@fengelniederhammer linked an issue Dec 11, 2025 that may be closed by this pull request
@fengelniederhammer marked this pull request as ready for review December 11, 2025 09:35
Copilot AI left a comment

Pull request overview

This PR updates the SILO version to 0.8.6 and introduces a data transformation layer in the silo-import process. The key changes involve transforming downloaded NDJSON records from a nested structure (with separate metadata, sequences, and insertions fields) into a flattened format before passing them to SILO.
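
As a rough illustration, the flattening can be pictured like this (a minimal sketch using the field names discussed in this review; illustrative only, not the PR's actual code):

def flatten_record_sketch(record: dict) -> dict:
    """Hypothetical sketch: lift metadata to the top level and merge each
    segment's sequence with its insertions (nucleotide segments only)."""
    out = dict(record.get("metadata") or {})
    sequences = record.get("alignedNucleotideSequences") or {}
    insertions = record.get("nucleotideInsertions") or {}
    for segment, sequence in sequences.items():
        out[segment] = {
            "sequence": sequence,
            "insertions": insertions.get(segment, []),
        }
    return out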

Key changes:

  • Implements transform_record() function to flatten and combine sequence data with insertions
  • Updates the import pipeline to perform transformation during the download/decompression phase
  • Adjusts test expectations to match the new transformed data format

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • silo-import/src/silo_import/decompressor.py: Adds record transformation logic and integrates it into the analysis pipeline, creating a separate transformed output file
  • silo-import/src/silo_import/download_manager.py: Updates download flow to use the new transformation function and handle transformed data paths
  • silo-import/src/silo_import/constants.py: Adds constants for raw and transformed data filenames
  • silo-import/tests/test_decompressor.py: Adds comprehensive tests for the new transform_record function
  • silo-import/tests/test_runner.py: Updates test data structure to match transformed format
  • silo-import/tests/test_integration.py: Updates integration tests to verify transformed output format
  • silo-import/tests/helpers.py: Improves NDJSON reading to use streaming decompression
  • kubernetes/loculus/values.yaml: Updates SILO image version to 0.8.6


raise RuntimeError(msg) from exc

transformed_json = json.dumps(transformed, separators=(",", ":"))
writer.write(f"{transformed_json}\n".encode())
Copilot AI commented Dec 11, 2025

Encoding each line individually creates multiple small byte objects. Consider encoding the entire string (including newline) in one operation, or accumulate lines in a buffer and write in batches for better performance when processing large files.
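
A minimal sketch of the batching alternative, assuming writer is the zstandard stream writer from the snippet above (the class, method names, and batch size are illustrative):

import json

class BatchedNdjsonWriter:
    """Illustrative: buffer serialized lines and hand them to the compressor in batches."""

    def __init__(self, writer, batch_size: int = 10_000):
        self.writer = writer
        self.batch_size = batch_size
        self.pending: list[str] = []

    def write_record(self, transformed: dict) -> None:
        self.pending.append(json.dumps(transformed, separators=(",", ":")) + "\n")
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Must also be called once after the last record so the tail of the buffer is written.
        if self.pending:
            self.writer.write("".join(self.pending).encode())
            self.pending.clear()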

claude bot (Contributor) commented Dec 11, 2025

PR Review: Update to SILO 0.8.6 with Data Transformation

Summary

This PR successfully updates the LAPIS-SILO version from 0.7.10.2 to 0.8.6 and implements the required data transformation to adapt to SILO 0.8's breaking changes. The implementation introduces a single-pass decompression-transformation-compression pipeline that efficiently transforms NDJSON records into the new format expected by SILO 0.8.


✅ Strengths

  1. Efficient Single-Pass Architecture: The analyze_and_transform_ndjson function cleverly combines decompression, analysis, and transformation in a single pass, minimizing I/O overhead and memory usage.

  2. Comprehensive Test Coverage:

    • New unit tests for transform_record covering both null values and populated data
    • Updated integration tests to verify transformed output format
    • All tests properly adapted to expect the new transformed format
  3. Clean Separation of Concerns: The transform_record function is well-isolated, making it easy to test and understand the transformation logic.

  4. Backward Compatibility: The PR maintains the old DATA_FILENAME constant while introducing new ones (RAW_DATA_FILENAME, TRANSFORMED_DATA_FILENAME), showing careful consideration for the codebase structure.

  5. Error Handling: Transformation errors are caught and wrapped with helpful context, including the line number where the error occurred (decompressor.py:132).


🔍 Issues & Recommendations

Critical Issues

  1. Resource Cleanup Risk in analyze_and_transform_ndjson (decompressor.py:102-107)

    The function opens multiple resources using a grouped context manager:

    with (
        path.open("rb") as compressed_input,
        decompressor.stream_reader(compressed_input) as reader,
        transformed_path.open("wb") as compressed_output,
        compressor.stream_writer(compressed_output) as writer,
    ):

    Issue: If the transformation fails mid-stream, the output file will be left in a partially written state. The except zstandard.ZstdError block on line 137 doesn't clean up the incomplete transformed_path file.

    Recommendation: Add cleanup in the exception handler:

    except zstandard.ZstdError as exc:
        msg = f"Failed to compress/decompress {path}: {exc}"
        if transformed_path.exists():
            transformed_path.unlink()
        raise RuntimeError(msg) from exc

    Also consider adding a similar cleanup for the transformation exception block (lines 129-133).

  2. Broad Exception Handling (decompressor.py:129)

    except Exception as exc:
        msg = f"Failed to transform record at line {record_count}: {exc}"
        raise RuntimeError(msg) from exc

    Issue: Catching bare Exception is too broad. (In Python 3, KeyboardInterrupt and SystemExit derive from BaseException and so are not caught here, but such a wide handler still swallows unrelated bugs and makes them harder to diagnose.)

    Recommendation: Be more specific about what exceptions you expect. Based on transform_record, you'd likely get TypeError, KeyError, or AttributeError. Consider:

    except (TypeError, KeyError, AttributeError, ValueError) as exc:

Medium Priority Issues

  1. Type Safety in transform_record (decompressor.py:27-73)

    The function performs runtime type checks (isinstance(metadata, dict)) but the type hints suggest record: dict is sufficient. The actual structure is more specific - it's a dict with known keys like "metadata", "alignedNucleotideSequences", etc.

    Recommendation: Consider using TypedDict for better type safety:

    from typing import Any, NotRequired, TypedDict
    
    class InputRecord(TypedDict):
        metadata: NotRequired[dict[str, Any]]
        alignedNucleotideSequences: NotRequired[dict[str, str | None]]
        # ... other fields
    
    def transform_record(record: InputRecord) -> dict:

    This would catch issues at development time rather than runtime.

  2. Potential Data Loss in Hash Comparison (download_manager.py:187)

    The code now compares hashes of transformed_path instead of the raw downloaded data. If the transformation logic has a bug that produces different output for the same input across runs, the hash check won't catch duplicate downloads of the same raw data.

    Recommendation: Consider comparing the hash of the raw data file instead of the transformed one, or log both hashes for debugging purposes.
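
    One possible way to log both hashes (a hypothetical helper; function and logger names are illustrative, not from the PR):

    import hashlib
    import logging
    from pathlib import Path

    logger = logging.getLogger(__name__)

    def file_sha256(path: Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def log_download_hashes(raw_path: Path, transformed_path: Path) -> None:
        logger.info(
            "raw data hash=%s, transformed data hash=%s",
            file_sha256(raw_path),
            file_sha256(transformed_path),
        )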

  3. Missing Edge Case Handling in transform_record

    The function doesn't handle cases where:

    • record itself is not a dict (though it checks isinstance for nested fields)
    • Keys have unexpected types (e.g., nucleotideInsertions.get(segment_key, []) assumes a list)

    Example problematic input:

    {
        "metadata": {"key": "id1"},
        "alignedNucleotideSequences": {"segment1": "ATCG"},
        "nucleotideInsertions": {"segment1": "not_a_list"}  # Bug\!
    }

    This would silently produce incorrect output rather than failing loudly.

    Recommendation: Add validation or use .get(segment_key, []) with a type check.
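
    For example, a small guard along these lines would fail loudly instead of silently producing wrong output (a sketch; names follow this review, not the PR's code):

    def get_insertions(insertions_by_segment: dict, segment_key: str) -> list:
        """Return the insertions list for a segment, rejecting unexpected types."""
        values = insertions_by_segment.get(segment_key, [])
        if not isinstance(values, list):
            msg = (
                f"Expected a list of insertions for segment {segment_key!r}, "
                f"got {type(values).__name__}"
            )
            raise TypeError(msg)
        return values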

Minor Issues

  1. Inconsistent Error Message (download_manager.py:138)

    The error message says "Failed to compress/decompress" but the original message in decompressor.py is more specific. Consider being more precise about which operation failed.

  2. Test Helper Modification (tests/helpers.py:23)

    The change from decompressor.decompress(handle.read()) to using stream_reader is good for consistency, but consider documenting why this change was made (probably to handle large files better).
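
    For reference, a streaming read along these lines looks roughly like this (a sketch, not the helper's exact code):

    import io
    import json
    import zstandard

    def read_ndjson(path):
        # Decompress and parse the NDJSON line by line instead of reading the
        # whole compressed file into memory first.
        decompressor = zstandard.ZstdDecompressor()
        with open(path, "rb") as handle, decompressor.stream_reader(handle) as reader:
            text = io.TextIOWrapper(reader, encoding="utf-8")
            for line in text:
                if line.strip():
                    yield json.loads(line)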

  3. Unused Constant (constants.py:9)

    DATA_FILENAME is now unused after the refactoring. Consider removing it in a follow-up PR or documenting why it's kept.


🔒 Security Considerations

Good: No obvious security vulnerabilities introduced.

Notes:

  • The transformation doesn't add sanitization, but this is acceptable since the data comes from an internal backend
  • Compressed data is handled safely through the zstandard library
  • No user-provided paths or code execution

⚡ Performance Considerations

Excellent: The single-pass design is optimal for performance.

Additional observations:

  • Memory efficiency: Streaming approach processes one line at a time rather than loading entire files
  • I/O efficiency: Single read and single write operation
  • CPU trade-off: Simultaneous decompression and compression may use more CPU, but this is acceptable given the I/O savings

Potential optimization: Consider using zstandard.ZstdCompressor(level=...) to tune compression level based on your priorities (speed vs. size).
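
For illustration (level 3 is the zstd default; higher levels trade CPU for smaller output — the file name and payload here are placeholders):

import zstandard

# Hypothetical: pick the compression level explicitly instead of relying on the default.
compressor = zstandard.ZstdCompressor(level=3)
with open("transformed.ndjson.zst", "wb") as compressed_output:
    with compressor.stream_writer(compressed_output) as writer:
        writer.write(b'{"example": true}\n')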


📋 Testing

Coverage: ✅ Good

  • Unit tests for transformation logic
  • Integration tests updated to verify end-to-end behavior
  • Edge cases covered (null values, populated data)

Missing test cases:

  1. Malformed insertion data: What happens if nucleotideInsertions[segment_key] is not a list?
  2. Large records: The streaming approach should handle very large records, but no test verifies this
  3. Partial write failure: Test behavior when disk fills up mid-write
  4. Non-dict metadata: Test with "metadata": null or "metadata": "string"
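
A sketch of how the first missing case could be covered, assuming the validation recommended earlier is added so that a malformed insertions value raises an error (the import path and expected exception are assumptions, not taken from the PR):

import pytest

from silo_import.decompressor import transform_record  # import path assumed

def test_transform_record_rejects_non_list_insertions():
    record = {
        "metadata": {"key": "id1"},
        "alignedNucleotideSequences": {"segment1": "ATCG"},
        "nucleotideInsertions": {"segment1": "not_a_list"},
    }
    # Only holds once the recommended type check is implemented.
    with pytest.raises(TypeError):
        transform_record(record)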

📝 Documentation

Missing:

  1. Module docstring in decompressor.py:1 could mention the SILO 0.8 transformation requirement
  2. transform_record would benefit from examples in the docstring showing the before/after format
  3. No comment explaining why both raw and transformed files are kept (only transformed is used)

Suggestion:

def transform_record(record: dict) -> dict:
    """
    Transform a SILO 0.7 format record to SILO 0.8 format.
    
    SILO 0.8 expects a flattened structure where sequences and their
    insertions are combined into objects with 'sequence' and 'insertions' fields.
    
    Example:
        Input:
            {
                "metadata": {"accession": "SEQ1"},
                "alignedNucleotideSequences": {"main": "ATCG"},
                "nucleotideInsertions": {"main": ["123:A"]}
            }
        Output:
            {
                "accession": "SEQ1",
                "main": {
                    "sequence": "ATCG",
                    "insertions": ["123:A"]
                }
            }
    """

🎯 Alignment with Project Conventions

✅ The PR follows the project's conventions:

  • Proper conventional commit style in title (though could be more specific: "feat(silo-import): update to SILO 0.8.6 with data transformation")
  • Detailed PR description
  • Tests included and updated
  • Manual testing documented

🏁 Verdict

Recommendation: ✅ Approve with minor changes

This is a well-implemented PR that successfully adapts the codebase to SILO 0.8's requirements. The core transformation logic is sound and the single-pass architecture is elegant. The critical issues around resource cleanup and exception handling should be addressed before merging, but they're straightforward fixes.

Priority fixes before merge:

  1. Add cleanup of partial transformed_path file on errors
  2. Narrow exception handling in transformation block
  3. Add validation for edge cases in transform_record

Nice-to-haves for follow-up:

  • Add TypedDict for better type safety
  • Enhance documentation with examples
  • Add test cases for malformed data
  • Consider logging both raw and transformed file hashes

Great work overall! The performance improvement from SILO 0.8 (90 min → 20 min for 9.3M sequences) will be a huge win for users. 🚀

@fengelniederhammer changed the title from "4868 silo 0.8.6" to "chore(kubernetes, silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job" Dec 11, 2025
Base automatically changed from format to main December 11, 2025 10:45
@fengelniederhammer force-pushed the 4868-update-to-silo-08-and-adapt-input-data-for-silo branch from 701330c to 39f4df1 on December 11, 2025 10:46
@theosanderson (Member) left a comment

LGTM - one issue raised by Cornelius

@fengelniederhammer (Author) commented:

I'll merge this after Christmas so that I don't risk breaking Loculus main over the holidays. If you want to merge it earlier, feel free to go ahead.


def transform_record(record: dict) -> dict:
    """
    Transform a single JSON record.
Contributor commented:

We should link to the SILO 0.8 data format requirements/readme here.

@anna-parker (Contributor) commented:

A thought raised by @corneliusroemer: why don't we just call the Rust binary (it is probably faster)?

@fengelniederhammer (Author) replied:

A thought raised by @corneliusroemer: why don't we just call the Rust binary (it is probably faster)?

It's not published anywhere properly. I would leave that as a possible improvement for the future.

@anna-parker (Contributor) replied:

It's not published anywhere properly. I would leave that as a possible improvement for the future.

I've added it here: #5733

@corneliusroemer (Contributor) commented Dec 17, 2025

It's not published anywhere properly. I would leave that as a possible improvement for the future.

Why don't you publish it? What's the point of having it if it's not usable?

We can also just compile it in the Dockerfile... Cargo is pretty easy to use.

SPECIAL_ETAG_NONE = "0"

# File names for downloaded data
DATA_FILENAME = "data.ndjson.zst"
Contributor commented:

I think there might be some confusion between DATA_FILENAME and RAW_DATA_FILENAME? I would think DATA_FILENAME would be replaced by RAW_DATA_FILENAME?

@fengelniederhammer (Author) replied:

What confusion? Why would it be replaced?

Contributor replied:

I mean this script should just transform DATA_FILENAME into TRANSFORMED_DATA_FILENAME; I don't think we need an intermediate file?

@fengelniederhammer (Author) replied Dec 17, 2025:

Previously there were two files:

  • /<dataVersion>/data.ndjson.zst
  • data.ndjson.zst (on top level)

Now there are three files:

  • in the data version dir: the raw downloaded file
  • in the data version dir: the transformed file
  • on top level: the transformed file that has been copied here

I found it too confusing that the original (untransformed) file has the same name as the output file because it hides that there has been a transformation in between. That's why I named the three files accordingly.

@fengelniederhammer (Author) commented:

It's not published anywhere properly. I would leave that as a possible improvement for the future.

Why don't you publish it? What's the point of having it if it's not usable?

We used it to migrate our test data. And we didn't want to throw it away because it might be useful but we didn't have the time to properly release it.

@anna-parker (Contributor) commented:

I am concerned that this is using Python to do extensive NDJSON parsing and processing - I currently do this in the ENA deposition, and due to the increased data volumes on production the cronjob keeps timing out (it has a full hour to process data): #5755.

We should test that this doesn't increase the time for SILO to update even further than it already has been increased; see #5758

Things to test:

  1. Switch to orjsonl for all NDJSON parsing, as this is faster: feat(silo-import): use orjsonl to increase speed of parsing ndjson an… #5764
  2. Test this on staging if we really want to use this instead of the Rust binary: chore(silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job #5733

@fengelniederhammer (Author) replied:

FWIW, the current state already reads and parses the full NDJSON; this PR only adds a transformation (which should be cheap) and writes the data back to a file (potentially expensive, but it shouldn't add much more than the reading, which we already do).

But I agree that it would be good to test this. 👍


Labels

preview Triggers a deployment to argocd


Development

Successfully merging this pull request may close these issues.

Update to SILO 0.8 and adapt input data for SILO

5 participants