feat(deployment): validate nextclade preprocessing configFile schema #5678

theosanderson · 2025-12-10T14:46:22Z

Adds a bit of the values.yaml JSON schema that checks that the config settings applied to the nextclade-preprocessing pipeline are valid. This is quite vibe-coded but the fact that it passes suggests it's mostly good. The alternative implementation would be to build this into the preprocessing pipeline itself. That would be more consistent with how custom user pipelines would need to work, but less consistent with our other validation. For now, merging this PR probably makes sense to me.

Claude's description

Add conditional JSON schema validation for the nextclade preprocessing pipeline configFile. Validation is only applied when the image contains "preprocessing-nextclade", allowing other pipelines to have flexible configuration.

For nextclade pipelines:

Add additionalProperties: false to configFile to reject unexpected keys and catch typos
Add additionalProperties: false to nextclade_sequence_and_datasets items
Add items.type: string to accepted_sort_matches and genes arrays
Add missing config properties: batch_size (with minimum: 1)
Add EMBL file generation options: create_embl_file, scientific_name, molecule_type (enum), topology (enum), db_name, and embl object
Require nextclade_sequence_and_datasets field

For other pipelines (e.g., preprocessing-dummy):

No strict validation applied, allowing flexible configuration
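The conditional behaviour described above can be sketched roughly as follows. This is an illustration only: the `image` and `configFile` names and the "preprocessing-nextclade" match come from the description, but the exact nesting and pattern in `values.schema.json` are assumptions, the image names are made up, and `check` is a hypothetical, deliberately tiny evaluator that understands only the keywords used here. It is not a real JSON Schema validator.

```python
import re

# Assumed shape of the conditional rule; the real values.schema.json may differ.
PIPELINE_RULE = {
    "if": {
        "properties": {"image": {"pattern": "preprocessing-nextclade"}},
        "required": ["image"],
    },
    "then": {
        "properties": {
            "configFile": {
                "properties": {
                    "batch_size": {},
                    "log_level": {},
                    "nextclade_sequence_and_datasets": {},
                },
                "required": ["nextclade_sequence_and_datasets"],
                "additionalProperties": False,
            }
        }
    },
}


def matches_if(instance: dict, condition: dict) -> bool:
    """Evaluate the 'if' part: required keys present and patterns found."""
    if any(key not in instance for key in condition.get("required", [])):
        return False
    for key, sub in condition.get("properties", {}).items():
        pattern = sub.get("pattern")
        # JSON Schema patterns are unanchored, which matches re.search.
        if pattern and key in instance and not re.search(pattern, str(instance[key])):
            return False
    return True


def check(instance: dict, rule: dict = PIPELINE_RULE) -> list[str]:
    """Return a list of problems; an empty list means the entry passes."""
    if not matches_if(instance, rule["if"]) or "configFile" not in instance:
        return []  # nothing strict to apply
    spec = rule["then"]["properties"]["configFile"]
    config = instance["configFile"]
    errors = [f"missing required key: {k}" for k in spec["required"] if k not in config]
    if spec.get("additionalProperties") is False:
        errors += [f"unexpected key: {k}" for k in config if k not in spec["properties"]]
    return errors


# Hypothetical pipeline entries; the image names are placeholders.
nextclade = {
    "image": "example/preprocessing-nextclade",
    "configFile": {"nextclade_sequence_and_datasets": [], "batchsize": 5},
}
dummy = {"image": "example/preprocessing-dummy", "configFile": {"anything": 1}}

print(check(nextclade))  # flags the misspelled 'batchsize'
print(check(dummy))      # no strict validation applied
```

The point of the sketch is that the strict rules live entirely in the `then` branch, so a pipeline whose image does not match never reaches them.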

🚀 Preview: Add preview label to enable

claude · 2025-12-10T14:59:11Z

Code Review for PR #5678

I've reviewed the changes to add conditional JSON schema validation for the nextclade preprocessing pipeline. Overall, this is a solid improvement that will help catch configuration errors early. Here are my findings:

✅ Strengths

Smart Conditional Validation: The use of allOf with if/then to conditionally apply strict validation only to nextclade pipelines is elegant and allows flexibility for other pipeline types.
Comprehensive Coverage: The schema now covers all config fields defined in preprocessing/nextclade/src/loculus_preprocessing/config.py, including the recently added EMBL generation features.
Strict Validation: Adding additionalProperties: false at multiple levels will effectively catch typos and invalid configuration keys.

🔍 Issues Found

1. Missing `log_level` enum values (Medium Priority)

The schema defines:

"log_level": {
  "enum": ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
}

However, in config.py:56, the default is "DEBUG" (string), and the code at lines 186-188 reads from the environment variable PREPROCESSING_LOG_LEVEL. The schema lists the standard Python logging level names, which is what Python's logging module expects. ✅ This appears correct.

2. Type Mismatch: `embl` field (Low Priority)

Location: values.schema.json:653-672

The schema defines embl.admin_level_properties as:

"admin_level_properties": {
  "type": "array",
  "items": {
    "type": "string"
  }
}

In config.py:24-26, the default is:

admin_level_properties: list[str] = dataclasses.field(
    default_factory=lambda: ["geoLocAdmin1", "geoLocAdmin2", "geoLocCity", "geoLocSite"]
)

This is correct, but the schema should add "description" for the array items similar to other array fields for consistency.

3. Missing Fields from `config.py` (High Priority)

The schema is missing several fields that exist in the Config dataclass:

processing_spec (config.py:72) - This is a dict[str, dict[str, Any]] and may be important for validation
Fields related to authentication (lines 62-66): backend_host, keycloak_host, keycloak_user, keycloak_password, keycloak_token_path, organism, pipeline_version

However, these might be intentionally omitted from the schema if they're set elsewhere (environment variables, CLI args). Could you clarify if these should be in configFile or are managed separately?
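One way to answer the question of which fields are missing is to compare the dataclass fields with the schema's property names mechanically. The sketch below assumes a cut-down, hypothetical `Config` (the defaults are placeholders) and a hypothetical set of schema property names; neither is copied from the repository.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass
class Config:
    """Hypothetical subset of the real Config dataclass, for illustration only."""

    log_level: str = "DEBUG"
    batch_size: int = 5  # placeholder default
    backend_host: str = ""
    keycloak_host: str = ""
    processing_spec: dict[str, dict[str, Any]] = dataclasses.field(default_factory=dict)


# Hypothetical subset of the property names declared in the schema.
SCHEMA_PROPERTIES = {"log_level", "batch_size", "create_embl_file"}


def compare(config_cls: type, schema_properties: set[str]) -> tuple[set[str], set[str]]:
    """Return (fields missing from the schema, schema keys unknown to the dataclass)."""
    field_names = {f.name for f in dataclasses.fields(config_cls)}
    return field_names - schema_properties, schema_properties - field_names


missing_in_schema, unknown_to_code = compare(Config, SCHEMA_PROPERTIES)
print(sorted(missing_in_schema))
print(sorted(unknown_to_code))
```

Fields that are deliberately set through environment variables or CLI arguments could be kept in an explicit allow-list so that such a comparison only reports unintended gaps.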

4. Inconsistent Enum String Values (High Priority)

Location: values.schema.json:607-610

The molecule_type enum uses:

"enum": ["genomic DNA", "genomic RNA", "viral cRNA"]

But in datatypes.py:218-221, the enum is:

class MoleculeType(StrEnum):
    GENOMIC_DNA = "genomic DNA"
    GENOMIC_RNA = "genomic RNA"
    VIRAL_CRNA = "viral cRNA"  # ⚠️ Note: "viral cRNA" not "VIRAL_CRNA"

✅ Schema correctly uses the string values, not the enum keys. Good!

📋 Recommendations

Must Fix:

Clarify scope of validation: Document which config fields are expected in configFile vs. set via environment variables or other means. The schema now requires nextclade_sequence_and_datasets but this field has a default value in the code.

Should Consider:

Add schema version: Consider adding a $schema or schemaVersion field to track schema evolution.
Test coverage: Add integration tests that verify:
- Valid nextclade configs pass validation
- Invalid configs (typos, wrong types) are properly rejected
- Other pipeline types (preprocessing-dummy) aren't affected by strict validation
Documentation: Update the preprocessing pipeline documentation to reference this schema validation, especially for users creating custom pipelines.
Validation error messages: Test what error messages users see when validation fails - JSON schema errors can be cryptic.

🔐 Security

No security concerns identified. The schema validation will actually help prevent injection of unexpected configuration values.

⚡ Performance

Schema validation adds negligible overhead during deployment/startup, which is appropriate.

✨ Style & Conventions

The PR follows the conventional commit format correctly: feat(deployment): validate nextclade preprocessing configFile schema. The detailed PR description is excellent and follows the project's guidelines for detailed summaries.

🧪 Testing

Missing: No test coverage mentioned for the schema validation itself. Consider adding tests in the deployment/kubernetes test suite or integration tests that verify:

Helm chart validation with valid configs
Helm chart rejection with invalid configs

Overall Assessment: This is a valuable addition that will catch configuration errors early. The conditional validation approach is well-designed. Main concerns are around ensuring all necessary fields are included and adding test coverage for the validation logic.

The "vibe-coded" nature mentioned in the PR description suggests this would benefit from systematic testing to ensure completeness.

🚀 Preview link looks good for manual validation!

Add conditional JSON schema validation for the nextclade preprocessing pipeline configFile. Validation is only applied when the image contains "preprocessing-nextclade", allowing other pipelines to have flexible configuration.

For nextclade pipelines:

- Add `additionalProperties: false` to configFile to reject unexpected keys and catch typos
- Add `additionalProperties: false` to nextclade_sequence_and_datasets items and embl object
- Add `items.type: string` to `accepted_sort_matches` and `genes` arrays
- Add config properties: `batch_size`, `log_level` (with enum validation)
- Add EMBL file generation options: `create_embl_file`, `scientific_name`, `molecule_type` (enum), `topology` (enum), `db_name`, and `embl` object
- Require `nextclade_sequence_and_datasets` field

For other pipelines (e.g., preprocessing-dummy):

- No strict validation applied, allowing flexible configuration

The nextclade-specific schema is defined as a separate definition (`nextcladeConfigFile`) for documentation, while the if/then conditional inlines the property names for helm compatibility.

theosanderson · 2025-12-10T15:23:41Z

kubernetes/loculus/values.schema.json

      "required": ["repository", "pullPolicy"],
      "additionalProperties": false
+    },
+    "nextcladeConfigFile": {


This whole section is only needed so that it shows up in the docs. Maybe this could be done another way to avoid code duplication, but I can't do that right now. (Or we could just ignore the docs on this aspect.)

claude · 2025-12-10T15:27:34Z

Code Review for PR #5678

I've reviewed this PR that adds conditional JSON schema validation for the nextclade preprocessing pipeline. Overall, this is a valuable improvement that will catch configuration errors early. Here are my findings:

✅ Strengths

Smart conditional validation: The use of allOf with if/then to conditionally apply strict validation only to nextclade pipelines is a clean approach that allows flexibility for custom pipelines like preprocessing-dummy.
Improved error detection: Adding additionalProperties: false at multiple levels will help catch typos and unexpected configuration keys, which is the main goal of this PR.
Comprehensive coverage: The schema now includes previously undocumented fields like batch_size, EMBL file generation options, and log_level.
Array item validation: Adding items.type: string to accepted_sort_matches and genes arrays provides proper type safety.

claude · 2025-12-10T15:27:41Z

🐛 Issues Found

1. Critical: Field name inconsistency (config.py:144)

There's a mismatch between the schema and the Python code:

Schema uses: "embl" (line 652 in new schema)
Python code expects: "embl_info" (preprocessing/nextclade/src/loculus_preprocessing/config.py:144)

The Python code has: if key == "embl_info" and isinstance(value, dict):

This means the embl configuration will never be properly loaded from the config file. The schema field name should be changed to "embl_info" to match the Python code, or the Python code should be updated to use "embl".

Recommendation: Change the schema to use "embl_info" since that's what the Python code expects, or coordinate a change in both places if you prefer "embl".
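The effect of the mismatch can be illustrated with a minimal sketch. Only the condition `key == "embl_info" and isinstance(value, dict)` is taken from the review; `EmblInfo` and `load_embl` are hypothetical stand-ins, and what the real loader does with other keys is not shown here.

```python
import dataclasses


@dataclasses.dataclass
class EmblInfo:
    """Hypothetical stand-in for the nested EMBL settings."""

    admin_level_properties: list[str] = dataclasses.field(default_factory=list)


def load_embl(raw: dict) -> EmblInfo:
    """Build EmblInfo from a raw config dict, using the condition quoted in the review."""
    embl = EmblInfo()
    for key, value in raw.items():
        if key == "embl_info" and isinstance(value, dict):
            embl = EmblInfo(**value)
    return embl


settings = {"admin_level_properties": ["geoLocAdmin1"]}

# A config that uses the schema's key name is silently left at the defaults.
print(load_embl({"embl": settings}))
# A config that uses the key the code checks for is picked up.
print(load_embl({"embl_info": settings}))
```

If the real loader behaves like this sketch, a config that passes the schema (key `embl`) would not reach the EMBL settings, while a config that works in the code (key `embl_info`) would be rejected by `additionalProperties: false`.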

claude · 2025-12-10T15:27:47Z

2. Missing validation for empty properties

In the then clause (lines 605-638), the properties are defined with empty objects {}. While this allows the properties to exist, it loses all the detailed validation from the nextcladeConfigFile definition. This means:

Type checking is lost
Enum validation is lost
Descriptions are lost

For example, alignment_requirement should still validate as an enum even when inside the conditional.

Recommendation: Consider using $ref to reference the definitions from nextcladeConfigFile, or duplicate the validation rules.
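The difference between an empty property schema and a referenced one can be sketched as below. The `definitions` location and the `$ref` path are assumptions about the schema layout, the helpers are hypothetical and handle only the keywords shown, and whether Helm resolves such references in this position has not been verified here.

```python
# Hypothetical excerpt; only batch_size with minimum: 1 is taken from the PR description.
SCHEMA = {
    "definitions": {
        "nextcladeConfigFile": {
            "properties": {"batch_size": {"type": "integer", "minimum": 1}}
        }
    }
}


def resolve(schema: dict, subschema: dict) -> dict:
    """Follow a local '#/a/b/c' pointer in '$ref', if present (no escape handling)."""
    ref = subschema.get("$ref")
    if ref is None:
        return subschema
    node = schema
    for part in ref.removeprefix("#/").split("/"):
        node = node[part]
    return node


def check_integer(value, subschema: dict) -> list[str]:
    """Apply only the 'type: integer' and 'minimum' keywords."""
    if subschema.get("type") == "integer" and not isinstance(value, int):
        return ["not an integer"]
    if isinstance(value, (int, float)) and "minimum" in subschema:
        if value < subschema["minimum"]:
            return [f"below minimum {subschema['minimum']}"]
    return []


empty = {}
referenced = {"$ref": "#/definitions/nextcladeConfigFile/properties/batch_size"}

print(check_integer(0, resolve(SCHEMA, empty)))       # an empty schema accepts anything
print(check_integer(0, resolve(SCHEMA, referenced)))  # the referenced rule rejects 0
```

Referencing the definition keeps a single source of truth for each property, at the cost of depending on reference resolution in whatever tool validates the values.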

claude · 2025-12-10T15:27:57Z

3. Incomplete EMBL validation (line 652)

The admin_level_properties field in the schema doesn't specify items.type: "string", even though it's an array of strings according to the Python code (config.py:24-26).

Fix needed: Add proper array item validation for admin_level_properties.

⚠️ Minor Concerns

Schema duplication: The field definitions now exist in two places - the nextcladeConfigFile definition and inline within the preprocessing pipeline schema. This creates maintenance burden.
Missing minimum constraint propagation: The batch_size field has minimum: 1 in the nextcladeConfigFile definition but this constraint isn't present in the conditional validation (line 626).
Documentation consistency: The description on line 587 references nextcladeConfigFile but the link format may not render correctly.

claude · 2025-12-10T15:28:03Z

🧪 Testing Recommendations

Test with a config that has embl vs embl_info to verify which one actually works
Test with a typo in a field name to ensure additionalProperties: false catches it
Test with preprocessing-dummy to ensure it's not affected by the strict validation
Test with invalid enum values (e.g., molecule_type: "invalid") to ensure validation works

📝 Summary

This PR provides valuable validation, but the embl vs embl_info naming inconsistency is a blocking issue that will prevent EMBL configurations from working correctly. Please fix this before merging. The empty property definitions in the conditional validation also reduce the effectiveness of the schema validation.

Verdict: Request changes - fix the embl/embl_info inconsistency and consider improving the conditional validation to preserve type/enum checks.

theosanderson force-pushed the claude/validate-nextclade-config-0162xFgMy6awa88njhjyiXU6 branch from 20e7c53 to a9b7c48 Compare December 10, 2025 14:48

theosanderson marked this pull request as ready for review December 10, 2025 14:56

theosanderson added the `preview` label (Triggers a deployment to argocd) Dec 10, 2025

theosanderson force-pushed the claude/validate-nextclade-config-0162xFgMy6awa88njhjyiXU6 branch 2 times, most recently from 608cf1d to 9b2a447 Compare December 10, 2025 15:07

theosanderson marked this pull request as draft December 10, 2025 15:10

theosanderson force-pushed the claude/validate-nextclade-config-0162xFgMy6awa88njhjyiXU6 branch from af90c4e to 1a9873b Compare December 10, 2025 15:12

theosanderson added 2 commits December 10, 2025 15:19

break

0b0f176

unbreak

ae2fe8a

theosanderson marked this pull request as ready for review December 10, 2025 15:24

theosanderson marked this pull request as draft December 10, 2025 15:39

anna-parker removed the `preview` label (Triggers a deployment to argocd) Dec 12, 2025
