-
Notifications
You must be signed in to change notification settings - Fork 9
feat(deployment): validate nextclade preprocessing configFile schema #5678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(deployment): validate nextclade preprocessing configFile schema #5678
Conversation
20e7c53 to
a9b7c48
Compare
Code Review for PR #5678I've reviewed the changes to add conditional JSON schema validation for the nextclade preprocessing pipeline. Overall, this is a solid improvement that will help catch configuration errors early. Here are my findings: ✅ Strengths
🔍 Issues Found1. Missing
|
608cf1d to
9b2a447
Compare
Add conditional JSON schema validation for the nextclade preprocessing pipeline configFile. Validation is only applied when the image contains "preprocessing-nextclade", allowing other pipelines to have flexible configuration. For nextclade pipelines: - Add `additionalProperties: false` to configFile to reject unexpected keys and catch typos - Add `additionalProperties: false` to nextclade_sequence_and_datasets items and embl object - Add `items.type: string` to `accepted_sort_matches` and `genes` arrays - Add config properties: `batch_size`, `log_level` (with enum validation) - Add EMBL file generation options: `create_embl_file`, `scientific_name`, `molecule_type` (enum), `topology` (enum), `db_name`, and `embl` object - Require `nextclade_sequence_and_datasets` field For other pipelines (e.g., preprocessing-dummy): - No strict validation applied, allowing flexible configuration The nextclade-specific schema is defined as a separate definition (`nextcladeConfigFile`) for documentation, while the if/then conditional inlines the property names for helm compatibility.
af90c4e to
1a9873b
Compare
| "required": ["repository", "pullPolicy"], | ||
| "additionalProperties": false | ||
| }, | ||
| "nextcladeConfigFile": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole section is only needed so it shows up in the docs.. maybe this could be done another way to avoid code duplication but I can't right now. (Or we could just ignore the docs on this aspect)
Code Review for PR #5678I've reviewed this PR that adds conditional JSON schema validation for the nextclade preprocessing pipeline. Overall, this is a valuable improvement that will catch configuration errors early. Here are my findings: ✅ Strengths
|
🐛 Issues Found1. Critical: Field name inconsistency (config.py:144)There's a mismatch between the schema and the Python code:
The Python code has: if key == "embl_info" and isinstance(value, dict): This means the embl configuration will never be properly loaded from the config file. The schema field name should be changed to "embl_info" to match the Python code, or the Python code should be updated to use "embl". Recommendation: Change the schema to use "embl_info" since that's what the Python code expects, or coordinate a change in both places if you prefer "embl". |
2. Missing validation for empty propertiesIn the then clause (lines 605-638), the properties are defined with empty objects {}. While this allows the properties to exist, it loses all the detailed validation from the nextcladeConfigFile definition. This means:
For example, alignment_requirement should still validate as an enum even when inside the conditional. Recommendation: Consider using $ref to reference the definitions from nextcladeConfigFile, or duplicate the validation rules. |
3. Incomplete EMBL validation (line 652)The admin_level_properties field in the schema doesn't specify items.type: "string", even though it's an array of strings according to the Python code (config.py:24-26). Fix needed: Add proper array item validation for admin_level_properties.
|
🧪 Testing Recommendations
📝 SummaryThis PR provides valuable validation, but the embl vs embl_info naming inconsistency is a blocking issue that will prevent EMBL configurations from working correctly. Please fix this before merging. The empty property definitions in the conditional validation also reduce the effectiveness of the schema validation. Verdict: Request changes - fix the embl/embl_info inconsistency and consider improving the conditional validation to preserve type/enum checks. |
Adds a bit of the values.yaml JSON schema that checks that the config settings applied to the nextclade-preprocessing pipeline are valid. This is quite vibe-coded but the fact that it passes suggests it's mostly good. The alternative implementation would be to build this into the preprocessing pipeline itself. That would be more consistent with how custom user pipelines would need to work, but less consistent with our other validation. For now, merging this PR probably makes sense to me.
Claude's description
Add conditional JSON schema validation for the nextclade preprocessing pipeline configFile. Validation is only applied when the image contains "preprocessing-nextclade", allowing other pipelines to have flexible configuration.
For nextclade pipelines:
additionalProperties: falseto configFile to reject unexpected keys and catch typosadditionalProperties: falseto nextclade_sequence_and_datasets itemsitems.type: stringtoaccepted_sort_matchesandgenesarraysbatch_size(with minimum: 1)create_embl_file,scientific_name,molecule_type(enum),topology(enum),db_name, andemblobjectnextclade_sequence_and_datasetsfieldFor other pipelines (e.g., preprocessing-dummy):
🚀 Preview: Add
previewlabel to enable