
Conversation

@anna-parker
Contributor

@anna-parker anna-parker commented Dec 19, 2025

partially resolves #5770

This does not yet use pydantic-settings; in a later step we should also refactor to use it for creating the CLI args and reading in the environment variables.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?) -> confirmed the preview works; otherwise this should introduce no logical changes

🚀 Preview: https://prepro-pydantic.loculus.org

@anna-parker anna-parker changed the title use pydantic for the config feat(prepro): use pydantic to validate and parse the prepro config Dec 19, 2025
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Dec 19, 2025
@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 16:34
@claude

This comment was marked as outdated.

@anna-parker anna-parker marked this pull request as draft December 19, 2025 17:38
@anna-parker
Contributor Author

I fixed the issues raised by Claude - they were very good comments!

@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 19:57
@claude
Contributor

claude bot commented Dec 19, 2025

PR Review: feat(prepro): use pydantic to validate and parse the prepro config

This is a solid refactoring that improves type safety and validation. The migration from dataclasses to Pydantic models is well-executed overall. Here are my detailed findings:

✅ Strengths

  1. Better validation: Pydantic will automatically validate types when constructing models, catching configuration errors earlier
  2. Code reduction: -122 lines, +79 lines - the refactoring removes significant boilerplate
  3. Cleaner architecture: Moving ProcessingSpec to config.py and making it a Pydantic model is logical
  4. Smart property for multi_segment: Converting multi_segment from a field to a computed property (config.py:110-112) is elegant (see the sketch after this list)
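
For context, a minimal sketch of the pattern (class and field names here are illustrative, not the actual config definitions):

from pydantic import BaseModel

class ReferenceConfig(BaseModel):
    segments: list[str] = ["main"]

    @property
    def multi_segment(self) -> bool:
        # Derived on access, so it can never drift out of sync with the segment list
        return len(self.segments) > 1

# ReferenceConfig(segments=["L", "M", "S"]).multi_segment  ->  True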

🔴 Critical Issues

1. Environment variable type conversion missing (config.py:181-188)

env_overrides = {}
for key in Config.model_fields:
    env_var = f"PREPROCESSING_{key.upper()}"
    if env_var in os.environ:
        env_overrides[key] = os.environ[env_var]  # ⚠️ Always a string!

Problem: Environment variables are always strings, but Pydantic expects typed values. For example:

  • PREPROCESSING_BATCH_SIZE=100 will be "100" (string) instead of 100 (int)
  • PREPROCESSING_KEEP_TMP_DIR=true will be "true" (string) instead of True (bool)

Impact: This will likely cause Pydantic validation errors when env vars are used.

Fix: Either:

  1. Let Pydantic handle the conversion (it can parse strings to proper types):

    env_overrides[key] = os.environ[env_var]  # Pydantic will convert

    This should work but needs testing with different types (see the sketch after this list).

  2. Or use pydantic-settings (as mentioned in the PR description) which handles this properly.
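
For what it's worth, a self-contained sketch of option 1 (simplified fields, not the real Config) suggests Pydantic's default lax mode does coerce the strings:

import os
from pydantic import BaseModel

class Config(BaseModel):
    batch_size: int = 100
    keep_tmp_dir: bool = False

os.environ["PREPROCESSING_BATCH_SIZE"] = "100"
os.environ["PREPROCESSING_KEEP_TMP_DIR"] = "true"

env_overrides = {}
for key in Config.model_fields:
    env_var = f"PREPROCESSING_{key.upper()}"
    if env_var in os.environ:
        env_overrides[key] = os.environ[env_var]  # still raw strings here

config = Config(**env_overrides)
print(config)  # batch_size=100 keep_tmp_dir=True -- the strings were coerced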

2. Pydantic version is quite old

- pydantic=2.12.3

Current latest: Pydantic 2.10.x is latest stable (as of Jan 2025). Version 2.12.3 seems like a typo or version pinning issue.

Action needed: Verify the intended version. If 2.12.3 doesn't exist, this will fail during conda env creation.

⚠️ Security & Best Practices

3. Model validation could be more explicit

The model_validator at line 96-108 mutates the model during validation. While this works, consider:

  • Using separate validator methods for each concern (better error messages)
  • The mutation of ds.nextclade_dataset_server modifies nested objects which could be surprising

Suggestion: Consider using Pydantic field_validator for individual field validation where appropriate.
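
As an illustrative sketch only (the validator body is hypothetical, not the actual logic in config.py):

from pydantic import BaseModel, field_validator

class NextcladeDataset(BaseModel):
    nextclade_dataset_server: str

    @field_validator("nextclade_dataset_server")
    @classmethod
    def normalise_server_url(cls, value: str) -> str:
        # Per-field validation keeps the error attached to the offending field
        # instead of surfacing from a model-wide validator
        return value.rstrip("/")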

🟡 Code Quality Observations

4. Good: Default values for ProcessingSpec

class ProcessingSpec(BaseModel):
    inputs: FunctionInputs
    function: FunctionName = "identity"
    required: bool = False
    args: FunctionArgs | None = None

The defaults make the config YAML cleaner. Nice improvement!
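
Roughly, with simplified stand-ins for the input/arg types (not the real definitions), a spec entry that only lists its inputs now validates and picks up the defaults:

from pydantic import BaseModel

FunctionInputs = dict[str, str]  # simplified stand-in
FunctionArgs = dict[str, str]    # simplified stand-in
FunctionName = str               # simplified stand-in

class ProcessingSpec(BaseModel):
    inputs: FunctionInputs
    function: FunctionName = "identity"
    required: bool = False
    args: FunctionArgs | None = None

# e.g. a YAML entry that only specifies "inputs:" parses with everything else defaulted
spec = ProcessingSpec.model_validate({"inputs": {"input": "some_metadata_field"}})
print(spec.function, spec.required, spec.args)  # identity False None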

5. Good: Cleaner processing spec usage (prepro.py:280)

The old code had verbose dictionary unpacking; the new code is much cleaner with typed models.

6. Minor: Null check improvement (prepro.py:206)

args = dict(spec.args) if spec.args else {}

Good defensive programming for the args field.

📝 Test Coverage

The PR description says tests are included ✅

Verify: Ensure tests cover:

  • ✅ YAML parsing with the new Pydantic models
  • ⚠️ Environment variable overrides with different types (int, bool, str)
  • ⚠️ CLI argument parsing with the new structure
  • ✅ The multi_segment property computation
  • ✅ The alignment_requirement auto-setting logic

🎯 Recommendations

Before merging:

  1. Fix environment variable type conversion - Test with PREPROCESSING_BATCH_SIZE=100 to ensure conversion to int
  2. Verify pydantic version - Check if 2.12.3 is correct (seems too high)
  3. Add tests for env var type conversion - Especially for int, bool, and enum types

Future improvements (as mentioned in PR):
4. Migrate to pydantic-settings for proper environment variable handling (see the sketch below)
5. Consider using ConfigDict for model configuration if needed
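
For reference, a minimal sketch of the pydantic-settings direction, assuming the PREPROCESSING_ prefix is kept (fields illustrative):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Config(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="PREPROCESSING_")

    log_level: str = "DEBUG"
    keep_tmp_dir: bool = False
    batch_size: int = 100

# PREPROCESSING_BATCH_SIZE=10000 in the environment -> Config().batch_size == 10000 (an int)
config = Config()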

Overall Assessment

This is a good refactoring that moves in the right direction. The main concern is the environment variable type handling which could break existing deployments if env vars are used. The code quality improvements are solid.

Recommendation: Request changes to fix the environment variable type conversion issue before merging.


Files reviewed:

  • preprocessing/nextclade/environment.yml ✅
  • preprocessing/nextclade/src/loculus_preprocessing/config.py ⚠️
  • preprocessing/nextclade/src/loculus_preprocessing/datatypes.py ✅
  • preprocessing/nextclade/src/loculus_preprocessing/prepro.py ✅
  • preprocessing/nextclade/tests/multi_pathogen_config.yaml ✅

@anna-parker
Contributor Author

I tested the issue Claude raised and can confirm it is not actually a problem:

nextclade % PREPROCESSING_BATCH_SIZE=10000 prepro
INFO:loculus_preprocessing.__main__:Using config: log_level='DEBUG' keep_tmp_dir=False batch_size=10000 pipeline_version=1
