
Conversation

@anna-parker
Contributor

@anna-parker anna-parker commented Dec 19, 2025

partially resolves #5770

This does not yet use pydantic-settings; in a later step we should also refactor to use it for creating the CLI args and reading in the environment variables.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?) -> confirmed the preview works; otherwise this should introduce no logical changes

🚀 Preview: https://prepro-pydantic.loculus.org

@anna-parker anna-parker changed the title use pydantic for the config feat(prepro): use pydantic to validate and parse the prepro config Dec 19, 2025
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Dec 19, 2025
@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 16:34
@claude

This comment was marked as outdated.

@anna-parker anna-parker marked this pull request as draft December 19, 2025 17:38
@anna-parker
Contributor Author

I fixed the issues raised by Claude - they were very good comments!

@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 19:57
@claude
Contributor

claude bot commented Dec 19, 2025

PR Review: feat(prepro): use pydantic to validate and parse the prepro config

This is a solid refactoring that improves type safety and validation. The migration from dataclasses to Pydantic models is well-executed overall. Here are my detailed findings:

✅ Strengths

  1. Better validation: Pydantic will automatically validate types when constructing models, catching configuration errors earlier
  2. Code reduction: -122 lines, +79 lines - the refactoring removes significant boilerplate
  3. Cleaner architecture: Moving ProcessingSpec to config.py and making it a Pydantic model is logical
  4. Smart property for multi_segment: Converting multi_segment from a field to a computed property (config.py:110-112) is elegant (see the sketch after this list)
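
For context, a minimal sketch of the pattern (class and field names here are illustrative, not the actual config definitions):

from pydantic import BaseModel

class ReferenceConfig(BaseModel):
    segments: list[str] = ["main"]

    @property
    def multi_segment(self) -> bool:
        # Derived on access, so it can never drift out of sync with the segment list
        return len(self.segments) > 1

# ReferenceConfig(segments=["L", "M", "S"]).multi_segment  ->  True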

🔴 Critical Issues

1. Environment variable type conversion missing (config.py:181-188)

env_overrides = {}
for key in Config.model_fields:
    env_var = f"PREPROCESSING_{key.upper()}"
    if env_var in os.environ:
        env_overrides[key] = os.environ[env_var]  # ⚠️ Always a string!

Problem: Environment variables are always strings, but Pydantic expects typed values. For example:

  • PREPROCESSING_BATCH_SIZE=100 will be "100" (string) instead of 100 (int)
  • PREPROCESSING_KEEP_TMP_DIR=true will be "true" (string) instead of True (bool)

Impact: This will likely cause Pydantic validation errors when env vars are used.

Fix: Either:

  1. Let Pydantic handle the conversion (it can parse strings to proper types):

    env_overrides[key] = os.environ[env_var]  # Pydantic will convert

    This should work but needs testing with different types (see the sketch after this list).

  2. Or use pydantic-settings (as mentioned in the PR description) which handles this properly.
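
For what it's worth, a self-contained sketch of option 1 (simplified fields, not the real Config) suggests Pydantic's default lax mode does coerce the strings:

import os
from pydantic import BaseModel

class Config(BaseModel):
    batch_size: int = 100
    keep_tmp_dir: bool = False

os.environ["PREPROCESSING_BATCH_SIZE"] = "100"
os.environ["PREPROCESSING_KEEP_TMP_DIR"] = "true"

env_overrides = {}
for key in Config.model_fields:
    env_var = f"PREPROCESSING_{key.upper()}"
    if env_var in os.environ:
        env_overrides[key] = os.environ[env_var]  # still raw strings here

config = Config(**env_overrides)
print(config)  # batch_size=100 keep_tmp_dir=True -- the strings were coerced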

2. Pydantic version is quite old

- pydantic=2.12.3

Current latest: Pydantic 2.10.x is latest stable (as of Jan 2025). Version 2.12.3 seems like a typo or version pinning issue.

Action needed: Verify the intended version. If 2.12.3 doesn't exist, this will fail during conda env creation.

⚠️ Security & Best Practices

3. Model validation could be more explicit

The model_validator at line 96-108 mutates the model during validation. While this works, consider:

  • Using separate validator methods for each concern (better error messages)
  • The mutation of ds.nextclade_dataset_server modifies nested objects which could be surprising

Suggestion: Consider using Pydantic field_validator for individual field validation where appropriate.
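
As an illustrative sketch only (the validator body is hypothetical, not the actual logic in config.py):

from pydantic import BaseModel, field_validator

class NextcladeDataset(BaseModel):
    nextclade_dataset_server: str

    @field_validator("nextclade_dataset_server")
    @classmethod
    def normalise_server_url(cls, value: str) -> str:
        # Per-field validation keeps the error attached to the offending field
        # instead of surfacing from a model-wide validator
        return value.rstrip("/")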

🟡 Code Quality Observations

4. Good: Default values for ProcessingSpec

class ProcessingSpec(BaseModel):
    inputs: FunctionInputs
    function: FunctionName = "identity"
    required: bool = False
    args: FunctionArgs | None = None

The defaults make the config YAML cleaner. Nice improvement!
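
Roughly, with simplified stand-ins for the input/arg types (not the real definitions), a spec entry that only lists its inputs now validates and picks up the defaults:

from pydantic import BaseModel

FunctionInputs = dict[str, str]  # simplified stand-in
FunctionArgs = dict[str, str]    # simplified stand-in
FunctionName = str               # simplified stand-in

class ProcessingSpec(BaseModel):
    inputs: FunctionInputs
    function: FunctionName = "identity"
    required: bool = False
    args: FunctionArgs | None = None

# e.g. a YAML entry that only specifies "inputs:" parses with everything else defaulted
spec = ProcessingSpec.model_validate({"inputs": {"input": "some_metadata_field"}})
print(spec.function, spec.required, spec.args)  # identity False None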

5. Good: Cleaner processing spec usage (prepro.py:280)

The old code had verbose dictionary unpacking; the new code is much cleaner with typed models.

6. Minor: Null check improvement (prepro.py:206)

args = dict(spec.args) if spec.args else {}

Good defensive programming for the args field.

📝 Test Coverage

The PR description says tests are included ✅

Verify: Ensure tests cover:

  • ✅ YAML parsing with the new Pydantic models
  • ⚠️ Environment variable overrides with different types (int, bool, str)
  • ⚠️ CLI argument parsing with the new structure
  • ✅ The multi_segment property computation
  • ✅ The alignment_requirement auto-setting logic

🎯 Recommendations

Before merging:

  1. Fix environment variable type conversion - Test with PREPROCESSING_BATCH_SIZE=100 to ensure conversion to int
  2. Verify pydantic version - Check if 2.12.3 is correct (seems too high)
  3. Add tests for env var type conversion - Especially for int, bool, and enum types

Future improvements (as mentioned in PR):
4. Migrate to pydantic-settings for proper environment variable handling (see the sketch below)
5. Consider using ConfigDict for model configuration if needed
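
For reference, a minimal sketch of the pydantic-settings direction, assuming the PREPROCESSING_ prefix is kept (fields illustrative):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Config(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="PREPROCESSING_")

    log_level: str = "DEBUG"
    keep_tmp_dir: bool = False
    batch_size: int = 100

# PREPROCESSING_BATCH_SIZE=10000 in the environment -> Config().batch_size == 10000 (an int)
config = Config()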

Overall Assessment

This is a good refactoring that moves in the right direction. The main concern is the environment variable type handling which could break existing deployments if env vars are used. The code quality improvements are solid.

Recommendation: Request changes to fix the environment variable type conversion issue before merging.


Files reviewed:

  • preprocessing/nextclade/environment.yml ✅
  • preprocessing/nextclade/src/loculus_preprocessing/config.py ⚠️
  • preprocessing/nextclade/src/loculus_preprocessing/datatypes.py ✅
  • preprocessing/nextclade/src/loculus_preprocessing/prepro.py ✅
  • preprocessing/nextclade/tests/multi_pathogen_config.yaml ✅

@anna-parker
Contributor Author

I tested the issue Claude raised and can confirm it is not actually a problem:

nextclade % PREPROCESSING_BATCH_SIZE=10000 prepro
INFO:loculus_preprocessing.__main__:Using config: log_level='DEBUG' keep_tmp_dir=False batch_size=10000 pipeline_version=1
