-
Notifications
You must be signed in to change notification settings - Fork 0
Add DatasetModel for discovery metadata #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #289 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 74 80 +6
Lines 2215 2384 +169
Branches 209 216 +7
==========================================
+ Hits 2215 2384 +169
🚀 New features to boost your workflow:
|
davidlougheed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few general notes:
- missing documentation/commenting about the fields, which schemas they're present in, what the structure is derived from, etc.
- the models should be exported in
__all__ - we should have both a Project and Dataset model I think, where Project is just Dataset but with a list of Datasets
- we should define some kind of abstract base class or similar which adds mappings from these entities to the different schemas we have to support, but maybe this can be done in a PR for v21 or something
v-rocheleau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass focusing on the pcgl model.
Add comprehensive dataset metadata model with support for: - Stakeholder tracking (organizations and people with roles) - Participant criteria (inclusion/exclusion) - Spatial coverage (string or GeoJSON with mandatory name property) - Licensing and data access information - Study-specific fields (domain, status, context) Uses geojson-pydantic for standards-compliant spatial data validation.
Reorganize discovery models by moving DatasetModel and related types into a new provenance submodule for better organization. Add proper exports to provenance/__init__.py for cleaner imports.
Add schema_version_ field to DatasetModel for schema evolution tracking. Rename date_release to release_data for consistency.
Remove optional None from License.url field to ensure license URLs are always provided.
Replace simple publication_links list with structured Publication model that includes title, DOI, publication type (as string with examples), authors, publication date, journal, and description fields for richer publication metadata.
Fix two field name typos in DatasetModel: - schema_version_ -> schema_version (remove trailing underscore) - release_data -> release_date (correct typo)
Convert type aliases to use the modern 'type' keyword syntax: - Role: Use 'type Role = Literal[...]' - StudyDomain: Use 'type StudyDomain = Literal[...]' This is the preferred syntax for type aliases in Python 3.12+.
- Add OntologyTerm Pydantic model for representing ontology terms - Export OntologyTerm from discovery.models module - Update DatasetModel keywords field to accept strings or OntologyTerm objects This allows datasets to use structured ontology terms for keywords instead of just plain strings.
Remove DatasetModel and dataset module re-exports from the top-level discovery.models __init__.py. Provenance models should be imported directly from bento_lib.discovery.models.provenance.dataset instead of being re-exported at the models level.
Add PCGL (Precision Canadian Genomics Library) Study Schema models: - Study: Main PCGL study model with comprehensive metadata - PrincipalInvestigator: Lead researcher information - Collaborator: Study collaborator details - FundingSource: Funding organization information - Type aliases: StudyStatus, StudyContext, StudyDomain using Python 3.12 'type' keyword Models can be imported from bento_lib.discovery.models.provenance.external
Add Pydantic Field descriptions derived from PCGL schema for: - domain: Description and min_length=1 validation - program_name: Description about overarching program This provides better documentation and validation for PCGL-specific fields.
Update List to list and Optional to | None syntax for consistency with Python 3.10+ type hint conventions.
Add comment explaining that the counts field is different from Bento counts and is provided by the metadata creator.
Add converter module with dataset_to_pcgl_study function that transforms DatasetModel instances to PCGL Study schema. The converter: - Extracts principal investigators, lead organizations, collaborators, and funding sources from stakeholders - Converts keywords (handles OntologyTerm) - Filters DOI publication links - Converts participant criteria to formatted string Requires study_id and dac_id as these fields don't exist in DatasetModel.
Add pcgl_study_to_dataset converter function for PCGL → DatasetModel: - Converts PCGL investigators, organizations, and funding to stakeholders - Parses publication DOI URLs to Publication objects - Parses participant criteria string back to structured list - Requires additional metadata (release_date, last_modified, primary_contact) Refactor existing converter for improved conciseness: - Use generator expressions for stakeholder conversion - Condense docstrings to single-line summaries - Remove redundant comments and section separators
- Add 32 tests covering all models in the provenance module - Test OntologyTerm, DatasetModel, and all related models (Person, Organization, Contact, etc.) - Test PCGL Study model and validation rules - Test bidirectional converters (dataset_to_pcgl_study and pcgl_study_to_dataset) - Test edge cases, error handling, and roundtrip conversions - Add "Data Contributor" role to Role literal type - Export Other class from provenance module
…el and PCGL converters Add 11 new tests covering: - SpatialCoverageFeature and SpatialCoverageProperties models - Converter helper functions for extracting PIs, organizations, collaborators, and funders - Edge cases: empty affiliations, multiple leadership roles, Person funders, non-DOI publications - Malformed participant criteria parsing - Optional parameters in pcgl_study_to_dataset converter Total tests increased from 34 to 45.
Add test_converter_person_funder_only to explicitly test the elif branch for Person funders without Organization funders in the same test execution.
…tures Reorganize test_discovery_provenance.py (1997 lines) into 5 focused modules: - test_provenance_fixtures.py: Shared pytest fixtures for reusable test objects - test_provenance_models.py: Tests for basic models (Contact, Person, Organization, etc.) - test_provenance_pcgl.py: Tests for PCGL Study models and validation - test_provenance_dataset.py: Tests for DatasetModel and roundtrip conversions - test_provenance_converters.py: Tests for dataset<->study converters Benefits: - Reduced total lines from 1997 to 1313 (34% reduction) - Better test organization by logical domain - Shared fixtures eliminate duplication - Easier to locate and maintain specific test categories - All 41 tests pass with 99% coverage maintained
- Remove unused Person import from test_provenance_dataset.py - Remove unused pytest and ValidationError imports from test_provenance_models.py - Remove unused DatasetModel import from test_provenance_models.py All tests still pass (41/41).
- Changed Person model from first_name/last_name to single name field - Changed Person.role to Person.roles (list) - Changed Organization.role to Organization.roles (list) - Updated all converter functions to use new field names - Updated all tests to match new model structure - Updated CLAUDE.md to exclude co-author footers in commits - All 41 tests pass with 99% code coverage
…nization] - Updated Publication model to support richer author metadata - Authors can now include affiliations, roles, and contact information - Updated test_publication to use Person and Organization objects - Added new test_publication_with_mixed_authors test - All 42 tests pass with 99% code coverage BREAKING CHANGE: Publication.authors type changed from list[str] to list[Person | Organization]
- Changed keywords, collaborators, and publicationLinks from Optional[list[T]] to list[T] - Use default_factory=list instead of None for empty lists - Updated validator for publication_links to handle non-optional list - Updated converters to work with empty lists instead of None - Updated all tests to use empty lists [] instead of None - All 42 tests pass with 99% code coverage
…rays - Removed StudyDomain Literal type from both PCGL and DatasetModel - Changed domain field to list[str] in both models - Removed Other type usage for custom domains (now just strings) - Simplified converters to pass domain strings directly - Updated all tests to use plain string domains instead of Other type - Removed StudyDomain from module exports - All 42 tests pass with 99% code coverage This makes the domain field more flexible and allows any custom domain strings without requiring the Other wrapper type.
- Changed StudyStatus from 'Ongoing'/'Completed' to 'ONGOING'/'COMPLETED' - Changed StudyContext from 'Clinical'/'Research' to 'CLINICAL'/'RESEARCH' - Updated both PCGL Study and DatasetModel to use uppercase values - Updated all tests to use uppercase values - All 42 tests pass with 99% code coverage This follows common convention of using uppercase for enum-like constants.
- Removed Optional import from typing - Changed Optional[str] to str | None for: - Collaborator.role - FundingSource.grant_number - Study.program_name - Study.participant_criteria - Uses modern Python 3.10+ union syntax - All 42 tests pass with 99% code coverage
Add StudyDomain Literal type to PCGL Study model with 13 specific domain options: Aging, Birth Defects, Cancer, Circulatory and Respiratory Health, General Health, Infection and Immunity, Musculoskeletal Health and Arthritis, Neurodevelopmental Conditions, Neurosciences Mental Health and Addiction, Nutrition Metabolism and Diabetes, Population Genomics, Rare Diseases, and Other. Update Study.domain field from list[str] to list[StudyDomain] to enforce validation. DatasetModel.domain remains list[str] to support generic use cases, with validation occurring during conversion to PCGL Study. Update tests to use valid domain values and add validation tests for invalid domains. Fix typo in test_pcgl_study_validation_empty_lists.
Remove the dataset_to_pcgl_study conversion function and related helper functions as they are not needed. The PCGL Study models are intended to be populated from external PCGL data sources, not converted from DatasetModel. Changes: - Remove dataset_to_pcgl_study function from converters/pcgl.py - Remove helper functions: _extract_principal_investigators, _extract_lead_organizations, _extract_collaborators, _extract_funding_sources, _extract_doi_publication_links, _convert_participant_criteria - Remove dataset_to_pcgl_study from __init__.py exports - Remove all tests for dataset_to_pcgl_study conversion - Remove roundtrip conversion test from test_provenance_dataset.py - Keep pcgl_study_to_dataset converter for importing PCGL data The library now only supports one-way conversion from PCGL Study to DatasetModel, which aligns with the intended use case of importing external PCGL study data into the Bento platform.
- Add Logo model with theme-specific variants (light/dark/default) - Rename PCGL fields with pcgl_ prefix (domain, status, context, program_name) - Update PCGL converter to use new field names with type conversion - Update all tests to reflect field name changes
- Add extra_properties field to store custom metadata - Support string, int, float, bool, and null values - Add test for extra_properties validation
- Replace journal parameter with publication_venue in Publication tests - Add links=[] to all DatasetModel instantiations - Create PublicationVenue objects in publication tests - Export Link and PublicationVenue from provenance __init__
- Add dataset_id field to DatasetModel (inherited from PCGL study_id or created in katsu) - Update PCGL converter to map study.study_id to dataset.dataset_id - Update all test DatasetModel instantiations to include dataset_id
98dbf12 to
a8ce65e
Compare
…logies.models Migrate provenance models to use the new OntologyClass model which includes CURIE pattern validation for ontology term IDs.
Summary
DatasetModelfor dataset discovery metadata with stakeholder tracking, participant criteria, and licensing informationgeojson-pydantic