Skip to content

Conversation

@SanjeevLakhwani
Copy link

Summary

  • Add comprehensive DatasetModel for dataset discovery metadata with stakeholder tracking, participant criteria, and licensing information
  • Add GeoJSON spatial coverage support with mandatory name property using geojson-pydantic
  • Include study-specific fields (domain, status, context) and fix type safety issues in supporting models

@codecov
Copy link

codecov bot commented Nov 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (533a7f9) to head (f38453a).

Additional details and impacted files
@@            Coverage Diff             @@
##            master      #289    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           74        80     +6     
  Lines         2215      2384   +169     
  Branches       209       216     +7     
==========================================
+ Hits          2215      2384   +169     
Files with missing lines Coverage Δ
bento_lib/discovery/models/__init__.py 100.00% <100.00%> (ø)
bento_lib/discovery/models/provenance/__init__.py 100.00% <100.00%> (ø)
...discovery/models/provenance/converters/__init__.py 100.00% <100.00%> (ø)
...lib/discovery/models/provenance/converters/pcgl.py 100.00% <100.00%> (ø)
bento_lib/discovery/models/provenance/dataset.py 100.00% <100.00%> (ø)
...b/discovery/models/provenance/external/__init__.py 100.00% <100.00%> (ø)
...o_lib/discovery/models/provenance/external/pcgl.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@davidlougheed davidlougheed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few general notes:

  • missing documentation/commenting about the fields, which schemas they're present in, what the structure is derived from, etc.
  • the models should be exported in __all__
  • we should have both a Project and Dataset model I think, where Project is just Dataset but with a list of Datasets
  • we should define some kind of abstract base class or similar which adds mappings from these entities to the different schemas we have to support, but maybe this can be done in a PR for v21 or something

@SanjeevLakhwani SanjeevLakhwani marked this pull request as draft November 20, 2025 20:00
Copy link

@v-rocheleau v-rocheleau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass focusing on the pcgl model.

Add comprehensive dataset metadata model with support for:
- Stakeholder tracking (organizations and people with roles)
- Participant criteria (inclusion/exclusion)
- Spatial coverage (string or GeoJSON with mandatory name property)
- Licensing and data access information
- Study-specific fields (domain, status, context)

Uses geojson-pydantic for standards-compliant spatial data validation.
Reorganize discovery models by moving DatasetModel and related types
into a new provenance submodule for better organization. Add proper
exports to provenance/__init__.py for cleaner imports.
Add schema_version_ field to DatasetModel for schema evolution tracking.
Rename date_release to release_data for consistency.
Remove optional None from License.url field to ensure license URLs are always provided.
Replace simple publication_links list with structured Publication model
that includes title, DOI, publication type (as string with examples),
authors, publication date, journal, and description fields for richer
publication metadata.
Fix two field name typos in DatasetModel:
- schema_version_ -> schema_version (remove trailing underscore)
- release_data -> release_date (correct typo)
Convert type aliases to use the modern 'type' keyword syntax:
- Role: Use 'type Role = Literal[...]'
- StudyDomain: Use 'type StudyDomain = Literal[...]'

This is the preferred syntax for type aliases in Python 3.12+.
- Add OntologyTerm Pydantic model for representing ontology terms
- Export OntologyTerm from discovery.models module
- Update DatasetModel keywords field to accept strings or OntologyTerm objects

This allows datasets to use structured ontology terms for keywords
instead of just plain strings.
Remove DatasetModel and dataset module re-exports from the top-level
discovery.models __init__.py. Provenance models should be imported
directly from bento_lib.discovery.models.provenance.dataset instead
of being re-exported at the models level.
Add PCGL (Precision Canadian Genomics Library) Study Schema models:
- Study: Main PCGL study model with comprehensive metadata
- PrincipalInvestigator: Lead researcher information
- Collaborator: Study collaborator details
- FundingSource: Funding organization information
- Type aliases: StudyStatus, StudyContext, StudyDomain using Python 3.12 'type' keyword

Models can be imported from bento_lib.discovery.models.provenance.external
Add Pydantic Field descriptions derived from PCGL schema for:
- domain: Description and min_length=1 validation
- program_name: Description about overarching program

This provides better documentation and validation for PCGL-specific fields.
Update List to list and Optional to | None syntax for consistency
with Python 3.10+ type hint conventions.
Add comment explaining that the counts field is different from Bento
counts and is provided by the metadata creator.
Add converter module with dataset_to_pcgl_study function that transforms
DatasetModel instances to PCGL Study schema. The converter:
- Extracts principal investigators, lead organizations, collaborators,
  and funding sources from stakeholders
- Converts keywords (handles OntologyTerm)
- Filters DOI publication links
- Converts participant criteria to formatted string

Requires study_id and dac_id as these fields don't exist in DatasetModel.
Add pcgl_study_to_dataset converter function for PCGL → DatasetModel:
- Converts PCGL investigators, organizations, and funding to stakeholders
- Parses publication DOI URLs to Publication objects
- Parses participant criteria string back to structured list
- Requires additional metadata (release_date, last_modified, primary_contact)

Refactor existing converter for improved conciseness:
- Use generator expressions for stakeholder conversion
- Condense docstrings to single-line summaries
- Remove redundant comments and section separators
- Add 32 tests covering all models in the provenance module
- Test OntologyTerm, DatasetModel, and all related models (Person, Organization, Contact, etc.)
- Test PCGL Study model and validation rules
- Test bidirectional converters (dataset_to_pcgl_study and pcgl_study_to_dataset)
- Test edge cases, error handling, and roundtrip conversions
- Add "Data Contributor" role to Role literal type
- Export Other class from provenance module
…el and PCGL converters

Add 11 new tests covering:
- SpatialCoverageFeature and SpatialCoverageProperties models
- Converter helper functions for extracting PIs, organizations, collaborators, and funders
- Edge cases: empty affiliations, multiple leadership roles, Person funders, non-DOI publications
- Malformed participant criteria parsing
- Optional parameters in pcgl_study_to_dataset converter

Total tests increased from 34 to 45.
Add test_converter_person_funder_only to explicitly test the elif branch
for Person funders without Organization funders in the same test execution.
…tures

Reorganize test_discovery_provenance.py (1997 lines) into 5 focused modules:
- test_provenance_fixtures.py: Shared pytest fixtures for reusable test objects
- test_provenance_models.py: Tests for basic models (Contact, Person, Organization, etc.)
- test_provenance_pcgl.py: Tests for PCGL Study models and validation
- test_provenance_dataset.py: Tests for DatasetModel and roundtrip conversions
- test_provenance_converters.py: Tests for dataset<->study converters

Benefits:
- Reduced total lines from 1997 to 1313 (34% reduction)
- Better test organization by logical domain
- Shared fixtures eliminate duplication
- Easier to locate and maintain specific test categories
- All 41 tests pass with 99% coverage maintained
- Remove unused Person import from test_provenance_dataset.py
- Remove unused pytest and ValidationError imports from test_provenance_models.py
- Remove unused DatasetModel import from test_provenance_models.py

All tests still pass (41/41).
- Changed Person model from first_name/last_name to single name field
- Changed Person.role to Person.roles (list)
- Changed Organization.role to Organization.roles (list)
- Updated all converter functions to use new field names
- Updated all tests to match new model structure
- Updated CLAUDE.md to exclude co-author footers in commits
- All 41 tests pass with 99% code coverage
…nization]

- Updated Publication model to support richer author metadata
- Authors can now include affiliations, roles, and contact information
- Updated test_publication to use Person and Organization objects
- Added new test_publication_with_mixed_authors test
- All 42 tests pass with 99% code coverage

BREAKING CHANGE: Publication.authors type changed from list[str] to list[Person | Organization]
- Changed keywords, collaborators, and publicationLinks from Optional[list[T]] to list[T]
- Use default_factory=list instead of None for empty lists
- Updated validator for publication_links to handle non-optional list
- Updated converters to work with empty lists instead of None
- Updated all tests to use empty lists [] instead of None
- All 42 tests pass with 99% code coverage
…rays

- Removed StudyDomain Literal type from both PCGL and DatasetModel
- Changed domain field to list[str] in both models
- Removed Other type usage for custom domains (now just strings)
- Simplified converters to pass domain strings directly
- Updated all tests to use plain string domains instead of Other type
- Removed StudyDomain from module exports
- All 42 tests pass with 99% code coverage

This makes the domain field more flexible and allows any custom domain strings
without requiring the Other wrapper type.
- Changed StudyStatus from 'Ongoing'/'Completed' to 'ONGOING'/'COMPLETED'
- Changed StudyContext from 'Clinical'/'Research' to 'CLINICAL'/'RESEARCH'
- Updated both PCGL Study and DatasetModel to use uppercase values
- Updated all tests to use uppercase values
- All 42 tests pass with 99% code coverage

This follows common convention of using uppercase for enum-like constants.
- Removed Optional import from typing
- Changed Optional[str] to str | None for:
  - Collaborator.role
  - FundingSource.grant_number
  - Study.program_name
  - Study.participant_criteria
- Uses modern Python 3.10+ union syntax
- All 42 tests pass with 99% code coverage
Add StudyDomain Literal type to PCGL Study model with 13 specific
domain options: Aging, Birth Defects, Cancer, Circulatory and
Respiratory Health, General Health, Infection and Immunity,
Musculoskeletal Health and Arthritis, Neurodevelopmental Conditions,
Neurosciences Mental Health and Addiction, Nutrition Metabolism and
Diabetes, Population Genomics, Rare Diseases, and Other.

Update Study.domain field from list[str] to list[StudyDomain] to
enforce validation. DatasetModel.domain remains list[str] to support
generic use cases, with validation occurring during conversion to
PCGL Study.

Update tests to use valid domain values and add validation tests for
invalid domains. Fix typo in test_pcgl_study_validation_empty_lists.
Remove the dataset_to_pcgl_study conversion function and related helper
functions as they are not needed. The PCGL Study models are intended to
be populated from external PCGL data sources, not converted from
DatasetModel.

Changes:
- Remove dataset_to_pcgl_study function from converters/pcgl.py
- Remove helper functions: _extract_principal_investigators,
  _extract_lead_organizations, _extract_collaborators,
  _extract_funding_sources, _extract_doi_publication_links,
  _convert_participant_criteria
- Remove dataset_to_pcgl_study from __init__.py exports
- Remove all tests for dataset_to_pcgl_study conversion
- Remove roundtrip conversion test from test_provenance_dataset.py
- Keep pcgl_study_to_dataset converter for importing PCGL data

The library now only supports one-way conversion from PCGL Study to
DatasetModel, which aligns with the intended use case of importing
external PCGL study data into the Bento platform.
- Add Logo model with theme-specific variants (light/dark/default)
- Rename PCGL fields with pcgl_ prefix (domain, status, context, program_name)
- Update PCGL converter to use new field names with type conversion
- Update all tests to reflect field name changes
- Add extra_properties field to store custom metadata
- Support string, int, float, bool, and null values
- Add test for extra_properties validation
- Replace journal parameter with publication_venue in Publication tests
- Add links=[] to all DatasetModel instantiations
- Create PublicationVenue objects in publication tests
- Export Link and PublicationVenue from provenance __init__
- Add dataset_id field to DatasetModel (inherited from PCGL study_id or created in katsu)
- Update PCGL converter to map study.study_id to dataset.dataset_id
- Update all test DatasetModel instantiations to include dataset_id
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants