Add DatasetModel for discovery metadata #289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Draft

SanjeevLakhwani wants to merge 41 commits into master from feature/dataset-model

+1,506 −0

SanjeevLakhwani commented Nov 20, 2025

Summary

Add comprehensive DatasetModel for dataset discovery metadata with stakeholder tracking, participant criteria, and licensing information
Add GeoJSON spatial coverage support with mandatory name property using geojson-pydantic
Include study-specific fields (domain, status, context) and fix type safety issues in supporting models

codecov bot commented Nov 20, 2025 •

edited

Loading

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (533a7f9) to head (f38453a).

Additional details and impacted files

@@            Coverage Diff             @@
##            master      #289    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           74        80     +6     
  Lines         2215      2384   +169     
  Branches       209       216     +7     
==========================================
+ Hits          2215      2384   +169

Files with missing lines	Coverage Δ
bento_lib/discovery/models/__init__.py	`100.00% <100.00%> (ø)`
bento_lib/discovery/models/provenance/__init__.py	`100.00% <100.00%> (ø)`
...discovery/models/provenance/converters/__init__.py	`100.00% <100.00%> (ø)`
...lib/discovery/models/provenance/converters/pcgl.py	`100.00% <100.00%> (ø)`
bento_lib/discovery/models/provenance/dataset.py	`100.00% <100.00%> (ø)`
...b/discovery/models/provenance/external/__init__.py	`100.00% <100.00%> (ø)`
...o_lib/discovery/models/provenance/external/pcgl.py	`100.00% <100.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

davidlougheed requested changes

View reviewed changes

Member

davidlougheed left a comment

a few general notes:

missing documentation/commenting about the fields, which schemas they're present in, what the structure is derived from, etc.
the models should be exported in __all__
we should have both a Project and Dataset model I think, where Project is just Dataset but with a list of Datasets
we should define some kind of abstract base class or similar which adds mappings from these entities to the different schemas we have to support, but maybe this can be done in a PR for v21 or something

bento_lib/discovery/models/__init__.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

SanjeevLakhwani marked this pull request as draft

November 20, 2025 20:00

SanjeevLakhwani requested a review from davidlougheed

November 21, 2025 18:18

davidlougheed requested changes

View reviewed changes

bento_lib/discovery/models/__init__.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

SanjeevLakhwani requested a review from davidlougheed

November 24, 2025 19:33

davidlougheed requested changes

View reviewed changes

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/converters/pcgl.py Show resolved Hide resolved

bento_lib/discovery/models/ontology.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Outdated Show resolved Hide resolved

v-rocheleau requested changes

View reviewed changes

v-rocheleau left a comment

First pass focusing on the pcgl model.

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/converters/pcgl.py Show resolved Hide resolved

SanjeevLakhwani requested review from davidlougheed and v-rocheleau

November 27, 2025 19:51

davidlougheed reviewed

View reviewed changes

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

SanjeevLakhwani requested a review from davidlougheed

December 17, 2025 19:38

davidlougheed requested changes

View reviewed changes

bento_lib/discovery/models/ontology.py Outdated Show resolved Hide resolved

bento_lib/discovery/models/provenance/dataset.py Show resolved Hide resolved

bento_lib/discovery/models/provenance/external/pcgl.py Show resolved Hide resolved

tests/test_provenance_converters.py Outdated Show resolved Hide resolved

tests/test_provenance_dataset.py Outdated Show resolved Hide resolved

SanjeevLakhwani added 17 commits

January 7, 2026 11:52


          feat: add DatasetModel for discovery metadata

2d94455

Add comprehensive dataset metadata model with support for:
- Stakeholder tracking (organizations and people with roles)
- Participant criteria (inclusion/exclusion)
- Spatial coverage (string or GeoJSON with mandatory name property)
- Licensing and data access information
- Study-specific fields (domain, status, context)

Uses geojson-pydantic for standards-compliant spatial data validation.


          style: fix formatting in dataset.py

505750b


          refactor: move DatasetModel to provenance submodule

9d41c93

Reorganize discovery models by moving DatasetModel and related types
into a new provenance submodule for better organization. Add proper
exports to provenance/__init__.py for cleaner imports.


          feat: add schema version and fix date_release field name

Add schema_version_ field to DatasetModel for schema evolution tracking.
Rename date_release to release_data for consistency.


          fix: make License.url field required

898ac15

Remove optional None from License.url field to ensure license URLs are always provided.


          feat: add Publication model with detailed metadata fields

fa095df

Replace simple publication_links list with structured Publication model
that includes title, DOI, publication type (as string with examples),
authors, publication date, journal, and description fields for richer
publication metadata.


          fix: correct field names in DatasetModel schema

b930473

Fix two field name typos in DatasetModel:
- schema_version_ -> schema_version (remove trailing underscore)
- release_data -> release_date (correct typo)


          refactor: use Python 3.12 type keyword for type aliases

eca07f7

Convert type aliases to use the modern 'type' keyword syntax:
- Role: Use 'type Role = Literal[...]'
- StudyDomain: Use 'type StudyDomain = Literal[...]'

This is the preferred syntax for type aliases in Python 3.12+.


          feat: add OntologyTerm model and integrate with DatasetModel

f2098cd

- Add OntologyTerm Pydantic model for representing ontology terms
- Export OntologyTerm from discovery.models module
- Update DatasetModel keywords field to accept strings or OntologyTerm objects

This allows datasets to use structured ontology terms for keywords
instead of just plain strings.


          refactor: remove provenance model re-exports from discovery.models

a763664

Remove DatasetModel and dataset module re-exports from the top-level
discovery.models __init__.py. Provenance models should be imported
directly from bento_lib.discovery.models.provenance.dataset instead
of being re-exported at the models level.


          feat: add PCGL external schema models with type aliases

6df7383

Add PCGL (Precision Canadian Genomics Library) Study Schema models:
- Study: Main PCGL study model with comprehensive metadata
- PrincipalInvestigator: Lead researcher information
- Collaborator: Study collaborator details
- FundingSource: Funding organization information
- Type aliases: StudyStatus, StudyContext, StudyDomain using Python 3.12 'type' keyword

Models can be imported from bento_lib.discovery.models.provenance.external


          feat: add Field descriptions to PCGL-specific fields in DatasetModel

5bdc08c

Add Pydantic Field descriptions derived from PCGL schema for:
- domain: Description and min_length=1 validation
- program_name: Description about overarching program

This provides better documentation and validation for PCGL-specific fields.


          style: update type hints in PCGL schema to use lowercase generics

e1b3ed9

Update List to list and Optional to | None syntax for consistency
with Python 3.10+ type hint conventions.


          docs: add clarifying comment to counts field in DatasetModel

239d3cd

Add comment explaining that the counts field is different from Bento
counts and is provided by the metadata creator.


          feat: add DatasetModel to PCGL Study converter

c09511a

Add converter module with dataset_to_pcgl_study function that transforms
DatasetModel instances to PCGL Study schema. The converter:
- Extracts principal investigators, lead organizations, collaborators,
  and funding sources from stakeholders
- Converts keywords (handles OntologyTerm)
- Filters DOI publication links
- Converts participant criteria to formatted string

Requires study_id and dac_id as these fields don't exist in DatasetModel.


          feat: add bidirectional converter between PCGL Study and DatasetModel

468c539

Add pcgl_study_to_dataset converter function for PCGL → DatasetModel:
- Converts PCGL investigators, organizations, and funding to stakeholders
- Parses publication DOI URLs to Publication objects
- Parses participant criteria string back to structured list
- Requires additional metadata (release_date, last_modified, primary_contact)

Refactor existing converter for improved conciseness:
- Use generator expressions for stakeholder conversion
- Condense docstrings to single-line summaries
- Remove redundant comments and section separators


          test: add comprehensive tests for DatasetModel and PCGL Study converters

0d78c7d

- Add 32 tests covering all models in the provenance module
- Test OntologyTerm, DatasetModel, and all related models (Person, Organization, Contact, etc.)
- Test PCGL Study model and validation rules
- Test bidirectional converters (dataset_to_pcgl_study and pcgl_study_to_dataset)
- Test edge cases, error handling, and roundtrip conversions
- Add "Data Contributor" role to Role literal type
- Export Other class from provenance module

SanjeevLakhwani added 20 commits

January 7, 2026 11:52


          test: add comprehensive tests to improve code coverage for DatasetMod…

2605ed4

…el and PCGL converters

Add 11 new tests covering:
- SpatialCoverageFeature and SpatialCoverageProperties models
- Converter helper functions for extracting PIs, organizations, collaborators, and funders
- Edge cases: empty affiliations, multiple leadership roles, Person funders, non-DOI publications
- Malformed participant criteria parsing
- Optional parameters in pcgl_study_to_dataset converter

Total tests increased from 34 to 45.


          test: add test for Person-only funder to improve branch coverage

bbb17d0

Add test_converter_person_funder_only to explicitly test the elif branch
for Person funders without Organization funders in the same test execution.


          refactor: split provenance tests into logical modules with shared fix…

0f52017

…tures

Reorganize test_discovery_provenance.py (1997 lines) into 5 focused modules:
- test_provenance_fixtures.py: Shared pytest fixtures for reusable test objects
- test_provenance_models.py: Tests for basic models (Contact, Person, Organization, etc.)
- test_provenance_pcgl.py: Tests for PCGL Study models and validation
- test_provenance_dataset.py: Tests for DatasetModel and roundtrip conversions
- test_provenance_converters.py: Tests for dataset<->study converters

Benefits:
- Reduced total lines from 1997 to 1313 (34% reduction)
- Better test organization by logical domain
- Shared fixtures eliminate duplication
- Easier to locate and maintain specific test categories
- All 41 tests pass with 99% coverage maintained


          fix: remove unused imports from provenance tests

171de31

- Remove unused Person import from test_provenance_dataset.py
- Remove unused pytest and ValidationError imports from test_provenance_models.py
- Remove unused DatasetModel import from test_provenance_models.py

All tests still pass (41/41).


          refactor: change Person and Organization role field to roles (plural)

bddf5cb

- Changed Person model from first_name/last_name to single name field
- Changed Person.role to Person.roles (list)
- Changed Organization.role to Organization.roles (list)
- Updated all converter functions to use new field names
- Updated all tests to match new model structure
- Updated CLAUDE.md to exclude co-author footers in commits
- All 41 tests pass with 99% code coverage


          docs: add schema.org reference to Contact model docstring

af2c11c


          docs: clarify usage of 'Other' in StudyDomain type

20af55e


          feat: change Publication.authors from list[str] to list[Person | Orga…

ec79929

…nization]

- Updated Publication model to support richer author metadata
- Authors can now include affiliations, roles, and contact information
- Updated test_publication to use Person and Organization objects
- Added new test_publication_with_mixed_authors test
- All 42 tests pass with 99% code coverage

BREAKING CHANGE: Publication.authors type changed from list[str] to list[Person | Organization]


          refactor: remove Optional from list fields in PCGL Study model

b38cad0

- Changed keywords, collaborators, and publicationLinks from Optional[list[T]] to list[T]
- Use default_factory=list instead of None for empty lists
- Updated validator for publication_links to handle non-optional list
- Updated converters to work with empty lists instead of None
- Updated all tests to use empty lists [] instead of None
- All 42 tests pass with 99% code coverage


          refactor: remove StudyDomain type restriction and use plain string ar…

a508497

…rays

- Removed StudyDomain Literal type from both PCGL and DatasetModel
- Changed domain field to list[str] in both models
- Removed Other type usage for custom domains (now just strings)
- Simplified converters to pass domain strings directly
- Updated all tests to use plain string domains instead of Other type
- Removed StudyDomain from module exports
- All 42 tests pass with 99% code coverage

This makes the domain field more flexible and allows any custom domain strings
without requiring the Other wrapper type.


          refactor: change status and context Literal values to uppercase

97890e2

- Changed StudyStatus from 'Ongoing'/'Completed' to 'ONGOING'/'COMPLETED'
- Changed StudyContext from 'Clinical'/'Research' to 'CLINICAL'/'RESEARCH'
- Updated both PCGL Study and DatasetModel to use uppercase values
- Updated all tests to use uppercase values
- All 42 tests pass with 99% code coverage

This follows common convention of using uppercase for enum-like constants.


          refactor: replace Optional with | None syntax in PCGL models

9bbb6e2

- Removed Optional import from typing
- Changed Optional[str] to str | None for:
  - Collaborator.role
  - FundingSource.grant_number
  - Study.program_name
  - Study.participant_criteria
- Uses modern Python 3.10+ union syntax
- All 42 tests pass with 99% code coverage


          fix: typo

29b2257


          feat: add StudyDomain Literal type with specific domain values

96888a6

Add StudyDomain Literal type to PCGL Study model with 13 specific
domain options: Aging, Birth Defects, Cancer, Circulatory and
Respiratory Health, General Health, Infection and Immunity,
Musculoskeletal Health and Arthritis, Neurodevelopmental Conditions,
Neurosciences Mental Health and Addiction, Nutrition Metabolism and
Diabetes, Population Genomics, Rare Diseases, and Other.

Update Study.domain field from list[str] to list[StudyDomain] to
enforce validation. DatasetModel.domain remains list[str] to support
generic use cases, with validation occurring during conversion to
PCGL Study.

Update tests to use valid domain values and add validation tests for
invalid domains. Fix typo in test_pcgl_study_validation_empty_lists.


          refactor: remove dataset_to_pcgl_study converter

19e5152

Remove the dataset_to_pcgl_study conversion function and related helper
functions as they are not needed. The PCGL Study models are intended to
be populated from external PCGL data sources, not converted from
DatasetModel.

Changes:
- Remove dataset_to_pcgl_study function from converters/pcgl.py
- Remove helper functions: _extract_principal_investigators,
  _extract_lead_organizations, _extract_collaborators,
  _extract_funding_sources, _extract_doi_publication_links,
  _convert_participant_criteria
- Remove dataset_to_pcgl_study from __init__.py exports
- Remove all tests for dataset_to_pcgl_study conversion
- Remove roundtrip conversion test from test_provenance_dataset.py
- Keep pcgl_study_to_dataset converter for importing PCGL data

The library now only supports one-way conversion from PCGL Study to
DatasetModel, which aligns with the intended use case of importing
external PCGL study data into the Bento platform.


          feat: add Logo model and prefix PCGL-specific fields

71c1bd0

- Add Logo model with theme-specific variants (light/dark/default)
- Rename PCGL fields with pcgl_ prefix (domain, status, context, program_name)
- Update PCGL converter to use new field names with type conversion
- Update all tests to reflect field name changes


          feat: add extra_properties field to DatasetModel

99b6812

- Add extra_properties field to store custom metadata
- Support string, int, float, bool, and null values
- Add test for extra_properties validation


          test: update tests for PublicationVenue and Link models

be194b3

- Replace journal parameter with publication_venue in Publication tests
- Add links=[] to all DatasetModel instantiations
- Create PublicationVenue objects in publication tests
- Export Link and PublicationVenue from provenance __init__


          feat: add dataset_id field to DatasetModel

13e8884

- Add dataset_id field to DatasetModel (inherited from PCGL study_id or created in katsu)
- Update PCGL converter to map study.study_id to dataset.dataset_id
- Update all test DatasetModel instantiations to include dataset_id


          chore: add __all__ exports to provenance models

a8ce65e

SanjeevLakhwani force-pushed the feature/dataset-model branch from 98dbf12 to a8ce65e Compare

January 7, 2026 16:53

SanjeevLakhwani added 4 commits

January 7, 2026 12:01


          refactor: replace OntologyTerm with OntologyClass from bento_lib.onto…

92137ec

…logies.models

Migrate provenance models to use the new OntologyClass model which
includes CURIE pattern validation for ontology term IDs.


          refactor: reduce duplication in provenance dataset tests using base f…

af05e12

…ixture


          refactor: remove unused discovery/models/ontology.py

00cfed1


          refactor: extract PCGL study test data to JSON fixtures

f38453a

SanjeevLakhwani requested a review from davidlougheed

January 7, 2026 17:25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet