Commit d5ee502

[Add] Initial HealthTable setup and support (#30)
* enabled extension on DataFrames + OMOP CDM load
* added omop stub and updated HealthBase accordingly
* updated omop cdm extension
* added sketch notes on the interface
* refactored and updated in docs
* added sample tests
* added struct healthtable in comments
* addons and working test code
* made it Tables.jl compatible
* resolved review changes and added preprocessing utilities
* added other preprocessing functions
* added new strategies and tests
* updated verification metadata and onehotencoding function
* updated map_concepts function
* update functions and review changes
* updated docs
* updated all docs
* updated all test and docs
* final changes
* updated docs, removed unnecessary dependency
* julia-actions errors fix
* removed stats from runtests
* updated tests for code coverage
* updated omopcdm ext tests for code coverage
* removed redundant tests
* updated ext edge case tests for coverage
* validation tests to cover remaining lines
* Re-run CI
1 parent 3258bce commit d5ee502

20 files changed: +1479 -8 lines changed

Project.toml

Lines changed: 16 additions & 0 deletions
@@ -3,13 +3,29 @@ uuid = "94e1309d-ccf4-42de-905f-515f1d7b1cae"
 authors = ["Dilum Aluthge", "contributors"]
 version = "2.0.0"

+[deps]
+FeatureTransforms = "8fd68953-04b8-4117-ac19-158bf6de9782"
+InlineStrings = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48"
+OMOPCommonDataModel = "ba65db9e-6590-4054-ab8a-101ed9124986"
+PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
+Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+
 [weakdeps]
+DBInterface = "a10d1c49-ce27-4219-8d33-6db1a4562965"
+Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
 DrWatson = "634d3b9d-ee7a-5ddf-bec9-22491ea816e1"
+DuckDB = "d2f5444f-75bc-4fdf-ac35-56f514c445e1"
+Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

 [extensions]
 HealthBaseDrWatsonExt = "DrWatson"
+HealthBaseOMOPCDMExt = ["DataFrames", "OMOPCommonDataModel", "InlineStrings", "Serialization", "Dates", "FeatureTransforms", "DBInterface", "DuckDB"]

 [compat]
+Dates = "1.10"
+PrettyTables = "2.4.0"
+Tables = "1.12.1"
 julia = "1.10"

 [extras]

assets/version_info

422 KB
Binary file not shown.

docs/Project.toml

Lines changed: 16 additions & 0 deletions
@@ -1,5 +1,21 @@
 [deps]
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 DocumenterTools = "35a29f4d-8980-5a13-9543-d66fff28ecb8"
+DuckDB = "d2f5444f-75bc-4fdf-ac35-56f514c445e1"
+FeatureTransforms = "8fd68953-04b8-4117-ac19-158bf6de9782"
 HealthBase = "94e1309d-ccf4-42de-905f-515f1d7b1cae"
 LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"
+OMOPCommonDataModel = "ba65db9e-6590-4054-ab8a-101ed9124986"
+Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
+
+[compat]
+Documenter = "1"
+DocumenterTools = "0.1.10"
+HealthBase = "1, 2"
+LiveServer = "1"
+julia = "1.10"
+DuckDB = "1"
+FeatureTransforms = "0.4.0"
+OMOPCommonDataModel = "0.1"
+Tables = "1.12.1"

docs/make.jl

Lines changed: 25 additions & 4 deletions
@@ -1,11 +1,21 @@
 using HealthBase
 using Documenter
+using Tables
+using DataFrames
+using OMOPCommonDataModel
+using FeatureTransforms
+using DuckDB

-DocMeta.setdocmeta!(HealthBase, :DocTestSetup, :(using HealthBase); recursive = true)
+DocMeta.setdocmeta!(HealthBase, :DocTestSetup, :(using HealthBase, Tables); recursive = true)

 makedocs(;
-    modules = [HealthBase],
-    authors = "Jacob S. Zelko, Dilum Aluthge and contributors",
+    modules = [
+        HealthBase,
+        isdefined(Base, :get_extension) ?
+        Base.get_extension(HealthBase, :HealthBaseOMOPCDMExt) : HealthBase.HealthBaseOMOPCDMExt
+    ],
+    checkdocs = :none,
+    authors = "Jacob S. Zelko, Dilum Aluthge and contributors",
     repo = "https://github.com/JuliaHealth/HealthBase.jl/blob/{commit}{path}#{line}",
     sitename = "HealthBase.jl",
     format = Documenter.HTML(;
@@ -15,7 +25,18 @@ makedocs(;
     ),
     pages = [
         "Home" => "index.md",
-        "Workflow Guides" => ["observational_template_workflow.md"],
+        "Quickstart" => "quickstart.md",
+
+        "Workflow Guides" => [
+            "Observational Template Workflow" => "observational_template_workflow.md",
+            "OMOP CDM Workflow" => "OMOPCDMWorkflow.md",
+        ],
+
+        "HealthTable System" => [
+            "HealthTable: General Tables.jl Interface" => "HealthTableGeneral.md",
+            "HealthTable: OMOP CDM Support" => "HealthTableOMOPCDM.md",
+            "HealthTable: Preprocessing Functions" => "HealthTablePreprocessing.md",
+        ],
         "API" => "api.md",
     ],
     # TODO: Update and configure doctests before next release

docs/src/HealthTableGeneral.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# HealthTable: Tables.jl Interface (General)
+
+## The `HealthTable` Struct
+
+The core of the interface is the `HealthTable` struct.
+
+```@docs
+HealthBase.HealthTable
+```
+
+## `Tables.jl` API Implementation
+
+`HealthTable` implements the `Tables.jl` interface to ensure compatibility with the Julia data ecosystem.
+
+The following `Tables.jl` methods are implemented:
+
+```@docs
+Tables.istable(::Type{<:HealthBase.HealthTable})
+Tables.rowaccess(::Type{<:HealthBase.HealthTable})
+Tables.rows(::HealthBase.HealthTable)
+Tables.columnaccess(::Type{<:HealthBase.HealthTable})
+Tables.columns(::HealthBase.HealthTable)
+Tables.schema(::HealthBase.HealthTable)
+Tables.materializer(::Type{<:HealthBase.HealthTable})
+```
+
+Source: https://tables.juliadata.org/stable/implementing-the-interface/
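
For readers of this diff, the delegation pattern that page describes can be sketched with `Tables.jl` alone. This is a minimal illustration, not the `HealthBase.HealthTable` definition: `DemoTable` is a hypothetical wrapper, and a `NamedTuple` of vectors stands in for the wrapped `DataFrame`.

```julia
using Tables

# Hypothetical wrapper type for illustration; NOT HealthBase.HealthTable.
struct DemoTable{T}
    source::T
end

# Declare the type a table and forward every call to the wrapped source.
Tables.istable(::Type{<:DemoTable}) = true
Tables.columnaccess(::Type{<:DemoTable}) = true
Tables.columns(t::DemoTable) = Tables.columns(t.source)
Tables.rowaccess(::Type{<:DemoTable}) = true
Tables.rows(t::DemoTable) = Tables.rows(t.source)
Tables.schema(t::DemoTable) = Tables.schema(t.source)

# A NamedTuple of vectors is already a valid Tables.jl column table.
t = DemoTable((person_id = [101, 102], condition_concept_id = [201826, 433736]))

sch = Tables.schema(t)
println(sch.names)
for row in Tables.rows(t)
    println(row.person_id, " => ", row.condition_concept_id)
end
```

Because every method forwards to `source`, any `Tables.jl` consumer (for example `Tables.columntable`) can read the wrapper without knowing its type.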

docs/src/HealthTableOMOPCDM.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# OMOP CDM Support for HealthTable
+
+## Core Goals & Features
+
+The `HealthTable` interface in `HealthBase.jl` is designed to make working with OMOP CDM data in Julia easy, robust, and compatible with the `Tables.jl` ecosystem. The key features include:
+
+- **Schema-Aware Validation**: Instead of just wrapping your data, `HealthTable` actively validates it against the official OMOP CDM specification using `OMOPCommonDataModel.jl`. This includes:
+  - **Column Type Enforcement**: Verifies that column types in the input `DataFrame` match the official OMOP schema (e.g., `person_id` is `Int64`, `condition_start_date` is `Date`).
+  - **Clear Error Reporting**: If mismatches exist, the constructor reports every invalid column in detail, or emits warnings instead when type enforcement is disabled.
+  - **Metadata Attachment**: Attaches OMOP metadata (like `cdmDatatype`, `standardConcept`, etc.) directly to each validated column.
+
+- **Preprocessing Utilities**: Built-in tools for data preparation include:
+  - `one_hot_encode`: One-hot encodes categorical variables using `FeatureTransforms.jl`.
+  - `apply_vocabulary_compression`: Groups rare categorical values under a shared `"Other"` label.
+  - `map_concepts`: Maps concept IDs to human-readable concept names using a DuckDB-backed `concept` table.
+  - `map_concepts!`: An in-place variant of concept mapping that modifies the existing table.
+
+- **Tables.jl Compatibility**: The `HealthTable` type implements the full `Tables.jl` interface so it can be used with any downstream package in the Julia data ecosystem.
+
+- **JuliaHealth Integration**: Designed to interoperate seamlessly with current and future JuliaHealth tools and projects.
+
+- **Extensible Foundation**: The core architecture is extensible; future support could include streaming, direct DuckDB views, or remote OMOP datasets.
+
+
+## `Tables.jl` Interface Sketch
+
+The `HealthTable` type is the main interface for working with OMOP CDM tables. You construct it by passing in a `DataFrame` and optionally specifying a CDM version. The constructor will validate the schema and attach metadata. The resulting object:
+
+- Is a wrapper over the validated DataFrame (`ht.source`),
+- Provides schema-aware access to data,
+- Can be used anywhere a `Tables.jl`-compatible table is expected.
+
+This eliminates the need for a separate wrapping step: the constructor itself ensures conformance and returns a ready-to-use tabular object.
+
+In future extensions, similar wrappers could be created for other data sources, such as database queries or streaming sources. These types would implement the same `Tables.jl` interface to support composable workflows.
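
The column type enforcement and error reporting described on that page can be illustrated in plain Julia. This is a sketch under stated assumptions rather than the extension's implementation: `EXPECTED_TYPES`, `check_column_types`, and the `enforce` keyword are hypothetical names, and the two-column schema is a stand-in for what `OMOPCommonDataModel.jl` would supply.

```julia
using Dates

# Hypothetical expected schema; a real one would come from the OMOP CDM specification.
const EXPECTED_TYPES = Dict(
    :person_id => Int64,
    :condition_start_date => Date,
)

# Collect every mismatch before failing, so the caller sees all problems at once.
function check_column_types(columns::NamedTuple; enforce::Bool = true)
    problems = String[]
    for (name, expected) in EXPECTED_TYPES
        haskey(columns, name) || continue  # absent columns are ignored in this sketch
        actual = eltype(columns[name])
        actual <: expected || push!(problems, "$name: expected $expected, got $actual")
    end
    if !isempty(problems)
        msg = "Column type mismatches:\n" * join(problems, "\n")
        if enforce
            throw(ArgumentError(msg))  # strict mode: fail on any mismatch
        else
            @warn msg                  # lenient mode: report and continue
        end
    end
    return problems
end

dates = [Date(2010, 1, 1), Date(2012, 5, 10)]
good = (person_id = Int64[101, 102], condition_start_date = dates)
bad = (person_id = ["101", "102"], condition_start_date = dates)

check_column_types(good)                  # no mismatches, returns an empty list
check_column_types(bad; enforce = false)  # warns about person_id, returns one problem
```

Gathering all mismatches before throwing mirrors the "all invalid columns" wording above: the user can fix the table in one pass instead of one error at a time.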

docs/src/HealthTablePreprocessing.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# HealthTable: Preprocessing Functions
+
+This page documents the preprocessing and transformation functions available for `HealthTable` objects when working with OMOP CDM data. These functions are provided by the OMOP CDM extension and enable data preparation workflows for machine learning and analysis.
+
+## One-Hot Encoding
+
+Transform categorical variables into binary indicator columns suitable for machine learning algorithms.
+
+```@docs
+HealthBase.one_hot_encode
+```
+
+## Vocabulary Compression
+
+Reduce the dimensionality of categorical variables by grouping infrequent levels under a common label.
+
+```@docs
+HealthBase.apply_vocabulary_compression
+```
+
+## Concept Translation
+
+### Concept Mapping (Non-Mutating)
+
+Map OMOP concept IDs to human-readable concept names using the OMOP vocabulary tables, returning a new `HealthTable`.
+
+```@docs
+HealthBase.map_concepts
+```
+
+### Concept Mapping (In-Place)
+
+In-place version of concept mapping that modifies the original `HealthTable` directly for memory efficiency.
+
+```@docs
+HealthBase.map_concepts!
+```
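
To make the two transformations named on that page concrete, here is a dependency-free sketch of the underlying ideas. The helpers `compress_rare` and `one_hot` are hypothetical and operate on plain vectors; they are not the signatures of `apply_vocabulary_compression` or `one_hot_encode`, which work on `HealthTable` objects.

```julia
# Hypothetical helper: replace values seen fewer than `min_count` times with `other_label`.
function compress_rare(values::AbstractVector{<:AbstractString};
                       min_count::Integer = 2, other_label::String = "Other")
    counts = Dict{String,Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    return [counts[v] >= min_count ? String(v) : other_label for v in values]
end

# Hypothetical helper: build one Bool indicator column per distinct level.
function one_hot(values::AbstractVector)
    levels = sort(unique(values))
    return Dict(level => [v == level for v in values] for level in levels)
end

raw = ["M", "F", "F", "M", "X"]
compressed = compress_rare(raw)   # "X" occurs once, so it is relabelled "Other"
indicators = one_hot(compressed)  # keys: "F", "M", "Other"
```

Compressing before encoding keeps the number of indicator columns bounded, which is the dimensionality reduction the section refers to.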

docs/src/OMOPCDMWorkflow.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+# OMOP CDM Workflow with HealthTable
+
+## Typical Workflow
+
+The envisioned process for working with OMOP CDM data using the `HealthBase.jl` components typically follows these steps:
+
+1. **Data Loading**
+   Raw data is loaded into a suitable tabular structure, most commonly a `DataFrame`.
+
+2. **Validation and Wrapping with `HealthTable`**
+   The raw `DataFrame` is then wrapped using `HealthBase.HealthTable`. The constructor takes the `DataFrame` and an OMOP CDM version (e.g., "v5.4.1", passed as `omop_cdm_version`) and validates its structure and column types against the OMOP CDM schema.
+
+   - It checks if the column types are compatible with the expected OMOP CDM types (from `OMOPCommonDataModel.jl`).
+   - If `disable_type_enforcement = false`, it will throw errors on mismatches or attempt safe conversions.
+   - It attaches metadata to columns indicating their OMOP CDM types.
+   - The result is a `HealthTable` instance that wraps the validated `DataFrame` and exposes the `Tables.jl` interface.
+
+3. **Interacting via `Tables.jl`**
+   Once wrapped, the `HealthTable` instance can be seamlessly used with any `Tables.jl`-compatible tools and standard `Tables.jl` functions.
+
+4. **Applying Preprocessing Utilities**
+   After wrapping, you can apply preprocessing steps essential for analysis or modeling. These include:
+
+   - One-hot encoding
+   - Handling of high-cardinality categorical variables
+   - Concept mapping utilities
+
+   These utilities usually return a modified `HealthTable` or a materialized `DataFrame` ready for downstream use.
+
+## Example Usage
+
+```julia
+using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Dates, FeatureTransforms, DBInterface, DuckDB, Tables
+using HealthBase
+
+# Assume 'condition_occurrence_df' is a DataFrame loaded from a CSV/database
+condition_occurrence_df = DataFrame(
+    condition_occurrence_id = [1, 2, 3],
+    person_id = [101, 102, 101],
+    condition_concept_id = [201826, 433736, 317009],
+    condition_start_date = [Date(2010,1,1), Date(2012,5,10), Date(2011,3,15)]
+    # ... other fields
+)
+
+# Validate and wrap the DataFrame with HealthTable
+ht_conditions = HealthTable(condition_occurrence_df; omop_cdm_version="v5.4.1")
+
+# 1. Schema Inspection
+sch = Tables.schema(ht_conditions)
+println("Schema Names: ", sch.names)
+println("Schema Types: ", sch.types)
+# This should output the names and types from the validated DataFrame
+
+# 2. Iteration (Rows)
+for row in Tables.rows(ht_conditions)
+    # 'row' is a Tables.jl-compatible row whose fields match the OMOP schema
+    println("Person ID: $(row.person_id), Condition: $(row.condition_concept_id)")
+end
+
+# 3. Integration with other packages (example: MLJ.jl)
+# 4. Materialization
+# DataFrame(ht_conditions)
+```
+
+## Preprocessing and Utilities
+
+Preprocessing utilities can operate on `HealthTable` objects (or their materialized versions), leveraging the `Tables.jl` interface and schema awareness derived via `Tables.schema`.
+
+Examples include:
+
+- `one_hot_encode(ht::HealthTable, column_symbol::Symbol; drop_original=true)`
+- `apply_vocabulary_compression(ht::HealthTable, column_symbol::Symbol, mapping_dict::Dict)`
+- `map_concepts(ht::HealthTable, column_symbol::Symbol, concept_map::AbstractDict)`
+- `map_concepts!(ht::HealthTable, column_symbol::Symbol, concept_map::AbstractDict)` *(in-place version)*
+
+These transformations are optional: they run only when the user calls them, and they are configured via keyword arguments.
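
The difference between the non-mutating and in-place concept-mapping variants listed above can be sketched on plain Julia containers. Everything here is an assumption made for illustration: `map_ids`, `map_ids!`, the `suffix` keyword, and the placeholder concept names are hypothetical, and a `Dict` of columns stands in for a `HealthTable`.

```julia
# Hypothetical lookup table with placeholder names; real names would come
# from an OMOP vocabulary `concept` table.
concept_map = Dict(201826 => "placeholder name A", 317009 => "placeholder name B")

# Non-mutating: returns a new vector and leaves `ids` untouched.
# IDs without an entry in the map become `missing`.
map_ids(ids::AbstractVector, concept_map::AbstractDict) =
    [get(concept_map, id, missing) for id in ids]

# In-place: adds a mapped column to the existing collection of columns.
function map_ids!(columns::AbstractDict{Symbol}, col::Symbol,
                  concept_map::AbstractDict; suffix::String = "_mapped")
    columns[Symbol(col, suffix)] = map_ids(columns[col], concept_map)
    return columns
end

ids = [201826, 433736, 317009]
mapped = map_ids(ids, concept_map)

table = Dict{Symbol,Any}(:condition_concept_id => ids)
map_ids!(table, :condition_concept_id, concept_map)
```

Following Julia convention, the `!` suffix marks the function that modifies its argument; the non-mutating form is safer when the original data must stay intact.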

docs/src/api.md

Lines changed: 8 additions & 0 deletions
@@ -9,4 +9,12 @@ CurrentModule = HealthBase

 ```@autodocs
 Modules = [HealthBase]
+Filter = t -> !(t in [HealthBase.HealthTable,
+    Base.getproperty(Tables, :columns),
+    Base.getproperty(Tables, :rows),
+    Base.getproperty(Tables, :schema),
+    Base.getproperty(Tables, :istable),
+    Base.getproperty(Tables, :rowaccess),
+    Base.getproperty(Tables, :columnaccess),
+    Base.getproperty(Tables, :materializer)])
 ```
