|
| 1 | +# OMOP CDM Workflow with HealthTable |
| 2 | + |
| 3 | +## Typical Workflow |
| 4 | + |
| 5 | +The envisioned process for working with OMOP CDM data using the `HealthBase.jl` components typically follows these steps: |
| 6 | + |
| 7 | +1. **Data Loading** |
| 8 | + Raw data is loaded into a suitable tabular structure, most commonly a `DataFrame`. |
| 9 | + |
| 10 | +2. **Validation and Wrapping with `HealthTable`** |
| 11 | + The raw `DataFrame` is then wrapped using `HealthBase.HealthTable`. This constructor takes the `DataFrame` and uses the specified OMOP CDM version (e.g., "v5.4.1") to validate its structure and column types against the OMOP CDM schema. |
| 12 | + |
| 13 | + - It checks if the column types are compatible with the expected OMOP CDM types (from `OMOPCommonDataModel.jl`). |
| 14 | + - If `disable_type_enforcement = false`, it will throw errors on mismatches or attempt safe conversions. |
| 15 | + - It attaches metadata to columns indicating their OMOP CDM types. |
| 16 | + - The result is a `HealthTable` instance that wraps the validated `DataFrame` and exposes the `Tables.jl` interface. |
| 17 | + |
| 18 | +3. **Interacting via `Tables.jl`** |
| 19 | + Once wrapped, the `HealthTable` instance can be used with any `Tables.jl`-compatible tool and with the standard `Tables.jl` functions, such as `Tables.schema` and `Tables.rows`. |
| 20 | + |
| 21 | +4. **Applying Preprocessing Utilities** |
| 22 | + After wrapping, you can apply preprocessing steps essential for analysis or modeling. These include: |
| 23 | + |
| 24 | + - One-hot encoding |
| 25 | + - Handling of high-cardinality categorical variables |
| 26 | + - Concept mapping utilities |
| 27 | + |
| 28 | + These utilities usually return a modified `HealthTable` or a materialized `DataFrame` ready for downstream use. |
| 29 | + |
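The validation described in step 2 above can be pictured with a minimal sketch. It deliberately does not call `HealthBase.HealthTable` or `OMOPCommonDataModel.jl`: the `expected_types` dictionary and the `check_column_types` helper are hypothetical stand-ins introduced only for illustration, and the table is a plain `NamedTuple` of vectors rather than a `DataFrame`.

```julia
using Dates

# Hypothetical expected element types for a few condition_occurrence columns.
# (Assumption for illustration; the real schema comes from OMOPCommonDataModel.jl.)
expected_types = Dict(
    :condition_occurrence_id => Integer,
    :person_id => Integer,
    :condition_concept_id => Integer,
    :condition_start_date => Date,
)

# Collect (column, expected, actual) for every column whose element type does
# not match. Columns without an entry in `expected` are skipped.
function check_column_types(table::NamedTuple, expected::AbstractDict)
    mismatches = Tuple{Symbol,Type,Type}[]
    for (name, column) in pairs(table)
        haskey(expected, name) || continue
        actual = eltype(column)
        actual <: expected[name] || push!(mismatches, (name, expected[name], actual))
    end
    return mismatches
end

good = (person_id = [101, 102], condition_start_date = [Date(2010, 1, 1), Date(2012, 5, 10)])
bad = (person_id = ["101", "102"], condition_start_date = [Date(2010, 1, 1), Date(2012, 5, 10)])

# A strict wrapper could throw when the list is non-empty; a lenient one could warn.
problems_good = check_column_types(good, expected_types)
problems_bad = check_column_types(bad, expected_types)
```

In this sketch `problems_good` is empty, while `problems_bad` reports the `:person_id` column because its elements are strings. Whether a mismatch raises an error or only a warning is the kind of decision a flag such as `disable_type_enforcement` would control.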
| 30 | +## Example Usage |
| 31 | + |
| 32 | +```julia |
| 33 | +using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Dates, FeatureTransforms, DBInterface, DuckDB, Tables |
| 34 | +using HealthBase |
| 35 | + |
| 36 | +# Assume 'condition_occurrence_df' is a DataFrame loaded from a CSV/database |
| 37 | +condition_occurrence_df = DataFrame( |
| 38 | + condition_occurrence_id = [1, 2, 3], |
| 39 | + person_id = [101, 102, 101], |
| 40 | + condition_concept_id = [201826, 433736, 317009], |
| 41 | + condition_start_date = [Date(2010,1,1), Date(2012,5,10), Date(2011,3,15)] |
| 42 | + # ... other fields |
| 43 | +) |
| 44 | + |
| 45 | +# Validate and wrap the DataFrame with HealthTable |
| 46 | +ht_conditions = HealthTable(condition_occurrence_df; omop_cdm_version="v5.4.1") |
| 47 | + |
| 48 | +# 1. Schema Inspection |
| 49 | +sch = Tables.schema(ht_conditions) |
| 50 | +println("Schema Names: ", sch.names) |
| 51 | +println("Schema Types: ", sch.types) |
| 52 | +# This should output the names and types from the validated DataFrame |
| 53 | + |
| 54 | +# 2. Iteration (Rows) |
| 55 | +for row in Tables.rows(ht_conditions) |
| 56 | + # 'row' supports the Tables.jl row interface; its fields match the columns of the wrapped table |
| 57 | + println("Person ID: $(row.person_id), Condition: $(row.condition_concept_id)") |
| 58 | +end |
| 59 | + |
| 60 | +# 3. Integration with other packages (example: MLJ.jl) |
| 61 | +# 4. Materialization |
| 62 | +# DataFrame(ht_conditions) |
| 63 | +``` |
| 64 | + |
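To make the `Tables.jl` calls in the example less opaque, here is a minimal sketch of how a wrapper type can expose the interface by forwarding to the table it wraps. `WrappedTable` is a hypothetical stand-in and not the actual `HealthTable` implementation; a `NamedTuple` of vectors replaces the `DataFrame` so that the only dependency is `Tables.jl`.

```julia
using Tables

# Hypothetical wrapper around any Tables.jl-compatible source.
struct WrappedTable{T}
    source::T
end

# Declare the type a table with column access, then forward to the source.
Tables.istable(::Type{<:WrappedTable}) = true
Tables.columnaccess(::Type{<:WrappedTable}) = true
Tables.columns(wt::WrappedTable) = Tables.columns(wt.source)
Tables.schema(wt::WrappedTable) = Tables.schema(wt.source)

wt = WrappedTable((person_id = [101, 102, 101], condition_concept_id = [201826, 433736, 317009]))

sch = Tables.schema(wt)
# Row iteration comes from the generic Tables.jl fallback for column-access tables.
person_ids = [row.person_id for row in Tables.rows(wt)]
```

Because only column access is defined, `Tables.rows` falls back to the generic row iterator that `Tables.jl` builds on top of `Tables.columns`, so consumers that iterate rows work without further code.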
| 65 | +## Preprocessing and Utilities |
| 66 | + |
| 67 | +Preprocessing utilities can operate on `HealthTable` objects (or their materialized versions), leveraging the `Tables.jl` interface and schema awareness derived via `Tables.schema`. |
| 68 | + |
| 69 | +Examples include: |
| 70 | + |
| 71 | +- `one_hot_encode(ht::HealthTable, column_symbol::Symbol; drop_original=true)` |
| 72 | +- `apply_vocabulary_compression(ht::HealthTable, column_symbol::Symbol, mapping_dict::Dict)` |
| 73 | +- `map_concepts(ht::HealthTable, column_symbol::Symbol, concept_map::AbstractDict)` |
| 74 | +- `map_concepts!(ht::HealthTable, column_symbol::Symbol, concept_map::AbstractDict)` *(in-place version)* |
| 75 | + |
| 76 | +These functions follow the principle of user-triggered, optional transformations configurable via keyword arguments. |
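As a rough illustration of what two of these utilities do conceptually, the sketch below implements concept mapping and one-hot encoding on a plain `NamedTuple` of vectors using only Base Julia. The helpers `map_concepts_sketch` and `one_hot_encode_sketch` are hypothetical names chosen for this example. They are not the `HealthBase.jl` functions listed above and make no claim about how those are implemented.

```julia
# Hypothetical helper: replace each concept ID found in `concept_map` and leave
# unmapped IDs unchanged. Returns a new table; the input is not mutated.
function map_concepts_sketch(table::NamedTuple, column::Symbol, concept_map::AbstractDict)
    mapped = [get(concept_map, id, id) for id in getproperty(table, column)]
    return merge(table, NamedTuple{(column,)}((mapped,)))
end

# Hypothetical helper: add one Boolean indicator column per distinct value.
function one_hot_encode_sketch(table::NamedTuple, column::Symbol; drop_original::Bool = true)
    vals = getproperty(table, column)
    levels = sort(unique(vals))
    new_names = Tuple(Symbol(column, "_", level) for level in levels)
    indicators = Tuple(vals .== level for level in levels)
    kept = Tuple(k for k in keys(table) if !(drop_original && k == column))
    return merge(NamedTuple{kept}(table), NamedTuple{new_names}(indicators))
end

tbl = (person_id = [101, 102, 101], condition_concept_id = [201826, 433736, 201826])

# Map one concept ID to a (made-up) replacement code, then one-hot encode the original column.
mapped = map_concepts_sketch(tbl, :condition_concept_id, Dict(201826 => 1))
encoded = one_hot_encode_sketch(tbl, :condition_concept_id)
```

Both helpers return a new table and take their options as arguments, which mirrors the user-triggered, keyword-configurable style described above. The in-place variant `map_concepts!` would instead modify the wrapped data directly.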