This repo uses Databricks Lakeflow Declarative Pipelines for end-to-end, production-grade ETL with minimal operational overhead. The project is perfect for data engineers, analysts, and healthcare professionals looking to ramp up on both modern lakehouse technology and payer-specific analytics.
> [!IMPORTANT]
> This bundle uses Serverless compute, so make sure it is enabled for your workspace (it also works on Databricks Free Edition). If it is not, you need to adjust the compute settings of the job and the DLT pipelines.
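If Serverless compute cannot be enabled, the pipelines in the bundle need an explicit compute configuration instead. A minimal sketch of the relevant setting, assuming the pipelines are declared as bundle resources (the resource name and layout below are assumptions, not this repo's actual config):

```yaml
# Sketch only - resource name is hypothetical; check the bundle's resource files.
resources:
  pipelines:
    ingest_payer_bronze_data:
      name: "DLT Payer Demo: Ingest Bronze data"
      serverless: true  # set to false and add a `clusters:` block if Serverless is disabled
```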
You can install the project in two ways:
- using Databricks Asset Bundles (DABs) inside the Databricks workspace (recommended)
- using DABs from the command line of your computer
1. Create a Git Folder inside your Databricks workspace by cloning this repository.
2. Open `payer_dlt/databricks.yaml` inside the created Git Folder.
3. Adjust the following parameters inside `databricks.yaml` (create the necessary objects before use):
   - `catalog_name` - the name of the existing UC Catalog used in the configuration.
   - `bronze_schema_name` - the name of an existing UC Schema for raw data.
   - `silver_schema_name` - the name of an existing UC Schema for tables with transformed data.
   - `gold_schema_name` - the name of an existing UC Schema for tables with reporting data.
4. Click the Deploy button in the Deployments tab on the left - this creates the necessary jobs and pipelines.
5. Click the Run button next to the `DLT Payer Demo: Setup` job.
6. Click Start pipeline for each DLT pipeline to process data and run detections, in the following order:
   1. `DLT Payer Demo: Ingest Bronze data`
   2. `DLT Payer Demo: Ingest Silver data`
   3. `DLT Payer Demo: Ingest Gold data`
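The parameter edits in step 3 might look like the following in `databricks.yaml`. The values here are placeholders, and the exact layout (e.g. whether the names are declared as bundle variables) may differ in this repo:

```yaml
# Example values only - replace with a catalog and schemas that already exist in Unity Catalog
variables:
  catalog_name:
    default: main
  bronze_schema_name:
    default: payer_bronze
  silver_schema_name:
    default: payer_silver
  gold_schema_name:
    default: payer_gold
```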
1. Install the latest version of the Databricks CLI.
2. Authenticate to your Databricks workspace, if you have not done so already:
   ```
   databricks configure
   ```
3. Set the environment variable `DATABRICKS_CONFIG_PROFILE` to the name of the Databricks CLI profile you configured, and configure the necessary variables in the `dev` profile of the `databricks.yml` file. You need to specify the following (create the necessary objects before use):
   - `catalog_name` - the name of the existing UC Catalog used in the configuration.
   - `bronze_schema_name` - the name of an existing UC Schema for raw data.
   - `silver_schema_name` - the name of an existing UC Schema for tables with transformed data.
   - `gold_schema_name` - the name of an existing UC Schema for tables with reporting data.
4. To deploy a development copy of this project, type:
   ```
   databricks bundle deploy
   ```
5. Run a job to set up the normalized tables and download sample log files:
   ```
   databricks bundle run dlt_payer_demo_setup
   ```
6. Run the DLT pipelines to ingest data into the bronze, silver, and gold tiers:
   ```
   databricks bundle run ingest_payer_bronze_data
   databricks bundle run ingest_payer_bronze_data_silver
   databricks bundle run ingest_payer_silver_data_gold
   ```
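If you manage several CLI profiles, the environment variable from step 3 can be set for the current shell session before running the bundle commands. The profile name below is an example, not one defined by this repo:

```shell
# Use the profile created earlier with `databricks configure`
# ("DEFAULT" is an example - substitute your own profile name)
export DATABRICKS_CONFIG_PROFILE=DEFAULT
```

You can also check the bundle configuration before deploying with `databricks bundle validate`, and tear down the deployed jobs and pipelines with `databricks bundle destroy` when you are done.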