End-to-end data pipeline on GCP, using Airflow for ingestion and dbt for transformations. The system loads weather data from OpenWeatherMap, stores raw payloads in GCS/BigQuery, and produces analytics-ready tables for BI.
```
OWM API
  ↓
Airflow (batch ingestion)
  ↓
GCS Raw (bronze)
  ↓
BigQuery Raw
  ↓
dbt: stg → int → marts
  ↓
BigQuery Clean (silver/gold)
  ↓
Airflow (dbt post-run checks)
  ↓
Looker
```
- Endpoints: `/weather`, `/forecast` (JSON responses; fetch sketch below)
- API key authentication
- All timestamps normalized to UTC
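A minimal fetch sketch against those two endpoints; the env var name, city parameter, and helper function are illustrative, not taken from this repo:

```python
import os
import requests

OWM_BASE = "https://api.openweathermap.org/data/2.5"

def fetch_owm(endpoint: str, **params) -> dict:
    """Call /weather or /forecast and return the parsed JSON payload."""
    params["appid"] = os.environ["OWM_API_KEY"]  # API key authentication
    resp = requests.get(f"{OWM_BASE}/{endpoint}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

# OWM returns Unix timestamps in UTC, which downstream models keep in UTC.
current = fetch_owm("weather", q="Berlin", units="metric")
forecast = fetch_owm("forecast", q="Berlin", units="metric")
```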
- DAG: `owm_batch_bq.py` (skeleton sketched below)
- Schedule: every 4 hours (`0 */4 * * *`)
- Retry logic enabled
- Idempotent writes to GCS and BigQuery
- Logical time aligned to 4-hour windows
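A skeleton of how this DAG could be wired. The dag id, schedule, and retry/idempotency behavior follow the list above; the task ids, retry count, and helper stubs are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_owm_to_gcs(**context):
    """Fetch the OWM payload and write one NDJSON object to GCS (stub)."""
    ...

def load_gcs_to_bigquery(**context):
    """Load that object into the BigQuery raw dataset (stub)."""
    ...

with DAG(
    dag_id="owm_batch_bq",
    schedule="0 */4 * * *",            # every 4 hours
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,                  # retry logic enabled
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    # Keying outputs on data_interval_start aligns each run to its 4-hour
    # logical window, so re-runs overwrite the same GCS object and BigQuery
    # rows instead of duplicating them (idempotent writes).
    fetch = PythonOperator(task_id="fetch_to_gcs", python_callable=fetch_owm_to_gcs)
    load = PythonOperator(task_id="load_to_bq", python_callable=load_gcs_to_bigquery)
    fetch >> load
```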
- DAG: `dbt_monitoring.py`
- Schedule: every 4 hours at HH:30 (`30 */4 * * *`); the 30-minute offset gives the full pipeline (ingestion → dbt) time to finish first
- Purpose: post-dbt data quality checks (row counts, freshness, max-timestamp validations; one is sketched below)
- Triggers: runs according to its own schedule after dbt models are materialized
- Alerts: sends email on failure
- Scope: checks staging → intermediate → marts tables in BigQuery
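One such check could look like the sketch below; the table id, lag threshold, and timestamp column are assumptions based on the raw schema:

```python
from google.cloud import bigquery

def check_freshness(table: str, max_lag_hours: int = 5) -> None:
    """Fail if `table` is empty or its newest row exceeds the allowed lag."""
    client = bigquery.Client()
    query = f"""
        SELECT
            COUNT(*) AS row_count,
            TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(fetched_at), HOUR) AS lag_hours
        FROM `{table}`
    """
    row = next(iter(client.query(query).result()))
    if row.row_count == 0 or row.lag_hours > max_lag_hours:
        # Raising inside the Airflow task marks it failed, which in turn
        # triggers the DAG's email-on-failure alert.
        raise ValueError(f"{table}: rows={row.row_count}, lag={row.lag_hours}h")

check_freshness("my-project.clean.stg_weather")  # illustrative table id
```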
- NDJSON files, one per ingestion run
- Naming: `current_YYYYMMDD_HH.ndjson`
- Retention: 30 days
- Contents: `{fetched_at, source, data}` (writer sketch below)
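A writer sketch matching that naming and record shape; the bucket name and source label are placeholders:

```python
import json
from datetime import datetime, timezone

from google.cloud import storage

def write_raw_ndjson(payload: dict, bucket_name: str, window_start: datetime) -> None:
    """Write the run's NDJSON object, named current_YYYYMMDD_HH.ndjson."""
    record = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "source": "openweathermap",        # placeholder source label
        "data": payload,
    }
    blob_name = f"current_{window_start:%Y%m%d_%H}.ndjson"
    bucket = storage.Client().bucket(bucket_name)
    # The same window maps to the same object name, so re-runs overwrite
    # rather than duplicate.
    bucket.blob(blob_name).upload_from_string(json.dumps(record) + "\n")
```

The 30-day retention would typically sit in a bucket lifecycle rule rather than in code.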
- Dataset: `raw`
- Schema: `fetched_at TIMESTAMP`, `source STRING`, `data JSON`
- Daily partitioning on `fetched_at`
- Append-only, duplicate-safe (load example below)
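A load job matching that layout might look like this; the URI and table id are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField("fetched_at", "TIMESTAMP"),
        bigquery.SchemaField("source", "STRING"),
        bigquery.SchemaField("data", "JSON"),
    ],
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="fetched_at",                       # daily partitioning
    ),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append-only
)

client.load_table_from_uri(
    "gs://my-bucket/current_20240101_00.ndjson",  # placeholder URI
    "my-project.raw.weather",                     # placeholder table id
    job_config=job_config,
).result()
```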
- Separate service accounts for Airflow and dbt
- Least-privilege access (BQ + GCS scoped per component)
- No secrets stored in the repository
- Local development uses environment variables (e.g., `GOOGLE_APPLICATION_CREDENTIALS`)
- `stg`: normalization, renaming, typing, timestamp cleanup
- `int`: unified structure for current + forecast data
- `marts`: curated fact/dimension tables
Schema tests: `unique`, `not_null`, `accepted_values`, `relationships` (referential integrity).
Custom tests (the logic behind the first one is illustrated below):
- `check_duplicates`
- `check_rain_snow_logic`
- `check_timestamps`
- `check_tmp_wind_range`
- The dbt job runs 15 minutes after the ingestion DAG completes (scheduled at HH:16)
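The custom tests themselves are dbt assets; as a rough Python illustration of the assertion behind `check_duplicates`, where the key columns (the grain) are an assumption:

```python
from google.cloud import bigquery

def count_duplicates(table: str, keys: list[str]) -> int:
    """Count surplus rows that share the same key combination."""
    client = bigquery.Client()
    key_list = ", ".join(keys)
    query = f"""
        SELECT COALESCE(SUM(n - 1), 0) AS dupes
        FROM (SELECT COUNT(*) AS n FROM `{table}` GROUP BY {key_list})
        WHERE n > 1
    """
    return next(iter(client.query(query).result())).dupes

# Assumed grain: one row per source and fetch timestamp.
assert count_duplicates("my-project.clean.stg_weather", ["source", "fetched_at"]) == 0
```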
- Dataset: `clean`
- Tables:
  - staging layer (normalized raw data)
  - intermediate layer (unified and enriched transformations)
  - marts layer (analytics-ready dimensional + fact structures)
  - metadata for data-quality monitoring
- Materialization: `stg` + `int` → views; marts + metadata → tables
- Staging, intermediate, and metadata views live in the same `clean` dataset as the marts to keep the setup simple, given the small data volume
(To be added) Dashboards for temperature, humidity, precipitation, and forecast accuracy.
This repository covers:
- Ingestion pipeline design (Airflow → GCS → BigQuery)
- Raw → staged → modeled ELT flow using dbt
- Data modeling: grain definition, unified schema, typed fields
- Quality controls: schema tests, referential checks, and custom validations
- Monitoring: scheduled DAG for post-dbt data quality checks with alerts on failure
- Secure execution with isolated service accounts
Completed:
- Airflow batch ingestion
- GCS/BigQuery raw layers
- dbt project setup
- staging models + schema tests
- intermediate model (unified weather record)
- marts (fact/dim)
- Airflow dbt post-run checks
Planned:
- streaming simulation (Pub/Sub + Cloud Run): canceled due to free-tier limitations
- Looker dashboard
- CI/CD via GitHub Actions