Project Title: Gradient Boosted Trees & Feature Engineering - Explanatory and Predictive Modeling of Stroke Data (Health)
This project analyzes health data to understand the factors influencing strokes. It combines data preprocessing, exploratory data analysis (EDA), statistical inference, and machine learning to frame stroke risk as a classification task, with the aim of building accurate predictive models using Gradient Boosted Trees and advanced feature engineering.
The primary objective is to identify key features and build robust machine learning models to predict stroke risk. This project also aims to:
- Explore the relationships between patient features, including age, BMI, glucose level, work type, hypertension, heart disease, and more.
- Perform statistical inference to uncover significant relationships and factors associated with stroke risk.
- Identify the most important predictors of stroke risk and evaluate their predictive power.
- Build and evaluate Gradient Boosted Trees models to predict the likelihood of stroke.
Key findings:
- Age is the most decisive predictor of stroke risk in this dataset, with a sharp increase in risk after 40 and again after 65.
- High glucose levels (indicative of diabetes) are a significant predictor, while BMI and marital status show a notable relationship primarily in relation to age. Other key risk factors include hypertension and work type. In contrast, gender, heart disease, former smoking, and residence location do not appear to have an impact.
- Models faced challenges in predicting strokes due to class imbalance, even after applying algorithm-level countermeasures during training. A simple age-only regression baseline achieved 80% recall but low precision (PR AUC: 19.5%), while the best model (XGBoost) improved to 84% recall and a PR AUC of 24.6%. Adjusting the decision threshold slightly improved precision at the cost of recall, as illustrated in the sketch below.
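Threshold adjustment works by scanning the precision-recall curve for an operating point that trades the two metrics against each other. Below is a minimal sketch, assuming validation labels y_val and positive-class scores proba from a fitted classifier; the names and the selection rule are illustrative, not the project's actual code:

```python
from sklearn.metrics import precision_recall_curve, auc

def pick_threshold(y_val, proba, target_recall=0.80):
    """Return the threshold with the best precision among those keeping
    recall >= target_recall, plus the PR AUC (illustrative helper)."""
    precision, recall, thresholds = precision_recall_curve(y_val, proba)
    pr_auc = auc(recall, precision)
    # zip truncates to len(thresholds); the final (precision, recall)
    # point of the curve has no associated threshold.
    candidates = [(t, p) for p, r, t in zip(precision, recall, thresholds)
                  if r >= target_recall]
    threshold, best_precision = max(candidates, key=lambda tp: tp[1])
    return threshold, best_precision, pr_auc
```

Raising target_recall pushes the chosen threshold down and precision with it, which is exactly the recall/precision trade-off described above.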
The dataset is a snapshot rather than a longitudinal record, which introduces censoring issues: strokes that occurred before data collection (left-censoring) and strokes that will occur after it (right-censoring) are not properly accounted for, making the data unreliable for true incidence prediction. To address this, the project reframes the problem as risk assessment, using feature engineering to introduce clinical risk scores and composite health indicators while keeping "stroke" as the target variable.
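As an illustration of this feature-engineering direction, the sketch below derives simple clinical-style indicators from the raw columns. The column names follow the Kaggle source; the cutoffs and scoring weights are illustrative assumptions, not the project's actual risk scores:

```python
import pandas as pd

def add_risk_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative clinical-style risk features; cutoffs are assumptions."""
    out = df.copy()
    # Age bands reflecting the risk jumps after 40 and 65 noted in the findings.
    out["age_band"] = pd.cut(out["age"], bins=[0, 40, 65, 120],
                             labels=["under_40", "40_to_65", "over_65"])
    # Hyperglycemia flag: a common fasting-glucose diabetes cutoff (mg/dL).
    out["high_glucose"] = (out["avg_glucose_level"] >= 126).astype(int)
    # Composite indicator: one point per major risk factor (weights illustrative).
    out["risk_score"] = (
        out["hypertension"]
        + out["high_glucose"]
        + (out["age"] >= 65).astype(int)
    )
    return out

df = add_risk_features(pd.read_csv("data/healthcare-dataset-stroke-data.csv"))
```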
The model is deployed via an API that serves real-time predictions from the local machine.
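For illustration, a request to the running service might look like the sketch below; the port, route, and response format are assumptions and should be checked against api/app.py:

```python
import requests

# Hypothetical patient record shaped like the training data (Kaggle column names).
patient = {
    "gender": "Female", "age": 67, "hypertension": 0, "heart_disease": 1,
    "ever_married": "Yes", "work_type": "Private", "Residence_type": "Urban",
    "avg_glucose_level": 228.69, "bmi": 36.6, "smoking_status": "formerly smoked",
}
# Assumed endpoint; the actual route and port are defined in api/app.py.
response = requests.post("http://localhost:9696/predict", json=patient)
print(response.json())
```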
To set up this project locally:
- Clone the repository:
git clone https://github.com/razzf/stroke-prediction-machine-learning
- Navigate to the project directory:
cd stroke-prediction-machine-learning
- Install required packages (ensure Python is installed):
pip install -r requirements.txt
Open the notebook in Jupyter or JupyterLab to explore the analysis. Execute the cells sequentially to understand the workflow, from data exploration to model building and evaluation. For an in-depth exploration, refer to the notebook overview below.
The dataset is located in the /data directory. It was originally obtained from Kaggle and reflects a collection of personal health data from an unknown source. It contains records for 5,110 people/patients with 10 features (e.g. age, BMI, average glucose level, work type, residence type, smoking status, hypertension, heart disease, ever married) and one target variable indicating whether the person ever had a stroke.
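A quick look at the raw file illustrates the size and the class imbalance discussed above; the file path is taken from the project tree below:

```python
import pandas as pd

df = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
print(df.shape)                                   # (5110, 12): id + 10 features + target
print(df["stroke"].value_counts(normalize=True))  # heavily imbalanced toward no stroke
```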
project-root/
├── api/
│ └── app.py # API script for model deployment
├── custom_modules/
│ ├── custom_transformers.py # Module for custom pipeline transformers
│ ├── plotting.py # Module for plotting visualizations
│ └── stat_calculations.py # Module for statistical calculations
├── data/
│ ├── healthcare-dataset-stroke-data.csv # Stroke prediction dataset
│ └── stroke_data_prepared.pkl # Prepared dataset after cleaning and preprocessing (notebook_1), used for training
├── notebook/
│ ├── data preparation, EDA, statistical inference.ipynb # Jupyter notebook_1 for data prep, EDA, and statistical inference
│ └── machine learning modeling.ipynb # Jupyter notebook_2 for machine learning modeling and evaluation
├── Dockerfile # Instructions for setting up the environment for deployment
├── optimized_model.pkl # Trained model and optimized threshold for prediction (loading sketch below)
├── requirements.txt # Python dependencies
└── README.md # Project documentation
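As noted in the tree, optimized_model.pkl bundles the trained model with its tuned decision threshold. A minimal loading sketch, assuming the pickle stores a (model, threshold) pair; the exact structure may differ:

```python
import pickle
import pandas as pd

# Assumption: the pickle holds the fitted pipeline together with its tuned threshold.
with open("optimized_model.pkl", "rb") as f:
    model, threshold = pickle.load(f)

# One hypothetical patient, shaped like the training data.
X_new = pd.DataFrame([{
    "gender": "Male", "age": 72, "hypertension": 1, "heart_disease": 0,
    "ever_married": "Yes", "work_type": "Self-employed", "Residence_type": "Rural",
    "avg_glucose_level": 110.5, "bmi": 29.1, "smoking_status": "never smoked",
}])
proba = model.predict_proba(X_new)[:, 1]
print(int(proba[0] >= threshold))  # 1 = flagged as at risk at the tuned threshold
```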
The requirements.txt file lists all Python dependencies. Install them using the command provided above.
The notebooks include the following sections:
Notebook 1: Data Preparation, EDA, and Statistical Inference
- Introduction
- Problem Discovery
- Data Acquisition
- Exploratory Data Analysis
- Statistical Inference and Evaluation
- Suggestions for Improvement
Notebook 2: Machine Learning Modeling
- Introduction
- Further Data Preparation
- Feature Engineering
- Model Training, Evaluation, and Tuning
- Deployment