Project Title: Gradient Boosted Trees & Feature Engineering - Explanatory and Predictive Modeling of Stroke Data (Health)
This project analyzes health data to understand the factors influencing strokes. It combines data preprocessing, exploratory data analysis (EDA), statistical inference, and machine learning to frame stroke risk as a classification task, with the aim of building accurate predictive models using Gradient Boosted Trees and advanced feature engineering.
The primary objective is to identify key features and build robust machine learning models to predict stroke risk. This project also aims to:
- Explore the relationships between patient features, including age, BMI, glucose level, work type, hypertension, heart disease, and more.
- Perform statistical inference to uncover significant relationships and factors associated with stroke risk.
- Identify the most important predictors of stroke risk and evaluate their predictive power.
- Build and evaluate Gradient Boosted Trees models to predict the likelihood of stroke.
Key findings:
- Age is the most decisive predictor of stroke risk in this dataset, with a sharp increase in risk after 40 and again after 65.
- High glucose levels (indicative of diabetes) are a significant predictor, while BMI and marital status show a notable relationship primarily in relation to age. Other key risk factors include hypertension and work type. In contrast, gender, heart disease, former smoking, and residence location do not appear to have an impact.
- Models faced challenges in predicting strokes due to class imbalance, even after applying algorithm-level countermeasures during training. A simple age-only regression baseline achieved 80% recall but low precision (PR AUC: 19.5%), while the best model (XGBoost) improved to 84% recall and a PR AUC of 24.6%. Adjusting the decision threshold slightly improved precision at the cost of recall, as illustrated in the sketch below.
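Threshold adjustment works by scanning the precision-recall curve for an operating point that trades the two metrics against each other. Below is a minimal sketch, assuming validation labels y_val and positive-class scores proba from a fitted classifier; the names and the selection rule are illustrative, not the project's actual code:

```python
from sklearn.metrics import precision_recall_curve, auc

def pick_threshold(y_val, proba, target_recall=0.80):
    """Return the threshold with the best precision among those keeping
    recall >= target_recall, plus the PR AUC (illustrative helper)."""
    precision, recall, thresholds = precision_recall_curve(y_val, proba)
    pr_auc = auc(recall, precision)
    # zip truncates to len(thresholds); the final (precision, recall)
    # point of the curve has no associated threshold.
    candidates = [(t, p) for p, r, t in zip(precision, recall, thresholds)
                  if r >= target_recall]
    threshold, best_precision = max(candidates, key=lambda tp: tp[1])
    return threshold, best_precision, pr_auc
```

Raising target_recall pushes the chosen threshold down and precision with it, which is exactly the recall/precision trade-off described above.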
The dataset is a snapshot rather than a longitudinal record, which introduces censoring issues: strokes that occurred before data collection (left-censoring) and strokes that will occur after it (right-censoring) are not properly accounted for, making the data unreliable for true incidence prediction. To address this, the project reframes the problem as risk assessment, using feature engineering to introduce clinical risk scores and composite health indicators while keeping "stroke" as the target variable.
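As an illustration of this feature-engineering direction, the sketch below derives simple clinical-style indicators from the raw columns. The column names follow the Kaggle source; the cutoffs and scoring weights are illustrative assumptions, not the project's actual risk scores:

```python
import pandas as pd

def add_risk_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative clinical-style risk features; cutoffs are assumptions."""
    out = df.copy()
    # Age bands reflecting the risk jumps after 40 and 65 noted in the findings.
    out["age_band"] = pd.cut(out["age"], bins=[0, 40, 65, 120],
                             labels=["under_40", "40_to_65", "over_65"])
    # Hyperglycemia flag: a common fasting-glucose diabetes cutoff (mg/dL).
    out["high_glucose"] = (out["avg_glucose_level"] >= 126).astype(int)
    # Composite indicator: one point per major risk factor (weights illustrative).
    out["risk_score"] = (
        out["hypertension"]
        + out["high_glucose"]
        + (out["age"] >= 65).astype(int)
    )
    return out

df = add_risk_features(pd.read_csv("data/healthcare-dataset-stroke-data.csv"))
```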
The model is deployed via an API that serves real-time predictions from the local machine.
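For illustration, a request to the running service might look like the sketch below; the port, route, and response format are assumptions and should be checked against api/app.py:

```python
import requests

# Hypothetical patient record shaped like the training data (Kaggle column names).
patient = {
    "gender": "Female", "age": 67, "hypertension": 0, "heart_disease": 1,
    "ever_married": "Yes", "work_type": "Private", "Residence_type": "Urban",
    "avg_glucose_level": 228.69, "bmi": 36.6, "smoking_status": "formerly smoked",
}
# Assumed endpoint; the actual route and port are defined in api/app.py.
response = requests.post("http://localhost:9696/predict", json=patient)
print(response.json())
```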
To set up this project locally:
- Clone the repository:
git clone https://github.com/razzf/stroke-prediction-machine-learning
- Navigate to the project directory:
cd stroke-prediction-machine-learning
- Install required packages (ensure Python is installed):
pip install -r requirements.txt
Open the notebook in Jupyter or JupyterLab to explore the analysis. Execute the cells sequentially to understand the workflow, from data exploration to model building and evaluation. For an in-depth exploration, refer to the notebook overview below.
The dataset is located in the /data directory. It was originally obtained from Kaggle and reflects a collection of personal health data from an unknown source. It contains records for 5,110 people/patients with 10 features (e.g. age, BMI, average glucose level, work type, residence type, smoking status, hypertension, heart disease, ever married) and one target variable indicating whether the person ever had a stroke.
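A quick look at the raw file illustrates the size and the class imbalance discussed above; the file path is taken from the project tree below:

```python
import pandas as pd

df = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
print(df.shape)                                   # (5110, 12): id + 10 features + target
print(df["stroke"].value_counts(normalize=True))  # heavily imbalanced toward no stroke
```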
project-root/
├── api/
│ └── app.py # API script for model deployment
├── custom_modules/
│ ├── custom_transformers.py # Module for custom pipeline transformers
│ ├── plotting.py # Module for plotting visualizations
│ └── stat_calculations.py # Module for statistical calculations
├── data/
│ ├── healthcare-dataset-stroke-data.csv # Stroke prediction dataset
│ └── stroke_data_prepared.pkl # Prepared dataset after cleaning and preprocessing (notebook_1), used for training
├── notebook/
│ ├── data preparation, EDA, statistical inference.ipynb # Jupyter notebook_1 for data prep, EDA, and statistical inference
│ └── machine learning modeling.ipynb # Jupyter notebook_2 for machine learning modeling and evaluation
├── Dockerfile # Instructions for setting up the environment for deployment
├── optimized_model.pkl # Trained model and optimized threshold for prediction (loading sketch below)
├── requirements.txt # Python dependencies
└── README.md # Project documentation
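As noted in the tree, optimized_model.pkl bundles the trained model with its tuned decision threshold. A minimal loading sketch, assuming the pickle stores a (model, threshold) pair; the exact structure may differ:

```python
import pickle
import pandas as pd

# Assumption: the pickle holds the fitted pipeline together with its tuned threshold.
with open("optimized_model.pkl", "rb") as f:
    model, threshold = pickle.load(f)

# One hypothetical patient, shaped like the training data.
X_new = pd.DataFrame([{
    "gender": "Male", "age": 72, "hypertension": 1, "heart_disease": 0,
    "ever_married": "Yes", "work_type": "Self-employed", "Residence_type": "Rural",
    "avg_glucose_level": 110.5, "bmi": 29.1, "smoking_status": "never smoked",
}])
proba = model.predict_proba(X_new)[:, 1]
print(int(proba[0] >= threshold))  # 1 = flagged as at risk at the tuned threshold
```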
The requirements.txt file lists all Python dependencies. Install them using the command provided above.
The notebooks include the following sections:
Notebook 1: Data Preparation, EDA, and Statistical Inference
- Introduction
- Problem Discovery
- Data Acquisition
- Exploratory Data Analysis
- Statistical Inference and Evaluation
- Suggestions for Improvement
Notebook 2: Machine Learning Modeling
- Introduction
- Further Data Preparation
- Feature Engineering
- Model Training, Evaluation, and Tuning
- Deployment