razzf/stroke-prediction-machine-learning

Project Title: Gradient Boosted Trees & Feature Engineering - Explanatory and Predictive Modeling of Stroke Data (Health)

Project Overview

This project analyzes health data to understand the factors influencing strokes. It involves data preprocessing, exploratory data analysis (EDA), statistical inference, and machine learning techniques to develop models for the classification task of predicting stroke risk. The aim is to build accurate predictive models using Gradient Boosted Trees and advanced feature engineering to improve stroke risk prediction.

Objectives

The primary objective is to identify key features and build robust machine learning models to predict stroke risk. This project also aims to:

  • Explore the relationships between patient features, including age, BMI, glucose level, work type, hypertension, heart disease, and more.

  • Perform statistical inference to uncover significant relationships and factors associated with stroke risk.

  • Identify the most important predictors for stroke risk and evaluate the predictive power of these features.

  • Build and evaluate Gradient Boosted Trees models to predict the likelihood of stroke.

Key Insights

  • Age is the most decisive predictor of stroke risk in this dataset, with a sharp increase in risk after 40 and again after 65.

  • Diabetes (high glucose levels) is a significant predictor, while BMI and marital status show a notable relationship primarily at the age-ratio scale. Other key risk factors include hypertension and work type. In contrast, gender, heart disease, former smoking, and residence location do not appear to have an impact.

  • Models faced challenges in predicting strokes due to class imbalance, even after implementing algorithm-based solutions during training to address this issue. A simple age-only regression model (baseline) achieved 80% recall but low precision (PR AUC: 19.5%), while the best model (XGBoost) improved to 84% recall and PR AUC of 24.6%. Adjusting the decision threshold slightly improved precision at the cost of recall.
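The trade-off described in the last insight can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the notebook's code: a class-weighted logistic regression stands in for XGBoost, the data come from `make_classification` rather than the stroke dataset, and average precision is used as one common estimate of PR AUC. The helper name `precision_recall_at` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=5000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# class_weight="balanced" is one algorithm-level way to counter class imbalance.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Average precision summarizes the precision-recall curve in a single number.
pr_auc = average_precision_score(y_test, proba)
print(f"PR AUC (average precision): {pr_auc:.3f}")


def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold count as positive."""
    pred = (proba >= threshold).astype(int)
    precision = precision_score(y_test, pred, zero_division=0)
    recall = recall_score(y_test, pred)
    return precision, recall


# Raising the threshold can only shrink the set of predicted positives,
# so recall never increases; precision often (not always) improves.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

With a rare positive class, recall can be high while precision stays low, which is why the PR curve is a more informative summary here than accuracy.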

Important Note: Dataset Limitations and Censoring Issues

The dataset’s snapshot nature leads to censoring issues, as it does not track patients over time. This means past strokes (left-censoring) and future strokes (right-censoring) are not properly accounted for, making the data unreliable for true risk prediction. To address this, the project reframes the problem as risk assessment, using feature engineering to introduce clinical risk scores and composite health indicators while maintaining "stroke" as the target variable.
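As a rough illustration of what a composite health indicator could look like, the sketch below counts how many simple risk flags apply to each patient. Everything in it is an assumption for illustration: the function name `add_risk_factor_count`, the column names (`age`, `hypertension`, `heart_disease`, `avg_glucose_level`), and the cut-offs, which are not a validated clinical score and are not taken from the notebooks.

```python
import pandas as pd


def add_risk_factor_count(df):
    """Return a copy of df with a hypothetical composite risk indicator.

    The indicator counts how many of four illustrative flags are set.
    Column names and cut-offs are assumptions, not the project's definitions.
    """
    out = df.copy()
    out["risk_factor_count"] = (
        (out["age"] >= 65).astype(int)                   # older age band
        + out["hypertension"].astype(int)                # 0/1 flag
        + out["heart_disease"].astype(int)               # 0/1 flag
        + (out["avg_glucose_level"] >= 126).astype(int)  # illustrative cut-off
    )
    return out


# Tiny made-up frame so the sketch runs on its own.
toy = pd.DataFrame(
    {
        "age": [30, 70],
        "hypertension": [0, 1],
        "heart_disease": [0, 0],
        "avg_glucose_level": [90.0, 150.0],
    }
)
print(add_risk_factor_count(toy))
```

The target column is left untouched; the new feature is only an additional input for the model.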

Model Deployment

The model is served through an API (api/app.py) that returns real-time predictions; incoming requests are forwarded to the local machine running the service.

Installation

To set up this project locally:

  1. Clone the repository:
    git clone https://github.com/razzf/stroke-prediction-machine-learning
  2. Navigate to the project directory:
    cd stroke-prediction-machine-learning
  3. Install required packages: Ensure Python is installed and use the following command:
    pip install -r requirements.txt

Usage

Open the notebook in Jupyter or JupyterLab to explore the analysis. Execute the cells sequentially to understand the workflow, from data exploration to model building and evaluation. For an in-depth exploration, refer to the notebook overview below.

Data

The dataset is located in the /data directory and was originally obtained from Kaggle. It is a collection of personal health data from an unknown source, covering 5,110 patients and 10 features (e.g. age, BMI, average glucose level, work type, residence type, smoking status, hypertension, heart disease, and whether the person was ever married), plus one target variable indicating whether the person has ever had a stroke.
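A quick way to see the class imbalance is to look at the share of each class in the target column. The sketch below assumes the target column is named `stroke`; the helper `class_balance` is hypothetical, and a tiny made-up frame replaces the real CSV so the snippet runs on its own.

```python
import pandas as pd


def class_balance(df, target="stroke"):
    """Return the share of each class in the target column."""
    return df[target].value_counts(normalize=True).sort_index()


# With the repository checked out, the frame would be loaded from the CSV
# listed under data/ in the directory structure, for example:
#   df = pd.read_csv("data/healthcare-dataset-stroke-dataon.csv")
# A small made-up frame is used here instead (assumption for illustration).
df = pd.DataFrame({"age": [34, 67, 81, 45], "stroke": [0, 0, 1, 0]})

balance = class_balance(df)
print(balance)
```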

Directory Structure

project-root/
├── api/
│   └── app.py                         # API script for model deployment
├── custom_modules/
│   ├── custom_transformers.py         # Module for custom pipeline transformers
│   ├── plotting.py                    # Module for plotting visualizations
│   └── stat_calculations.py           # Module for statistical calculations
├── data/
│   ├── healthcare-dataset-stroke-dataon.csv  # Stroke prediction dataset
│   └── stroke_data_prepared.pkl              # Prepared dataset after cleaning and preprocessing (notebook_1), used for training
├── notebook/
│   ├── data preparation, EDA, statistical inference.ipynb   # Jupyter notebook_1 for data prep, EDA, and statistical inference
│   └── machine learning modeling.ipynb                      # Jupyter notebook_2 for machine learning modeling and evaluation
├── Dockerfile                         # Instructions for setting up the environment for deployment
├── optimized_model.pkl                # Trained model and optimized threshold for prediction
├── requirements.txt                   # Python dependencies
└── README.md                          # Project documentation

Requirements

The requirements.txt file lists all Python dependencies. Install them using the command provided above.

Notebook Overview

The notebooks include the following sections:

Notebook 1: Data Preparation, EDA, and Statistical Inference

  1. Introduction
  2. Problem Discovery
  3. Data Acquisition
  4. Exploratory Data Analysis
  5. Statistical Inference and Evaluation
  6. Suggestions for Improvement

Notebook 2: Machine Learning Modeling

  1. Introduction
  2. Further Data Preparation
  3. Feature Engineering
  4. Model Training, Evaluation, and Tuning
  5. Deployment
