HelicoPredict-new is a research-oriented project focused on identifying resistance biomarkers in Helicobacter pylori strains by applying various machine learning (ML) methods to mutation data. Our approach demonstrates that gene-wise aggregated association feature selection yields the most generalizable predictive performance and uncovers several loci with biological relevance to resistance.
Helicobacter pylori is a common bacterium linked to various gastrointestinal diseases. Understanding genetic mutations that confer antibiotic resistance is crucial for effective treatment. This project aims to:
- Apply multiple machine learning algorithms to mutation data from H. pylori strains.
- Systematically evaluate feature selection methods, with an emphasis on gene-wise aggregated association.
- Identify biomarkers and loci associated with antibiotic resistance, providing biological and clinical insights.
- Gene-wise Aggregated Association Feature Selection: This method outperforms other feature selection strategies in terms of generalizability and predictive accuracy.
- Biologically Relevant Loci: The approach successfully identifies key loci associated with resistance, validated by biological evidence.
- Data preprocessing and mutation encoding
- Implementation of various ML models (e.g., Random Forest, SVM, XGBoost, Feed-forward neural network)
- Performance aggregation and ensemble models for more robust evaluation
- Visualization and interpretation of ML results, i.e., SHAP scores
- Biological interpretation of putative resistance biomarkers
- Python 3.8+
- Encode data into Category A (SNV), Category B (nonsynonymous amino acid mutation), and Category C (loss of function)
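The three-category encoding can be illustrated with a minimal sketch. It is not the repository's implementation: the record fields (`gene`, `pos`, `ref`, `alt`, `effect`, `aa_change`), the effect labels, the set of effects treated as loss of function, and the toy gene names are all assumptions made for illustration. See the `data/` directory for the format the scripts actually expect.

```python
# Assumed definition of "loss of function"; the project may use a different one.
LOF_EFFECTS = {"stop_gained", "frameshift"}


def encode_strain(mutations):
    """Return sparse binary features for one strain, split by category."""
    encoded = {"A": {}, "B": {}, "C": {}}
    for m in mutations:
        # Category A: single-nucleotide variants, keyed by gene, position, and base change.
        if len(m["ref"]) == 1 and len(m["alt"]) == 1:
            encoded["A"][f'{m["gene"]}:{m["pos"]}{m["ref"]}>{m["alt"]}'] = 1
        # Category B: variants that change the encoded amino acid.
        if m["effect"] == "missense":
            encoded["B"][f'{m["gene"]}:{m["aa_change"]}'] = 1
        # Category C: one flag per gene carrying a loss-of-function variant.
        if m["effect"] in LOF_EFFECTS:
            encoded["C"][m["gene"]] = 1
    return encoded


def to_matrix(encoded_strains, category):
    """Stack sparse dicts into a dense 0/1 matrix with sorted column names."""
    columns = sorted({key for e in encoded_strains for key in e[category]})
    rows = [[e[category].get(c, 0) for c in columns] for e in encoded_strains]
    return columns, rows


# Made-up toy records (hypothetical genes and positions, not real data).
strain_1 = [
    {"gene": "geneX", "pos": 10, "ref": "A", "alt": "G",
     "effect": "missense", "aa_change": "K4R"},
    {"gene": "geneY", "pos": 5, "ref": "C", "alt": "T",
     "effect": "stop_gained", "aa_change": "Q2*"},
]
strain_2 = []  # a strain with no called mutations

encoded = [encode_strain(strain_1), encode_strain(strain_2)]
columns_a, matrix_a = to_matrix(encoded, "A")
```

Keeping the three categories in separate matrices makes it straightforward to train and compare models on each encoding independently.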
Clone this repository:
    git clone https://github.com/DiyuanLu/HelicoPredict-new.git
    cd HelicoPredict-new
Install required packages:
    pip install -r requirements.txt
Prepare your mutation data in the expected format (see the data/ directory for examples).
Run the main analysis script:
    python cluster_run_main.py
Outputs and results will be saved in the results/ directory.
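This README does not spell out how gene-wise aggregated association feature selection is computed, so the following is only a rough sketch of the general idea: score each mutation's association with the resistance label, aggregate the scores per gene, and keep the top-ranked genes. The phi coefficient as the association measure, the mean as the aggregation, the `gene:mutation` column naming, and the function names are all assumptions chosen for illustration. The project's scripts may differ on each of these.

```python
from collections import defaultdict
from math import sqrt


def phi_coefficient(feature, labels):
    """Absolute phi coefficient between a binary feature and binary labels."""
    pairs = list(zip(feature, labels))
    n11 = sum(1 for f, y in pairs if f == 1 and y == 1)
    n10 = sum(1 for f, y in pairs if f == 1 and y == 0)
    n01 = sum(1 for f, y in pairs if f == 0 and y == 1)
    n00 = sum(1 for f, y in pairs if f == 0 and y == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A constant feature or label has no defined association; score it as 0.
    return abs(n11 * n00 - n10 * n01) / denom if denom else 0.0


def select_genes(columns, matrix, labels, top_k):
    """Rank genes by the mean association of their mutations; keep top_k."""
    per_gene = defaultdict(list)
    for j, name in enumerate(columns):
        gene = name.split(":")[0]  # assumes "gene:mutation" column names
        feature = [row[j] for row in matrix]
        per_gene[gene].append(phi_coefficient(feature, labels))
    gene_scores = {g: sum(s) / len(s) for g, s in per_gene.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(gene_scores, key=lambda g: (-gene_scores[g], g))
    return ranked[:top_k], gene_scores


# Made-up toy data: 4 strains, 3 mutation features, 1 = resistant.
columns = ["geneX:m1", "geneX:m2", "geneY:m1"]
matrix = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 1],
]
labels = [1, 1, 0, 0]

selected, scores = select_genes(columns, matrix, labels, top_k=1)
```

A downstream model would then be trained only on the mutation columns belonging to the genes in `selected`.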
HelicoPredict-new/
├── data/ # Input data and data examples
├── src/ # Core scripts and modules
├── results/ # Output and result files
├── requirements