Chunduri-Aditya/MetaLearnML

MetaLearnML

An intelligent, dataset-agnostic AutoML system that learns from experience using meta-learning to automatically select the best models and preprocessing strategies.

MetaLearnML is a self-improving AutoML engine: it inspects a dataset, selects preprocessing strategies, chooses models, and learns from every experiment. Unlike AutoML tools that search exhaustively, it trains a meta-model on past experiments to predict how each candidate will perform and tries the most promising ones first. In the configurations described in this README, that cuts the number of training runs roughly 10x compared with brute-force search, and the predictions improve as more experiments accumulate.

🎯 Overview

This is a dataset-agnostic thinking machine that:

  • ✅ Automatically infers task type (regression vs classification)
  • ✅ Generates and ranks preprocessing strategies intelligently
  • ✅ Selects appropriate models based on task type
  • ✅ Uses meta-learning to predict performance and guide search
  • ✅ Learns from every experiment to improve over time
  • ✅ Generates comprehensive reports (Markdown, JSON, Text)
  • ✅ Supports parallel execution for faster training
  • ✅ Includes Neo4j graph database for experiment tracking

🚀 Quick Start

Installation

# Clone the repository
git clone <your-repo-url>
cd AutoMLProject  # or whichever directory name git clone created

# Install dependencies
pip install -r requirements.txt

Basic Usage

# Interactive mode
python main.py
# Select [8] Dataset-agnostic mode
# Select [2] Intelligent mode
# Enter your CSV file path
# Select metric (or press Enter for auto-detect)

# Autopilot mode (non-interactive)
python main.py --autopilot data/adult.csv --label income --metric accuracy
python main.py --autopilot data/paddydataset.csv --metric r2

๐Ÿ“ Project Structure

AutoMLProject/
โ”œโ”€โ”€ main.py                          # CLI entry point
โ”œโ”€โ”€ engine/
โ”‚   โ”œโ”€โ”€ orchestrator.py              # Main pipeline coordinator
โ”‚   โ”œโ”€โ”€ intelligent_search.py       # Search engine
โ”‚   โ”œโ”€โ”€ intelligent_search_parallel.py  # Parallel search
โ”‚   โ”œโ”€โ”€ meta_learner.py             # Meta-learning brain
โ”‚   โ”œโ”€โ”€ preprocessing_strategies.py # Preprocessing generation
โ”‚   โ”œโ”€โ”€ trainer.py                  # Model training
โ”‚   โ”œโ”€โ”€ evaluator.py                # Metrics computation
โ”‚   โ”œโ”€โ”€ report_generator.py         # Report generation
โ”‚   โ”œโ”€โ”€ config.py                   # AutoMLConfig
โ”‚   โ””โ”€โ”€ legacy/                     # Legacy code (deprecated)
โ”œโ”€โ”€ models/
โ”‚   โ”œโ”€โ”€ model_zoo.py                # Model registry
โ”‚   โ”œโ”€โ”€ sk_models.py               # Sklearn wrappers
โ”‚   โ””โ”€โ”€ mlp_variants.py            # PyTorch MLPs
โ”œโ”€โ”€ utils/
โ”‚   โ”œโ”€โ”€ datasets.py                # Dataset loading & understanding
โ”‚   โ”œโ”€โ”€ logging_utils.py           # Result logging
โ”‚   โ””โ”€โ”€ seed_utils.py              # Random seed management
โ”œโ”€โ”€ configs/
โ”‚   โ”œโ”€โ”€ models.yaml                 # Model hyperparameters
โ”‚   โ””โ”€โ”€ preprocessing.yaml          # Preprocessing strategies
โ”œโ”€โ”€ examples/                       # Demo scripts
โ”œโ”€โ”€ experiments/                   # Results, meta-models, reports
โ”‚   โ”œโ”€โ”€ results.csv                # All experiment results
โ”‚   โ”œโ”€โ”€ meta_train.csv             # Meta-learning training data
โ”‚   โ”œโ”€โ”€ meta_model_*.pkl          # Trained meta-models
โ”‚   โ””โ”€โ”€ reports/                   # Generated reports
โ”œโ”€โ”€ data/                           # Datasets
โ”œโ”€โ”€ graph/                          # Neo4j graph client
โ””โ”€โ”€ web/                            # FastAPI web app

🧠 How It Works

Complete Pipeline

1. Dataset Understanding
   └── Load CSV → Normalize columns → Infer task type → Analyze metadata

2. Preprocessing Strategy Generation
   └── Generate strategies → Rank by proxy evaluation → Select top-K

3. Model Selection
   └── Filter models by task type → Generate candidates

4. Meta-Learning Ranking (Optional)
   └── Predict performance → Rank candidates by predicted score

5. Intelligent Search
   └── Train models → Evaluate → Log results → Stop if target reached

6. Meta-Learning Update
   └── Add results to history → Retrain meta-model

7. Report Generation
   └── Generate Markdown/JSON/Text reports

Key Components

1. Dataset Understanding (utils/datasets.py)

  • load_csv_with_fallback(): Robust CSV loading with column name normalization
  • infer_task_type(): Automatically detects regression vs classification
  • load_tabular_dataset(): Main loader that returns structured dataset info

Task Type Inference Logic:

  • Float values → Regression
  • Integer values with >50 unique and >5% unique ratio → Regression
  • 2 unique values → Binary classification
  • Otherwise → Multiclass classification
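The four rules above can be sketched in plain Python. This is an illustration rather than the actual infer_task_type() in utils/datasets.py: the function name, the "any float" reading of the first rule, and the "multiclass_classification" return string are assumptions.

```python
from typing import Sequence


def infer_task_type_sketch(labels: Sequence) -> str:
    """Apply the four inference rules in the order they are listed."""
    values = [v for v in labels if v is not None]
    n_unique = len(set(values))

    # Rule 1: float labels are treated as a continuous target
    # (assumption: a single float value is enough to trigger this).
    if any(isinstance(v, float) for v in values):
        return "regression"

    # Rule 2: integer labels with many distinct values (>50 unique and
    # a unique ratio above 5%) are also treated as continuous.
    all_ints = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
    if values and all_ints:
        unique_ratio = n_unique / len(values)
        if n_unique > 50 and unique_ratio > 0.05:
            return "regression"

    # Rule 3: exactly two distinct labels.
    if n_unique == 2:
        return "binary_classification"

    # Rule 4: everything else.
    return "multiclass_classification"
```

Note that the ordering matters: a label column stored as 0.0/1.0 floats would be classed as regression under this reading, so such labels may need casting to integers first.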

2. Preprocessing Strategy Engine (engine/preprocessing_strategies.py)

  • generate_preprocessing_strategies(): Creates preprocessing combinations

    • Numeric scalers: Standard, MinMax, Robust, PowerTransformer
    • Categorical encoders: OneHot, Ordinal
    • Numeric imputers: Mean, Median, KNN
    • Categorical imputers: Most Frequent, Constant
  • rank_preprocessors_by_proxy(): Fast evaluation on small sample (512 rows)

    • Trains cheap baseline models (LogisticRegression or RandomForest)
    • Ranks strategies by proxy score
    • Selects top-K strategies (typically top 3)
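A minimal sketch of the proxy-ranking idea described above: score every strategy on a small random sample and keep the best few. The function name and the proxy_score callable are hypothetical; in the real engine the score would come from fitting a cheap baseline model, which is left out here to keep the example dependency-free.

```python
import random
from typing import Callable, List, Sequence


def rank_by_proxy_sketch(
    strategies: Sequence[str],
    rows: Sequence,
    proxy_score: Callable[[str, list], float],
    sample_size: int = 512,
    top_k: int = 3,
    seed: int = 0,
) -> List[str]:
    """Return the top_k strategies ranked by proxy score on a row sample."""
    rng = random.Random(seed)
    rows = list(rows)
    # Evaluate on at most sample_size rows so ranking stays cheap.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    scored = [(proxy_score(strategy, sample), strategy) for strategy in strategies]
    # Higher proxy score is better; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [strategy for _, strategy in scored[:top_k]]
```

Only the strategies that survive this step are paired with models in the main search, which is where the saving comes from.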

3. Model Zoo (models/model_zoo.py)

Model Registry with task-aware filtering:

  • Classification: logreg, rf, gb, mlp_small, mlp_medium, mlp_deep, svm_clf
  • Regression: linreg, ridge, lasso, rf_reg, gb_reg, mlp_small_reg, mlp_medium_reg, mlp_deep_reg, svm_reg

Model Families:

  • neural_supervised: PyTorch MLPs (classification)
  • neural_regression: PyTorch MLPs (regression)
  • classical_supervised: Sklearn models (classification)
  • tree_regression: Random Forest, Gradient Boosting (regression)
  • linear_regression: Linear, Ridge, Lasso
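Task-aware filtering can be pictured as a lookup over a registry keyed by model name. The sketch below uses a handful of the model and family names listed above; the dictionary layout and the models_for_task() helper are assumptions, not the contents of models/model_zoo.py.

```python
from typing import Dict, List, Tuple

# name -> (task, family); a subset of the models listed above.
MODEL_REGISTRY_SKETCH: Dict[str, Tuple[str, str]] = {
    "logreg": ("classification", "classical_supervised"),
    "rf": ("classification", "classical_supervised"),
    "mlp_small": ("classification", "neural_supervised"),
    "ridge": ("regression", "linear_regression"),
    "gb_reg": ("regression", "tree_regression"),
    "mlp_small_reg": ("regression", "neural_regression"),
}


def models_for_task(task_type: str) -> List[str]:
    """Return registry names whose task matches the inferred task type."""
    if task_type == "regression":
        base = "regression"
    elif task_type.endswith("classification"):
        # Binary and multiclass share the same classification models.
        base = "classification"
    else:
        raise ValueError(f"unknown task type: {task_type!r}")
    return [name for name, (task, _family) in MODEL_REGISTRY_SKETCH.items() if task == base]
```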

4. Meta-Learning Brain (engine/meta_learner.py)

How It Works:

  1. Training: Learns from past experiments in experiments/meta_train.csv

    • Extracts features: dataset meta + model + preprocessing
    • Trains RandomForestRegressor to predict performance
    • Saves per-task+metric models: meta_model_{task}_{metric}.pkl
  2. Prediction: Before training, predicts performance for each candidate

    • Ranks candidates by predicted score
    • Tries top candidates first, so fewer candidates need to be trained (roughly 10x fewer than brute force when combined with proxy ranking)
  3. Self-Improvement: After each run, retrains on expanded history

Features Extracted:

  • Dataset: n_samples, n_features, n_numeric, n_categorical, n_classes, class_imbalance
  • Model: One-hot encoded family
  • Preprocessing: One-hot encoded strategy type
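To make the feature layout concrete, here is a sketch of how one candidate could be turned into a numeric vector and how candidates could then be ranked. The helper names, vocabularies, and dataset values are illustrative assumptions; predict_score stands in for the trained RandomForestRegressor.

```python
from typing import Callable, Dict, List, Sequence

DATASET_KEYS = ["n_samples", "n_features", "n_numeric",
                "n_categorical", "n_classes", "class_imbalance"]


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def build_meta_features(dataset_meta: Dict[str, float], model_family: str,
                        prep_type: str, families: Sequence[str],
                        prep_types: Sequence[str]) -> List[float]:
    """Concatenate dataset statistics with one-hot model and preprocessing codes."""
    numeric = [float(dataset_meta[key]) for key in DATASET_KEYS]
    return numeric + one_hot(model_family, families) + one_hot(prep_type, prep_types)


def rank_candidates(candidates: Sequence, predict_score: Callable) -> list:
    """Order candidates from highest to lowest predicted score."""
    return sorted(candidates, key=predict_score, reverse=True)


# Illustrative values only, not measurements from a real dataset.
FAMILIES = ["neural_supervised", "classical_supervised"]
PREP_TYPES = ["median_robust", "numeric_plus_cat_onehot"]
meta = {"n_samples": 1000, "n_features": 10, "n_numeric": 6,
        "n_categorical": 4, "n_classes": 2, "class_imbalance": 0.7}
features = build_meta_features(meta, "classical_supervised",
                               "numeric_plus_cat_onehot", FAMILIES, PREP_TYPES)
```

Keeping the vocabularies fixed across runs matters: the meta-model can only reuse history if the same column always means the same family or strategy.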

5. Intelligent Search (engine/intelligent_search.py)

Search Process:

  1. Filter models by task type
  2. Generate preprocessing strategies
  3. Rank preprocessing by proxy (fast evaluation)
  4. Generate candidates (top preprocessing × all models)
  5. Rank candidates using meta-learning (if available)
  6. Train models in ranked order
  7. Stop early if target metric reached
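Steps 6 and 7 amount to a loop over the ranked candidates with an early exit. The sketch below assumes a hypothetical train_and_score callable that returns a "higher is better" metric; it is not the code in engine/intelligent_search.py.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def run_search_sketch(
    ranked_candidates: Sequence,
    train_and_score: Callable[[object], float],
    target: Optional[float] = None,
) -> Tuple[Optional[tuple], List[tuple]]:
    """Train candidates in ranked order; stop once the target score is reached."""
    results: List[tuple] = []
    best: Optional[tuple] = None
    for candidate in ranked_candidates:
        score = train_and_score(candidate)
        results.append((candidate, score))
        if best is None or score > best[1]:
            best = (candidate, score)
        # Threshold stopping: skip the remaining candidates.
        if target is not None and score >= target:
            break
    return best, results
```

With no target the loop trains every candidate, so the ranking only changes the order in which results arrive; the saving comes from combining ranking with a target or a candidate cap.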

Optimizations:

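  • Proxy ranking: 4x fewer preprocessing strategies to train (tries the top 3 instead of all 12)
  • Meta-learning: about 2.4x fewer training runs (tries the top 10 candidates instead of all 24); together with proxy ranking, roughly 10x fewer than brute force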
  • Threshold stopping: Stops when target is reached
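The candidate counts behind these figures can be worked through directly. The sketch assumes 8 models, which is what "all 24" implies for 3 strategies and matches the Paddy example; it counts training runs only and ignores the cost of the proxy evaluation itself.

```python
def search_budget(n_strategies: int, top_k: int, n_models: int, top_candidates: int) -> dict:
    """Count training runs at each stage of the search."""
    brute_force = n_strategies * n_models          # every strategy x every model
    after_proxy = top_k * n_models                 # only the top-K strategies
    after_meta = min(top_candidates, after_proxy)  # only the top-ranked candidates
    return {
        "brute_force": brute_force,
        "after_proxy": after_proxy,
        "after_meta": after_meta,
        "proxy_speedup": brute_force / after_proxy,
        "meta_speedup": after_proxy / after_meta,
        "combined_speedup": brute_force / after_meta,
    }


budget = search_budget(n_strategies=12, top_k=3, n_models=8, top_candidates=10)
# 96 runs -> 24 runs -> 10 runs
```

Threshold stopping can reduce the count further, since the search may end before all top candidates are trained.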

6. Training Engine (engine/trainer.py)

  • PyTorch models: Full training loop with early stopping
  • Sklearn models: One-shot fit
  • Task-aware loss: MSE for regression, CrossEntropy for classification
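Early stopping with a patience counter, as used for the PyTorch models, can be sketched without any framework. The class and function names are hypothetical, and monitoring validation loss is an assumption about what the trainer tracks.

```python
from typing import Sequence


class EarlyStopperSketch:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses: Sequence[float], patience: int = 5) -> int:
    """Return how many epochs would run before early stopping triggers."""
    stopper = EarlyStopperSketch(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)
```

In a real training loop, should_stop() would be called once per epoch after the validation pass, and the best weights would typically be restored when it returns True.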

7. Evaluation System (engine/evaluator.py)

Metrics Available:

  • Classification: accuracy, f1_macro, f1_weighted, roc_auc
  • Regression: r2, neg_rmse, neg_mae (negative for "higher is better")
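The "negative for higher is better" convention means every metric can be maximised by the same search loop. A dependency-free sketch of a few of the metrics follows; the function names mirror the metric names above but are not the evaluator's actual API.

```python
import math
from typing import Sequence


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Negated root mean squared error, so that larger is better."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)


def neg_mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Negated mean absolute error, so that larger is better."""
    return -sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because errors are negated, a target such as --target on neg_rmse would be a negative number closer to zero, not a positive one.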

8. Report Generator (engine/report_generator.py)

Generates comprehensive reports in three formats:

  • Markdown: For documentation, GitHub
  • JSON: For programmatic access
  • Text: For terminal viewing
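As a rough picture of what "three formats" means, the sketch below renders one summary dictionary as Markdown, JSON, and plain text. The function name, summary keys, and layout are assumptions; the real report_generator.py will include far more detail.

```python
import json
from typing import Dict


def render_reports_sketch(summary: Dict[str, object]) -> Dict[str, str]:
    """Render the same run summary in Markdown, JSON, and plain text."""
    markdown_lines = ["# AutoML Report", ""]
    markdown_lines += [f"- **{key}**: {value}" for key, value in summary.items()]
    text_lines = [f"{key}: {value}" for key, value in summary.items()]
    return {
        "markdown": "\n".join(markdown_lines),
        "json": json.dumps(summary, indent=2),
        "text": "\n".join(text_lines),
    }


# Values taken from the Paddy example in this README.
reports = render_reports_sketch(
    {"model": "gb_reg", "preprocessing": "median_robust", "metric": "r2", "score": 0.9933}
)
```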

📖 Usage Guide

Dataset Requirements

Your CSV should have:

  • Numeric feature columns (all values numeric)
  • One label column (integer classes or strings that will be auto-encoded)

Example CSV structure:

feat1,feat2,feat3,target
0.12,5.3,10,1
0.42,2.1,3,0
...

Running Experiments

Interactive Mode

python main.py
# Select [8] Dataset-agnostic mode
# Select [2] Intelligent mode (or [1] Basic mode)
# Enter CSV path: data/your_dataset.csv
# Enter label column (or press Enter for auto-detect)
# Select metric: [1] R², [2] Accuracy, [3] F1, etc.
# Enter target metric (or press Enter for no target)

Autopilot Mode

# Classification
python main.py --autopilot data/adult.csv --label income --metric accuracy --target 0.86

# Regression
python main.py --autopilot data/paddydataset.csv --metric r2 --target 0.99

Preparing Datasets

Built-in Datasets

# Prepare common datasets (Iris, Wine, Breast Cancer)
python examples/prepare_datasets.py

Custom CSV

  1. Place CSV in data/ directory
  2. Ensure numeric features and one label column
  3. Run AutoML; the task type and, in most cases, the label column are auto-detected

Kaggle Datasets (e.g., Titanic)

# 1. Download train.csv from Kaggle
# 2. Place in data/titanic_train.csv
# 3. Preprocess
python examples/prepare_titanic.py
# 4. Run AutoML
python main.py --autopilot data/titanic_clean.csv --label Survived --metric accuracy

Configuration

Model Hyperparameters (configs/models.yaml)

mlp_small:
  lr: 0.001
  epochs: 10
  patience: 5

mlp_medium:
  lr: 0.0005
  epochs: 20
  patience: 5

Preprocessing Strategies (configs/preprocessing.yaml)

Preprocessing strategies are defined here and automatically selected based on dataset characteristics.

Using the Best Model

from engine.predictor import BestModelPredictor

# Load the best model from results
predictor = BestModelPredictor(
    results_csv="experiments/results.csv",
    in_dim=20,  # Your input dimension
    out_dim=3,  # Your number of classes
)

# Make predictions
predictions = predictor.predict_classes(x_new_batch)
probabilities = predictor.predict_proba(x_new_batch)

🔧 Advanced Features

Parallel Execution

Sklearn models are trained in parallel automatically, since they run on the CPU. PyTorch models remain sequential to avoid GPU conflicts.

Configuration:

from engine.config import AutoMLConfig

config = AutoMLConfig(
    csv_path="data/dataset.csv",
    n_jobs=-1,  # Use all CPU cores
    # ... other AutoMLConfig fields
)
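The split between parallel sklearn training and sequential PyTorch training can be sketched with the standard library. The candidate dictionaries, the "backend" key, and the use of a thread pool are all assumptions made for illustration; the real engine may use processes or joblib instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_candidates_sketch(
    candidates: List[dict],
    train_and_score: Callable[[dict], float],
    max_workers: int = 4,
) -> Dict[str, float]:
    """Run sklearn candidates concurrently, then PyTorch candidates one by one."""
    cpu_candidates = [c for c in candidates if c["backend"] == "sklearn"]
    gpu_candidates = [c for c in candidates if c["backend"] == "pytorch"]
    results: Dict[str, float] = {}

    # CPU-bound sklearn fits do not compete for a GPU, so they can overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = pool.map(train_and_score, cpu_candidates)
        for candidate, score in zip(cpu_candidates, scores):
            results[candidate["name"]] = score

    # PyTorch models share one device, so train them sequentially.
    for candidate in gpu_candidates:
        results[candidate["name"]] = train_and_score(candidate)
    return results
```

Lowering max_workers plays the same role as reducing n_jobs when memory is tight.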

Neo4j Graph Database

Track all experiments in a knowledge graph:

# Start Neo4j (Docker)
docker run -d \
  --name neo4j-automl \
  -p 7474:7474 \
  -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:latest

# The orchestrator automatically logs runs to Neo4j

Graph Schema:

  • (:Dataset)-[:USED_IN]->(:Run)
  • (:Model)-[:APPLIED_IN]->(:Run)
  • (:Preprocessor)-[:PART_OF]->(:Run)
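One way to write a run into this schema is a single parameterised Cypher statement. The sketch below only builds the query and its parameters; the property names (name, run_id, metric, score) and the helper are assumptions, not the project's graph client.

```python
from typing import Dict, Tuple

# MERGE reuses existing Dataset/Model/Preprocessor nodes; each Run is new.
LOG_RUN_QUERY = """
MERGE (d:Dataset {name: $dataset})
MERGE (m:Model {name: $model})
MERGE (p:Preprocessor {name: $preprocessor})
CREATE (r:Run {run_id: $run_id, metric: $metric, score: $score})
MERGE (d)-[:USED_IN]->(r)
MERGE (m)-[:APPLIED_IN]->(r)
MERGE (p)-[:PART_OF]->(r)
"""


def build_log_run(run_id: str, dataset: str, model: str, preprocessor: str,
                  metric: str, score: float) -> Tuple[str, Dict[str, object]]:
    """Return the Cypher text and parameter map for logging one run."""
    params = {
        "run_id": run_id,
        "dataset": dataset,
        "model": model,
        "preprocessor": preprocessor,
        "metric": metric,
        "score": score,
    }
    return LOG_RUN_QUERY.strip(), params


query, params = build_log_run("run-001", "adult", "rf",
                              "numeric_plus_cat_onehot", "accuracy", 0.86)
```

With the official neo4j Python driver, the pair could be passed to session.run(query, params) inside an open session. Using parameters rather than string formatting keeps dataset and model names from being interpreted as Cypher.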

Web API (FastAPI)

# Start web app
uvicorn web.app:app --reload

# Query API
curl http://localhost:8000/datasets
curl http://localhost:8000/datasets/adult/runs
curl http://localhost:8000/best-models/binary_classification

🎓 Understanding the System

Architecture Flow

User Input (CSV)
    ↓
Dataset Understanding
    ├── Load CSV with fallback
    ├── Infer task type
    ├── Infer column types
    └── Extract metadata
    ↓
Preprocessing Generation
    ├── Generate strategies
    └── Rank by proxy (fast evaluation)
    ↓
Model Selection
    └── Filter by task type
    ↓
Meta-Learning Ranking (Optional)
    └── Predict performance → Rank candidates
    ↓
Intelligent Search
    ├── Apply preprocessing
    ├── Train models
    ├── Evaluate metrics
    └── Log results
    ↓
Meta-Learning Update
    └── Retrain on expanded history
    ↓
Report Generation
    └── Markdown/JSON/Text

Key Design Decisions

  1. Task Type Inference: Automatically detects regression vs classification
  2. Proxy Ranking: Fast evaluation to select top preprocessing strategies
  3. Meta-Learning: Learns from past experiments to predict performance
  4. Threshold Stopping: Stops early when target metric is reached
  5. Self-Improvement: Gets smarter with every experiment

Performance Optimizations

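  • Proxy ranking: 4x fewer preprocessing strategies to train (tries the top 3 instead of all 12)
  • Meta-learning: about 2.4x fewer training runs (tries the top 10 candidates instead of all 24); together with proxy ranking, roughly 10x fewer than brute force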
  • Parallel execution: 4-8x speedup on multi-core systems
  • Early stopping: Saves compute on poor configurations
  • Threshold stopping: Stops when target is reached

๐Ÿ› Troubleshooting

Common Issues

JSON Serialization Error:

  • Fixed: Non-serializable objects (LabelEncoder, etc.) are removed before JSON dump

Label Column Not Detected:

  • The system tries common names: "target", "label", "y", "class", "income", etc.
  • If auto-detection fails, specify manually: --label your_label_column
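The fallback order described above (explicit --label first, then common names) could look like the following sketch. The name list is limited to the names quoted above, and the case-insensitive matching is an assumption.

```python
from typing import Optional, Sequence

# Only the names quoted in this README; the real list is longer ("etc.").
COMMON_LABEL_NAMES = ("target", "label", "y", "class", "income")


def detect_label_column_sketch(columns: Sequence[str],
                               explicit: Optional[str] = None) -> Optional[str]:
    """Return the label column, preferring an explicitly supplied name."""
    if explicit is not None:
        if explicit not in columns:
            raise ValueError(f"label column {explicit!r} not found in CSV")
        return explicit
    # Assumption: compare names case-insensitively, ignoring stray whitespace.
    lowered = {column.strip().lower(): column for column in columns}
    for name in COMMON_LABEL_NAMES:
        if name in lowered:
            return lowered[name]
    return None  # caller should ask the user for --label
```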

Out of Memory:

  • Reduce n_jobs (e.g., n_jobs=2 instead of -1)
  • Use smaller datasets for testing

Meta-Learning Not Working:

  • Requires at least 5-10 past experiments
  • Check experiments/meta_train.csv has data
  • Meta-models saved as meta_model_{task}_{metric}.pkl

📊 Example Results

Paddy Dataset (Regression)

Input: paddydataset.csv (regression, 2789 samples, 44 features)
↓
Task Inference: regression (477 unique values)
↓
Preprocessing: Generated 12 strategies, ranked top 3
↓
Models: Selected 8 regression models
↓
Meta-Learning: Ranked candidates by predicted performance
↓
Training: Tried top candidates
↓
Result: gb_reg + median_robust → R² = 0.9933
↓
Report: Generated in experiments/reports/

Adult Dataset (Classification)

Input: adult.csv (binary classification, 32561 samples, 14 features)
↓
Task Inference: binary_classification (2 classes)
↓
Preprocessing: Generated strategies, ranked top 3
↓
Models: Selected 7 classification models
↓
Result: rf + numeric_plus_cat_onehot → Accuracy = 0.86

🔮 Future Enhancements

  • Hyperparameter tuning (grid/random/Bayesian)
  • Feature engineering search space
  • Ensemble methods (stacking, voting)
  • Advanced preprocessing (target encoding, PCA)
  • Cross-validation option
  • Model interpretability (SHAP values)
  • Auto-deployment mode
  • Experiment tracking integration (MLflow, W&B)

๐Ÿ“ Key Files Reference

  • main.py: CLI entry point
  • engine/orchestrator.py: Main pipeline coordinator
  • engine/intelligent_search.py: Search engine
  • engine/meta_learner.py: Meta-learning brain
  • engine/preprocessing_strategies.py: Preprocessing generation
  • utils/datasets.py: Dataset loading & understanding
  • models/model_zoo.py: Model registry

🎉 Summary

This AutoML system is a complete, intelligent, self-improving platform that:

  • Understands datasets automatically
  • Chooses preprocessing intelligently
  • Uses meta-learning to avoid brute force
  • Generates comprehensive reports
  • Learns from every experiment

It's not just code - it's a thinking machine that gets smarter with every experiment! 🧠✨
