Audio Classification System

A lightweight audio classification system that detects harmful content in audio files using speech-to-text transcription and machine learning classification. Designed to run efficiently on resource-constrained devices with less than 2GB RAM.

Overview

This system addresses the challenge of detecting harmful content in audio files under strict resource constraints:

  • Memory Limit: < 2GB RAM
  • Training Data: ~100 labeled audio clips
  • Deployment: Self-contained, suitable for laptops and mobile devices

The solution uses a transcription + text classification pipeline rather than end-to-end audio processing, as harmful content is primarily determined by linguistic content rather than acoustic features.

Architecture

Approaches

  • Direct audio classification

    • Train or fine-tune an audio model (e.g., VGGish, YAMNet, Wav2Vec2) to classify harmful vs. safe directly from the waveform.
    • Pros: can use non-verbal acoustic cues and works on the audio signal directly
    • Cons: needs large datasets (hundreds of hours), higher latency, not interpretable
  • Classical classifiers (the chosen approach)

    • Pipeline: Audio → ASR (Whisper Tiny) → Transcript → Embed (TF-IDF / MiniLM / BERT) → Classifier (Naive Bayes / LDA / LR)
    • Pros: faster, probabilistic, interpretable, and each pipeline step can be optimized separately
    • Cons: less expressive than end-to-end audio models; requires both an ASR model and a text embedding model

The deployed pipeline (a minimal sketch follows the diagram):

Audio Input (.mp3) → ASR (Whisper Tiny) → Text Transcript → TF-IDF Vectorizer → Naive Bayes Classifier → Prediction (Safe/Harmful)
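
The sketch below walks through this inference path end to end. It assumes the openai-whisper and scikit-learn packages; the joblib artifact names are hypothetical placeholders, not the repository's actual output files.

# Minimal sketch of the inference path (assumes openai-whisper + scikit-learn).
# The joblib filenames are hypothetical placeholders, not this repo's artifacts.
import joblib
import whisper

asr_model = whisper.load_model("tiny")                 # Whisper Tiny (~39M params)
vectorizer = joblib.load("tfidf_vectorizer.joblib")    # fitted TF-IDF vectorizer
classifier = joblib.load("naive_bayes.joblib")         # fitted Naive Bayes model

def classify_audio(path: str) -> dict:
    """Audio -> transcript -> TF-IDF features -> safe/harmful prediction."""
    transcript = asr_model.transcribe(path)["text"]
    features = vectorizer.transform([transcript])
    label = classifier.predict(features)[0]
    confidence = float(classifier.predict_proba(features).max())
    return {"transcript": transcript, "prediction": label, "confidence": confidence}

print(classify_audio("example.mp3"))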

Why this approach?

  • Linguistic Focus: Harmful content depends on what is said, not how it's said
  • Data Efficiency: Text classification works better with limited training data (~100 samples)
  • Interpretability: Provides explanations for predictions
  • Performance: Lower latency and memory footprint than end-to-end audio models
  • Modularity: Each component can be optimized independently

Approach & Methodology

Model Selection

  • ASR: Whisper Tiny (39M parameters, fast, reasonably accurate) or faster-whisper (a quantized Whisper Tiny variant)
  • Vectorizer: TF-IDF with tri-grams (captures phrase patterns common in harmful content)
  • Classifier: Naive Bayes (probabilistic, fast, works well with limited data); a short training sketch follows this list
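
As a rough illustration of these choices (not the project's actual train_classifier.py), the snippet below fits a TF-IDF (uni- to tri-gram) plus Multinomial Naive Bayes pipeline on already-transcribed text; the transcripts and labels shown are placeholders for the ~100 labeled clips.

# Illustrative only: TF-IDF with up to tri-grams feeding a Naive Bayes classifier.
# `transcripts` / `labels` are placeholders for the ~100 transcribed training clips.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

transcripts = ["transcript of a safe clip", "transcript of a harmful clip"]
labels = ["safe", "harmful"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), lowercase=True)),  # phrase patterns
    ("nb", MultinomialNB()),                                         # fast, probabilistic
])
model.fit(transcripts, labels)

print(model.predict(["transcript of a new clip"]))        # -> ["safe"] or ["harmful"]
print(model.predict_proba(["transcript of a new clip"]))  # class probabilities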

Key Design Decisions

  1. Transcription + Text Classification over direct audio classification
  2. TF-IDF over deep learning embeddings (better with small datasets)
  3. Naive Bayes over more complex classifiers (fast, interpretable, good performance)
  4. Whisper Tiny over larger models (balance of accuracy and speed)

Results

Performance Metrics

Based on evaluation of the trained model (output/20251012_161527_naive_bayes_tfidf):

Metric               Value
Accuracy             80.0%
F1-Score (Macro)     79.2%
ROC-AUC              93.8%
Inference Latency    ~2 ms (excluding transcription)
Peak Memory Usage    ~727 MB

Figures


Figure 1: Confusion Matrix

Figure 2: Performance Summary

Figure 3: ROC Curve

Full evaluation report available in output/20251012_161527_naive_bayes_tfidf/

Quick Start

Installation

# Install dependencies
uv sync

# Activate environment
source .venv/bin/activate

Training

# Configure model parameters in config/config.yaml
python train_classifier.py

Inference

REST API (Recommended):

python main.py
# Visit http://localhost:8000/docs for interactive API

Command Line:

python classify_audio.py --audio_file path/to/audio.mp3
python classify_audio.py --audio_dir path/to/directory

Configuration

The system is configured via config/config.yaml. Key settings include:

  • Data paths: Safe and harmful audio directories
  • Model selection: Classifier type, vectorizer parameters
  • ASR settings: Whisper model size, transcription options
  • Evaluation: Metrics and visualization preferences

See docs/CONFIGURATION.md for the detailed configuration guide; a hypothetical loading sketch follows.
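
As a purely hypothetical illustration of how such a YAML config might be read (the key names below are stand-ins, not the actual schema in config/config.yaml):

# Hypothetical illustration only; the real key names live in config/config.yaml
# and are documented in docs/CONFIGURATION.md.
import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

# Example (stand-in) keys:
# safe_dir     = config["data"]["safe_audio_dir"]
# harmful_dir  = config["data"]["harmful_audio_dir"]
# whisper_size = config["asr"]["model_size"]      # e.g. "tiny"
# classifier   = config["model"]["classifier"]    # e.g. "naive_bayes"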

Evaluation Methodology

The system tracks comprehensive metrics (a short metric-computation sketch follows this list):

  • Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC
  • Performance: Latency statistics, memory usage, throughput
  • Visualizations: Confusion matrix, ROC curves, precision-recall curves, embedding plots
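
The classification metrics map onto standard scikit-learn calls; a small sketch with placeholder labels and scores:

# Sketch of the classification metrics above, using scikit-learn.
# y_true / y_pred / y_score are placeholders for held-out labels and model outputs.
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

y_true  = [1, 0, 1, 1, 0]            # 1 = harmful, 0 = safe (placeholder)
y_pred  = [1, 0, 0, 1, 0]            # hard predictions (placeholder)
y_score = [0.9, 0.2, 0.4, 0.8, 0.1]  # P(harmful) from predict_proba (placeholder)

print("accuracy  :", accuracy_score(y_true, y_pred))
print("precision :", precision_score(y_true, y_pred))
print("recall    :", recall_score(y_true, y_pred))
print("f1 (macro):", f1_score(y_true, y_pred, average="macro"))
print("roc-auc   :", roc_auc_score(y_true, y_score))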

Project Structure

The project is structured so that it can be easily extended with additional classifier and vectorizer algorithms.

audio-classification/
├── backend/                 # Core ML pipeline
│   ├── src/
│   │   ├── classifiers/    # ML classifiers
│   │   │   ├── ml/         # LDA, Naive Bayes, LR, etc.
│   │   ├── vectorizers/    # Text vectorization
│   │   │   ├── nlp/        # TF-IDF, etc.
│   │   ├── pipelines/      # Training/inference pipelines
│   │   └── evaluation/     # Metrics and visualizations
│   ├── train.py           # Training script
│   └── infer.py           # Inference engine
├── config/                 # Configuration files
├── data/                   # Training data
├── output/                 # Model outputs and results
├── docs/                   # Detailed documentation
├── main.py                # REST API server
├── classify_audio.py      # CLI interface
└── train_classifier.py    # Training entry point

API Reference

The system provides a REST API with the following endpoints (a minimal client sketch follows the list):

  • POST /classify_audio - Classify single audio file
  • POST /classify_audio_batch - Classify multiple audio files
  • POST /classify_text - Classify text directly
  • GET /health - Health check
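
A minimal client sketch for the single-file endpoint, assuming the server from main.py is running on localhost:8000; the multipart field name ("file") is an assumption, so check the interactive docs at /docs for the actual request schema.

# Hedged client sketch for POST /classify_audio; the multipart field name
# ("file") is an assumption -- the interactive docs at /docs show the real schema.
import requests

with open("path/to/audio.mp3", "rb") as f:
    response = requests.post(
        "http://localhost:8000/classify_audio",
        files={"file": ("audio.mp3", f, "audio/mpeg")},
    )

response.raise_for_status()
print(response.json())  # see the example response below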

Example response

POST /classify_audio:

{
  "filename": "audio.mp3",
  "transcript": "Transcribed audio text",
  "time_taken": {
    "total_time": "total inference time",
    "transcription_time": "transcription time",
    "prediction_time": "classifier prediction time"
  },
  "prediction": "safe or harmful",
  "confidence": "prediction confidence",
  "explanation": "prediction explanation",
  "success": true,
  "error": null
}
