A lightweight audio classification system that detects harmful content in audio files using speech-to-text transcription and machine learning classification. Designed to run efficiently on resource-constrained devices with less than 2GB RAM.
This system addresses the challenge of detecting harmful content in audio files under strict resource constraints:
- Memory Limit: < 2GB RAM
- Training Data: ~100 labeled audio clips
- Deployment: Self-contained, suitable for laptops and mobile devices
The solution uses a transcription + text classification pipeline rather than end-to-end audio processing, as harmful content is primarily determined by linguistic content rather than acoustic features.
Two candidate approaches were evaluated:

Direct audio classification:
- Train or fine-tune an audio model (e.g., VGGish, YAMNet, Wav2Vec2) to classify harmful vs. safe directly from the waveform.
- Pros: captures non-verbal acoustic cues and works on the audio signal directly
- Cons: needs large training datasets (hundreds of hours of audio), higher latency, not interpretable
Classical Classifiers:
- Pipeline: Audio → ASR (Whisper Tiny) → Transcript → Embed (TF-IDF / MiniLM / BERT) → Classifier (Naive Bayes / LDA / LR)
- Pros: faster, probabilistic, interpretable; pipeline steps can be optimized separately
- Cons: less expressive than end-to-end audio models; requires both an ASR model and a text embedding step
Selected pipeline (sketched in code below):
Audio Input (.mp3) → ASR (Whisper Tiny) → Text Transcript → TF-IDF Vectorizer → Naive Bayes Classifier → Prediction (Safe/Harmful)
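A minimal sketch of this pipeline, assuming the openai-whisper and scikit-learn packages and placeholder file paths (the project's actual implementation lives under backend/src/):

```python
# Minimal sketch of the transcription + TF-IDF + Naive Bayes pipeline.
# Illustrative only: paths are placeholders and the real project wraps these
# steps in its own pipeline, vectorizer, and classifier classes.
import whisper
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

asr = whisper.load_model("tiny")  # Whisper Tiny (~39M parameters)

def transcribe(path):
    """Run ASR on one audio file and return the transcript text."""
    return asr.transcribe(path)["text"]

# Placeholder training data: ~100 labeled clips in practice.
train_files = ["data/safe/clip1.mp3", "data/harmful/clip2.mp3"]
train_labels = ["safe", "harmful"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # uni- to tri-grams
    ("nb", MultinomialNB()),
])
clf.fit([transcribe(f) for f in train_files], train_labels)

# Inference: transcribe, then classify the transcript.
print(clf.predict([transcribe("path/to/audio.mp3")])[0])
```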
Why this approach?
- Linguistic Focus: Harmful content depends on what is said, not how it's said
- Data Efficiency: Text classification works better with limited training data (~100 samples)
- Interpretability: Provides explanations for predictions
- Performance: Lower latency and memory footprint than end-to-end audio models
- Modularity: Each component can be optimized independently
- ASR: Whisper Tiny (39M parameters, fast, accurate) or faster-whisper (a quantized Whisper Tiny model)
- Vectorizer: TF-IDF with tri-grams (captures phrase patterns for harmful content)
- Classifier: Naive Bayes (probabilistic, fast, works well with limited data)
- Transcription + Text Classification over direct audio classification
- TF-IDF over deep learning embeddings (better with small datasets)
- Naive Bayes over more complex classifiers (fast, interpretable, good performance; see the interpretability sketch after this list)
- Whisper Tiny over larger models (balance of accuracy and speed)
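Because Naive Bayes over TF-IDF features is a linear model, the n-grams driving a prediction can be read off the fitted model. The helper below is a hedged illustration of how explanations could be produced; it assumes a scikit-learn Pipeline with steps named "tfidf" and "nb", as in the sketch above, rather than the project's actual explanation code.

```python
# Hedged illustration of explaining predictions from a TF-IDF + Naive Bayes
# pipeline; the step names "tfidf" and "nb" are assumptions from the sketch above.
import numpy as np

def top_ngrams(pipeline, class_name, k=10):
    """Return the k n-grams with the highest log-probability under class_name."""
    vectorizer = pipeline.named_steps["tfidf"]
    nb = pipeline.named_steps["nb"]
    class_idx = list(nb.classes_).index(class_name)
    names = vectorizer.get_feature_names_out()
    best = np.argsort(nb.feature_log_prob_[class_idx])[::-1][:k]
    return [names[i] for i in best]

# Example (with the fitted `clf` pipeline from the earlier sketch):
# top_ngrams(clf, "harmful")  -> phrases most indicative of the harmful class
```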
Based on evaluation of the trained model (output/20251012_161527_naive_bayes_tfidf):
| Metric | Value |
|---|---|
| Accuracy | 80.0% |
| F1-Score (Macro) | 79.2% |
| ROC-AUC | 93.8% |
| Inference Latency | ~2ms (excluding transcription) |
| Memory Usage | ~727MB peak |
Figure 1: Confusion Matrix
Figure 2: Performance Summary
Figure 3: ROC Curve
Full evaluation report available in output/20251012_161527_naive_bayes_tfidf/
# Install dependencies
uv sync

# Activate environment
source .venv/bin/activate

# Configure model parameters in config/config.yaml
python train_classifier.py

REST API (Recommended):

python main.py
# Visit http://localhost:8000/docs for the interactive API docs

Command Line:

python classify_audio.py --audio_file path/to/audio.mp3
python classify_audio.py --audio_dir path/to/directory

The system is configured via config/config.yaml. Key settings include:
- Data paths: Safe and harmful audio directories
- Model selection: Classifier type, vectorizer parameters
- ASR settings: Whisper model size, transcription options
- Evaluation: Metrics and visualization preferences
See docs/CONFIGURATION.md for the detailed configuration guide.
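As a hedged illustration of how those settings might be read in code, the snippet below uses section names that are assumptions mirroring the bullets above, not the documented schema:

```python
# Hypothetical sketch of loading config/config.yaml; the section names are
# assumptions -- consult docs/CONFIGURATION.md for the actual keys.
import yaml  # PyYAML

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

data_cfg = cfg.get("data", {})        # safe and harmful audio directories
model_cfg = cfg.get("model", {})      # classifier type, vectorizer parameters
asr_cfg = cfg.get("asr", {})          # Whisper model size, transcription options
eval_cfg = cfg.get("evaluation", {})  # metrics and visualization preferences
```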
The system tracks comprehensive metrics:
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC
- Performance: Latency statistics, memory usage, throughput
- Visualizations: Confusion matrix, ROC curves, precision-recall curves, embedding plots
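As a rough sketch of how the classification metrics above could be computed with scikit-learn (the project's evaluation module under backend/src/evaluation/ is the authoritative implementation):

```python
# Hedged sketch of the classification metrics; not the project's evaluation code.
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

def classification_metrics(y_true, y_pred, harmful_scores):
    """y_true/y_pred are lists of "safe"/"harmful" labels; harmful_scores are
    predicted probabilities for the harmful class (e.g., from predict_proba)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
        "roc_auc": roc_auc_score([label == "harmful" for label in y_true],
                                 harmful_scores),
    }
```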
The project is structured so that it can be easily extended with additional classifier and vectorizer algorithms; a hypothetical example of adding a new classifier follows the directory tree below.
audio-classification/
├── backend/ # Core ML pipeline
│ ├── src/
│ │ ├── classifiers/ # ML classifiers
│ │ │ ├── ml/ # LDA, Naive Bayes, LR, etc.
│ │ ├── vectorizers/ # Text vectorization
│ │ │ ├── nlp/ # TFIDF, etc.
│ │ ├── pipelines/ # Training/inference pipelines
│ │ └── evaluation/ # Metrics and visualizations
│ ├── train.py # Training script
│ └── infer.py # Inference engine
├── config/ # Configuration files
├── data/ # Training data
├── output/ # Model outputs and results
├── docs/ # Detailed documentation
├── main.py # REST API server
├── classify_audio.py # CLI interface
└── train_classifier.py # Training entry point
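The classifier and vectorizer interfaces are not reproduced here; as a purely hypothetical sketch (the class shape and expected methods are assumptions), a new classifier dropped into backend/src/classifiers/ml/ might look like this:

```python
# Purely hypothetical sketch of a new classifier for backend/src/classifiers/ml/.
# The expected interface is assumed to mirror scikit-learn's fit/predict API;
# check the existing classifier classes for the project's actual base interface.
from sklearn.svm import LinearSVC

class SVMClassifier:
    """Linear SVM on TF-IDF features, as an example drop-in classifier."""

    def __init__(self, C: float = 1.0):
        self.model = LinearSVC(C=C)

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)
```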
The system provides a REST API with the following endpoints:
- POST /classify_audio - Classify a single audio file
- POST /classify_audio_batch - Classify multiple audio files
- POST /classify_text - Classify text directly
- GET /health - Health check
Example response from POST /classify_audio:
{
"filename": "filename",
"transcript": "Transcribed audio text",
"time_taken": {
"total_time": "inference time",
"transcription_time": "transcription time",
"prediction_time": "classifier prediction time"
},
"prediction": "safe or harmful",
"confidence": prediction_confidence,
"explanation": "prediction explanation",
"success": true,
"error": null
}
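For example, a file could be submitted to a running server roughly as follows; the multipart field name ("file") and the default port are assumptions, so confirm the actual request schema at http://localhost:8000/docs:

```python
# Hedged example client; the multipart field name "file" is an assumption --
# confirm the actual request schema at http://localhost:8000/docs.
import requests

with open("path/to/audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/classify_audio",
        files={"file": ("audio.mp3", f, "audio/mpeg")},
    )
resp.raise_for_status()
result = resp.json()
print(result["prediction"], result["confidence"])
```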


