A lightweight audio classification system that detects harmful content in audio files using speech-to-text transcription and machine learning classification. Designed to run efficiently on resource-constrained devices with less than 2GB RAM.
This system addresses the challenge of detecting harmful content in audio files under strict resource constraints:
- Memory Limit: < 2GB RAM
- Training Data: ~100 labeled audio clips
- Deployment: Self-contained, suitable for laptops and mobile devices
The solution uses a transcription + text classification pipeline rather than end-to-end audio processing, as harmful content is primarily determined by linguistic content rather than acoustic features.
Two candidate approaches were evaluated:

Direct audio classification:
- Train or fine-tune an audio model (e.g., VGGish, YAMNet, Wav2Vec2) to classify harmful vs. safe directly from the waveform.
- Pros: captures non-verbal acoustic cues and works on the audio signal directly
- Cons: needs large training datasets (hundreds of hours of audio), higher latency, not interpretable
Classical Classifiers:
- Pipeline: Audio → ASR (Whisper Tiny) → Transcript → Embed (TF-IDF / MiniLM / BERT) → Classifier (Naive Bayes / LDA / LR)
- Pros: faster, probabilistic, interpretable; pipeline steps can be optimized separately
- Cons: less expressive than end-to-end audio models; requires both an ASR model and a text embedding step
Selected pipeline (sketched in code below):
Audio Input (.mp3) → ASR (Whisper Tiny) → Text Transcript → TF-IDF Vectorizer → Naive Bayes Classifier → Prediction (Safe/Harmful)
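A minimal sketch of this pipeline, assuming the openai-whisper and scikit-learn packages and placeholder file paths (the project's actual implementation lives under backend/src/):

```python
# Minimal sketch of the transcription + TF-IDF + Naive Bayes pipeline.
# Illustrative only: paths are placeholders and the real project wraps these
# steps in its own pipeline, vectorizer, and classifier classes.
import whisper
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

asr = whisper.load_model("tiny")  # Whisper Tiny (~39M parameters)

def transcribe(path):
    """Run ASR on one audio file and return the transcript text."""
    return asr.transcribe(path)["text"]

# Placeholder training data: ~100 labeled clips in practice.
train_files = ["data/safe/clip1.mp3", "data/harmful/clip2.mp3"]
train_labels = ["safe", "harmful"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # uni- to tri-grams
    ("nb", MultinomialNB()),
])
clf.fit([transcribe(f) for f in train_files], train_labels)

# Inference: transcribe, then classify the transcript.
print(clf.predict([transcribe("path/to/audio.mp3")])[0])
```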
Why this approach?
- Linguistic Focus: Harmful content depends on what is said, not how it's said
- Data Efficiency: Text classification works better with limited training data (~100 samples)
- Interpretability: Provides explanations for predictions
- Performance: Lower latency and memory footprint than end-to-end audio models
- Modularity: Each component can be optimized independently
- ASR: Whisper Tiny (39M parameters, fast, accurate) or faster-whisper (a quantized Whisper Tiny model)
- Vectorizer: TF-IDF with tri-grams (captures phrase patterns for harmful content)
- Classifier: Naive Bayes (probabilistic, fast, works well with limited data)
- Transcription + Text Classification over direct audio classification
- TF-IDF over deep learning embeddings (better with small datasets)
- Naive Bayes over more complex classifiers (fast, interpretable, good performance; see the interpretability sketch after this list)
- Whisper Tiny over larger models (balance of accuracy and speed)
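Because Naive Bayes over TF-IDF features is a linear model, the n-grams driving a prediction can be read off the fitted model. The helper below is a hedged illustration of how explanations could be produced; it assumes a scikit-learn Pipeline with steps named "tfidf" and "nb", as in the sketch above, rather than the project's actual explanation code.

```python
# Hedged illustration of explaining predictions from a TF-IDF + Naive Bayes
# pipeline; the step names "tfidf" and "nb" are assumptions from the sketch above.
import numpy as np

def top_ngrams(pipeline, class_name, k=10):
    """Return the k n-grams with the highest log-probability under class_name."""
    vectorizer = pipeline.named_steps["tfidf"]
    nb = pipeline.named_steps["nb"]
    class_idx = list(nb.classes_).index(class_name)
    names = vectorizer.get_feature_names_out()
    best = np.argsort(nb.feature_log_prob_[class_idx])[::-1][:k]
    return [names[i] for i in best]

# Example (with the fitted `clf` pipeline from the earlier sketch):
# top_ngrams(clf, "harmful")  -> phrases most indicative of the harmful class
```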
Based on evaluation of the trained model (output/20251012_161527_naive_bayes_tfidf):
| Metric | Value |
|---|---|
| Accuracy | 80.0% |
| F1-Score (Macro) | 79.2% |
| ROC-AUC | 93.8% |
| Inference Latency | ~2ms (excluding transcription) |
| Memory Usage | ~727MB peak |
Figure 1: Confusion Matrix
Figure 2: Performance Summary
Figure 3: ROC Curve
Full evaluation report available in output/20251012_161527_naive_bayes_tfidf/
# Install dependencies
uv sync

# Activate environment
source .venv/bin/activate

# Configure model parameters in config/config.yaml
python train_classifier.py

REST API (Recommended):

python main.py
# Visit http://localhost:8000/docs for the interactive API docs

Command Line:

python classify_audio.py --audio_file path/to/audio.mp3
python classify_audio.py --audio_dir path/to/directory

The system is configured via config/config.yaml. Key settings include:
- Data paths: Safe and harmful audio directories
- Model selection: Classifier type, vectorizer parameters
- ASR settings: Whisper model size, transcription options
- Evaluation: Metrics and visualization preferences
See docs/CONFIGURATION.md for the detailed configuration guide.
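As a hedged illustration of how those settings might be read in code, the snippet below uses section names that are assumptions mirroring the bullets above, not the documented schema:

```python
# Hypothetical sketch of loading config/config.yaml; the section names are
# assumptions -- consult docs/CONFIGURATION.md for the actual keys.
import yaml  # PyYAML

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

data_cfg = cfg.get("data", {})        # safe and harmful audio directories
model_cfg = cfg.get("model", {})      # classifier type, vectorizer parameters
asr_cfg = cfg.get("asr", {})          # Whisper model size, transcription options
eval_cfg = cfg.get("evaluation", {})  # metrics and visualization preferences
```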
The system tracks comprehensive metrics:
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC
- Performance: Latency statistics, memory usage, throughput
- Visualizations: Confusion matrix, ROC curves, precision-recall curves, embedding plots
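As a rough sketch of how the classification metrics above could be computed with scikit-learn (the project's evaluation module under backend/src/evaluation/ is the authoritative implementation):

```python
# Hedged sketch of the classification metrics; not the project's evaluation code.
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

def classification_metrics(y_true, y_pred, harmful_scores):
    """y_true/y_pred are lists of "safe"/"harmful" labels; harmful_scores are
    predicted probabilities for the harmful class (e.g., from predict_proba)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
        "roc_auc": roc_auc_score([label == "harmful" for label in y_true],
                                 harmful_scores),
    }
```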
The project is structured so that it can be easily extended with additional classifier and vectorizer algorithms; a hypothetical example of adding a new classifier follows the directory tree below.
audio-classification/
├── backend/ # Core ML pipeline
│ ├── src/
│ │ ├── classifiers/ # ML classifiers
│ │ │ ├── ml/ # LDA, Naive Bayes, LR, etc.
│ │ ├── vectorizers/ # Text vectorization
│ │ │ ├── nlp/ # TFIDF, etc.
│ │ ├── pipelines/ # Training/inference pipelines
│ │ └── evaluation/ # Metrics and visualizations
│ ├── train.py # Training script
│ └── infer.py # Inference engine
├── config/ # Configuration files
├── data/ # Training data
├── output/ # Model outputs and results
├── docs/ # Detailed documentation
├── main.py # REST API server
├── classify_audio.py # CLI interface
└── train_classifier.py # Training entry point
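The classifier and vectorizer interfaces are not reproduced here; as a purely hypothetical sketch (the class shape and expected methods are assumptions), a new classifier dropped into backend/src/classifiers/ml/ might look like this:

```python
# Purely hypothetical sketch of a new classifier for backend/src/classifiers/ml/.
# The expected interface is assumed to mirror scikit-learn's fit/predict API;
# check the existing classifier classes for the project's actual base interface.
from sklearn.svm import LinearSVC

class SVMClassifier:
    """Linear SVM on TF-IDF features, as an example drop-in classifier."""

    def __init__(self, C: float = 1.0):
        self.model = LinearSVC(C=C)

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)
```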
The system provides a REST API with the following endpoints:
- POST /classify_audio - Classify a single audio file
- POST /classify_audio_batch - Classify multiple audio files
- POST /classify_text - Classify text directly
- GET /health - Health check
Example response from POST /classify_audio:
{
"filename": "filename",
"transcript": "Transcribed audio text",
"time_taken": {
"total_time": "inference time",
"transcription_time": "transcription time",
"prediction_time": "classifier prediction time"
},
"prediction": "safe or harmful",
"confidence": prediction_confidence,
"explanation": "prediction explanation",
"success": true,
"error": null
}
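For example, a file could be submitted to a running server roughly as follows; the multipart field name ("file") and the default port are assumptions, so confirm the actual request schema at http://localhost:8000/docs:

```python
# Hedged example client; the multipart field name "file" is an assumption --
# confirm the actual request schema at http://localhost:8000/docs.
import requests

with open("path/to/audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/classify_audio",
        files={"file": ("audio.mp3", f, "audio/mpeg")},
    )
resp.raise_for_status()
result = resp.json()
print(result["prediction"], result["confidence"])
```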


