A deep learning-based application that generates text summaries from video inputs. The project features a Streamlit frontend for video upload and a FastAPI backend deployed on Modal that processes videos through a sophisticated ML pipeline to generate summaries.
- Video Upload Interface: User-friendly Streamlit frontend supporting `.mp4`, `.mkv`, and `.mov` formats
- Intelligent Frame Extraction: Automatically extracts and selects key frames from videos
- Deep Learning Pipeline:
  - Frame feature extraction using GoogLeNet
  - Key frame selection using the R(2+1)D video model
  - Caption generation using BLIP-2 (Salesforce/blip2-opt-2.7b)
  - Optional API-based final summarization with OpenRouter
- Cloud Deployment: Backend hosted on Modal (serverless with GPU support), frontend on Streamlit Cloud
- Automatic Cleanup: Temporary files and old uploads are automatically managed
- Secure Configuration: Environment-based configuration for API keys and secrets
```
┌──────────────────┐        ┌───────────────────┐        ┌──────────────────┐
│    Streamlit     │        │       Modal       │        │   ML Pipeline    │
│    Frontend      │───────▶│      Backend      │───────▶│    (PyTorch)     │
│  (Cloud/Local)   │  HTTP  │     (FastAPI)     │        │  • GoogLeNet     │
│                  │        │  • GPU Support    │        │  • R(2+1)D       │
│                  │        │  • Auto-scaling   │        │  • BLIP-2        │
└──────────────────┘        └───────────────────┘        └──────────────────┘
```
- Streamlit Application: Web-based UI for video upload and result display (a minimal sketch follows this list)
  - Handles file upload, displays processing status, and manages user sessions
  - Automatic cleanup of old uploads (24+ hours)
  - Configurable backend API URL (Modal deployment)
- FastAPI Server: RESTful API for video processing
- Modal Deployment: Serverless deployment with GPU support (T4/A10G)
- ML Pipeline: Complete deep learning workflow for video-to-text summarization
  - Video frame extraction (OpenCV)
  - Feature extraction (GoogLeNet)
  - Key frame selection (R(2+1)D video model)
  - Caption generation (BLIP-2)
  - Optional API summarization (OpenRouter)
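The snippet below is a minimal sketch of the frontend-to-backend hand-off described above, not the project's actual `app/main.py`. It assumes `BACKEND_API_URL` is provided via the environment or Streamlit secrets, and that the backend exposes the `POST /process_upload` endpoint documented later in this README.

```python
# Minimal sketch of the upload hand-off (illustrative, not the actual app/main.py).
import os

import requests
import streamlit as st

# Backend URL from environment variable or Streamlit secrets (see Configuration).
backend_url = os.environ.get("BACKEND_API_URL") or st.secrets.get("BACKEND_API_URL")

uploaded = st.file_uploader("Upload a video", type=["mp4", "mkv", "mov"])
if uploaded is not None and st.button("Process Video"):
    with st.spinner("Processing video... this can take several minutes"):
        resp = requests.post(
            f"{backend_url}/process_upload",
            files={"file": (uploaded.name, uploaded.getvalue(), "video/mp4")},
            timeout=1800,  # generous timeout: the pipeline can run for many minutes
        )
    if resp.ok:
        st.subheader("Summary")
        st.write(resp.json()["summary"])
    else:
        st.error(f"Backend error {resp.status_code}: {resp.text}")
```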
- PyTorch - Deep learning framework
- Torchvision - Pre-trained models (GoogLeNet, R(2+1)D)
- Transformers (HuggingFace) - BLIP-2 model
- Python: 3.8+
- GPU: Optional but recommended for faster processing (T4 available on Modal)
- Memory: Minimum 4GB RAM (8GB+ recommended for local development)
- Disk Space: For video storage and model cache (~5-10GB for models)
```bash
git clone <repository-url>
cd Video-Summary-Generator

python -m venv .venv

# On Windows
.venv\Scripts\activate

# On Linux/Mac
source .venv/bin/activate

pip install -r requirements.txt
```

Backend (local):
```bash
cd backend
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

Frontend (local):
```bash
cd app

# Set backend URL (for local backend)
export BACKEND_API_URL=http://localhost:8000          # Linux/Mac
# OR
set BACKEND_API_URL=http://localhost:8000             # Windows CMD
# OR
$env:BACKEND_API_URL="http://localhost:8000"          # Windows PowerShell

streamlit run main.py
```

- Install Modal CLI:
```bash
pip install modal
modal token new   # Authenticate
```

- Deploy Backend:

```bash
modal deploy backend/modal_app.py
```

- Get Your Modal URL: After deployment, Modal provides a URL like:

```
https://your-username--video-summary-generator-fastapi-app.modal.run
```
- Configure Secrets (Optional):

```bash
modal secret create env \
  API_URL=https://your-api-url.com \
  API_KEY=your_api_key \
  MODEL_NAME=tngtech/deepseek-r1t2-chimera:free
```
- Push code to GitHub
- Configure Streamlit Cloud:
  - Go to streamlit.io/cloud
  - Click "New app"
  - Connect your GitHub repository
  - Set Main file path: `app/main.py`
- Add Secrets: In Streamlit Cloud settings, add:

```toml
BACKEND_API_URL = "https://your-modal-url.modal.run"
```

- Deploy!
You can run the frontend locally and connect it to your Modal backend:
```bash
# Set Modal backend URL
export BACKEND_API_URL="https://your-username--video-summary-generator-fastapi-app.modal.run"

# Run frontend
cd app
streamlit run main.py
```

- Upload Video: Use the Streamlit frontend to upload a video file (`.mp4`, `.mkv`, or `.mov`)
- Process: Click "Process Video" button
- Wait: Processing can take 5-30 minutes depending on video length (first request loads models)
- View Results: The generated summary will be displayed
You can also call the Modal API directly:
```bash
curl -X POST "https://your-modal-url.modal.run/process_upload" \
  -H "accept: application/json" \
  -F "file=@your_video.mp4"
```

Response:

```json
{
  "status": "success",
  "run_id": "uuid-here",
  "summary": "Generated summary text...",
  "original_filename": "your_video.mp4",
  "file_size_bytes": 12345678
}
```
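The same request can be scripted from Python; this is a sketch that assumes the response shape shown above.

```python
# Call the deployed endpoint from Python (sketch; response fields as documented above).
import requests

MODAL_URL = "https://your-modal-url.modal.run"  # replace with your deployment URL

with open("your_video.mp4", "rb") as f:
    resp = requests.post(
        f"{MODAL_URL}/process_upload",
        headers={"accept": "application/json"},
        files={"file": ("your_video.mp4", f, "video/mp4")},
        timeout=1800,  # processing can take many minutes
    )

resp.raise_for_status()
result = resp.json()
print(result["run_id"])
print(result["summary"])
```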
```
Video-Summary-Generator/
│
├── app/                      # Streamlit Frontend
│   ├── main.py               # Main Streamlit application
│   ├── uploads/              # User uploads (auto-created, gitignored)
│   └── .streamlit/           # Streamlit configuration
│       └── secrets.toml      # Secrets for local (gitignored)
│
├── backend/                  # FastAPI Backend
│   ├── main.py               # FastAPI application
│   ├── pipeline.py           # ML pipeline implementation
│   ├── modal_app.py          # Modal deployment configuration
│   ├── uploads/              # Processed videos (gitignored)
│   └── .env                  # Environment variables (gitignored)
│
├── notebooks/                # Development notebooks
│   └── notebook1.ipynb       # Original pipeline development
│
├── requirements.txt          # Root-level dependencies
├── .gitignore                # Git ignore rules
├── LICENSE                   # License file
└── README.md                 # This file
```
Local (Environment Variable):

```bash
export BACKEND_API_URL="https://your-modal-url.modal.run"
```

Local (Streamlit secrets):
Create `app/.streamlit/secrets.toml`:

```toml
BACKEND_API_URL = "https://your-modal-url.modal.run"
```

Streamlit Cloud:
Add a secret in the dashboard with the key `BACKEND_API_URL`.
Modal Secrets: Set via Modal dashboard or CLI:
```bash
modal secret create env \
  API_URL=https://your-api-url.com \
  API_KEY=your_api_key \
  MODEL_NAME=tngtech/deepseek-r1t2-chimera:free
```

Note: API configuration is optional. If not provided, the pipeline returns combined frame captions without final API summarization.
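For illustration, one plausible way the pipeline can guard this optional step (the real logic lives in `backend/pipeline.py`): it reuses the env var names from the secret above and assumes `API_URL` points at an OpenAI-compatible API base such as OpenRouter's.

```python
# Illustrative guard for the optional summarization step (not the actual pipeline.py).
# Assumes API_URL is an OpenAI-compatible chat completions base URL (e.g. OpenRouter).
import os
from typing import List

import requests

def summarize_captions(captions: List[str]) -> str:
    api_url = os.environ.get("API_URL")
    api_key = os.environ.get("API_KEY")
    model = os.environ.get("MODEL_NAME")
    combined = " ".join(captions)

    # Without API configuration, fall back to the combined frame captions.
    if not (api_url and api_key and model):
        return combined

    resp = requests.post(
        f"{api_url}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,
            "messages": [
                {"role": "user", "content": f"Summarize this video description: {combined}"}
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```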
Customize in `backend/pipeline.py`:

- `frame_skip`: Extract every Nth frame (default: 30)
- `importance_threshold`: Threshold for key frame selection (default: 0.5)
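As a rough illustration of how these two knobs are typically applied (the actual implementation in `pipeline.py` may differ):

```python
# Illustrative use of frame_skip and importance_threshold.
import cv2  # OpenCV

def extract_frames(video_path: str, frame_skip: int = 30):
    """Keep every Nth frame from the video."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_skip == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

def select_key_frames(frames, scores, importance_threshold: float = 0.5):
    """Keep frames whose importance score clears the threshold."""
    return [f for f, s in zip(frames, scores) if s >= importance_threshold]
```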
Edit `backend/modal_app.py`:

- `gpu`: GPU type (`"T4"`, `"A10G"`, `"A100"`, or `None` for CPU)
- `timeout`: Request timeout in seconds (default: 1800)
- `memory`: Memory allocation in MB (default: 8192)
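These options are normally passed to the Modal function that serves the FastAPI app. The following is a schematic excerpt, not the project's actual `modal_app.py`; the image definition and the import inside the function are assumptions.

```python
# Schematic excerpt (not the project's actual modal_app.py).
import modal

app = modal.App("video-summary-generator")

# Assumed image: the real file may pin versions or add system packages.
image = modal.Image.debian_slim().pip_install_from_requirements("requirements.txt")

@app.function(
    image=image,
    gpu="T4",        # "T4", "A10G", "A100", or None for CPU-only
    timeout=1800,    # request timeout in seconds
    memory=8192,     # memory allocation in MB
)
@modal.asgi_app()
def fastapi_app():
    # Assumes backend/main.py (the FastAPI application) is importable as `main`.
    from main import app as api
    return api
```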
Health check endpoint.
Response:
```json
{
  "status": "ok",
  "message": "Video Summary Generator API is running",
  "version": "1.0.0"
}
```

Detailed health check.
Response:
```json
{
  "status": "healthy",
  "uploads_directory": "/path/to/uploads",
  "uploads_directory_exists": true
}
```

Process an uploaded video and generate a summary (`POST /process_upload`).
Request:
- Method: `POST`
- Content-Type: `multipart/form-data`
- Body: Video file (`.mp4`, `.mkv`, or `.mov`)
- Max file size: 500 MB
Response:
```json
{
  "status": "success",
  "run_id": "uuid-here",
  "summary": "Generated summary text...",
  "original_filename": "video.mp4",
  "file_size_bytes": 12345678
}
```

Error Responses:

- `400`: Unsupported file format or invalid request
- `413`: File size exceeds limit
- `500`: Processing error
The pipeline consists of several stages:
- Frame Extraction: Uses OpenCV to extract frames at configurable intervals (every 30th frame by default)
- Feature Extraction: GoogLeNet extracts 1024-dimensional feature vectors from each frame
- Key Frame Selection: The R(2+1)D video model analyzes temporal clips to select the ~25 most important frames
- Caption Generation: BLIP-2 (2.7B parameter model) generates detailed natural language captions for key frames
- Summarization: Optional OpenRouter API call for final summary refinement and narrative coherence
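A condensed, illustrative sketch of the feature-extraction and captioning stages follows (key frame selection with R(2+1)D is omitted here, and the real `backend/pipeline.py` differs in detail):

```python
# Illustrative sketch of stages 2 and 4; not the actual backend/pipeline.py.
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stage 2: GoogLeNet with its classifier removed yields 1024-d features per frame.
googlenet = models.googlenet(weights="IMAGENET1K_V1").eval().to(device)
googlenet.fc = torch.nn.Identity()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Stage 4: BLIP-2 turns each selected key frame into a natural-language caption.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

@torch.no_grad()
def frame_features(frame: Image.Image) -> torch.Tensor:
    """1024-dimensional GoogLeNet feature vector for one frame."""
    return googlenet(preprocess(frame).unsqueeze(0).to(device)).squeeze(0)

@torch.no_grad()
def caption_frame(frame: Image.Image) -> str:
    """BLIP-2 caption for one key frame."""
    inputs = processor(images=frame, return_tensors="pt").to(device, dtype)
    out = blip2.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True).strip()
```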
The application includes automatic cleanup features:
- Frontend: Old uploads (24+ hours) are automatically removed on app start
- Backend: Temporary processing files are cleaned up after each run
- Manual Cleanup: Use the sidebar in the frontend to manually clean uploads
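The 24-hour rule can be as simple as a modification-time check on startup; the sketch below is illustrative, not the project's exact cleanup code, and assumes uploads live in the `uploads/` directories shown in the project structure.

```python
# Sketch of the 24-hour upload cleanup (illustrative only).
import time
from pathlib import Path

def cleanup_old_uploads(upload_dir: str = "uploads", max_age_hours: float = 24) -> int:
    """Delete files older than max_age_hours; returns how many were removed."""
    cutoff = time.time() - max_age_hours * 3600
    removed = 0
    for path in Path(upload_dir).glob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```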
Server not responding:
```bash
# Check logs
modal app logs video-summary-generator

# Test health endpoint
curl https://your-url.modal.run/health

# Redeploy
modal deploy backend/modal_app.py
```

Cold start delays:
- First request after inactivity may take 30-60 seconds (model loading)
- This is normal for serverless deployments
GPU not available:
- Edit `backend/modal_app.py` and set `gpu=None` for CPU-only
Backend not connecting:
- Verify `BACKEND_API_URL` is set correctly
- Test the Modal URL: `curl https://your-url.modal.run/health`
- Check CORS (Modal handles this automatically)
- Frame Extraction: ~1-2 seconds per minute of video
- Feature Extraction: ~2-5 seconds per frame (CPU) / ~0.5-1 second (GPU)
- Caption Generation: ~1-2 seconds per key frame
- Total Time:
  - CPU: ~5-10 minutes for a 1-minute video
  - GPU (T4): ~2-5 minutes for a 1-minute video
See LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues and questions:
- Open an issue on the repository
- Check Modal documentation: modal.com/docs
- Check Streamlit documentation: docs.streamlit.io
- GoogLeNet: Pre-trained model for feature extraction
- R(2+1)D: Video understanding model for temporal analysis
- BLIP-2: Salesforce's advanced image captioning model
- Modal: Serverless GPU infrastructure
- Streamlit: Rapid web app framework
- OpenRouter: API gateway for LLM summarization
Note: This project is under active development. The pipeline may take several minutes to process videos depending on length and hardware capabilities. First request on Modal may experience cold start delays while models are loaded.