VideoDigest 🎬


A deep learning-based application that generates text summaries from video inputs. The project features a Streamlit frontend for video upload and a FastAPI backend deployed on Modal that processes videos through a sophisticated ML pipeline to generate summaries.

🎯 Features

  • 🎥 Video Upload Interface: User-friendly Streamlit frontend supporting .mp4, .mkv, and .mov formats
  • 🧠 Intelligent Frame Extraction: Automatically extracts and selects key frames from videos
  • 🤖 Deep Learning Pipeline:
    • Frame feature extraction using GoogLeNet
    • Key frame selection using R(2+1)D video model
    • Caption generation using BLIP-2 (Salesforce/blip2-opt-2.7b)
    • Optional API-based final summarization with OpenRouter
  • ☁️ Cloud Deployment: Backend hosted on Modal (serverless with GPU support), frontend on Streamlit Cloud
  • 🧹 Automatic Cleanup: Temporary files and old uploads are automatically managed
  • 🔒 Secure Configuration: Environment-based configuration for API keys and secrets

πŸ—οΈ Architecture

┌─────────────────┐          ┌──────────────────┐          ┌─────────────────┐
│  Streamlit      │          │   Modal          │          │   ML Pipeline   │
│  Frontend       │─────────▶│   Backend        │─────────▶│   (PyTorch)     │
│  (Cloud/Local)  │  HTTP    │   (FastAPI)      │          │   • GoogLeNet   │
│                 │          │   • GPU Support  │          │   • R(2+1)D     │
│                 │          │   • Auto-scaling │          │   • BLIP-2      │
└─────────────────┘          └──────────────────┘          └─────────────────┘

Frontend (app/)

  • Streamlit Application: Web-based UI for video upload and result display
  • Handles file upload, displays processing status, and manages user sessions
  • Automatic cleanup of old uploads (24+ hours)
  • Configurable backend API URL (Modal deployment)

Backend (backend/)

  • FastAPI Server: RESTful API for video processing
  • Modal Deployment: Serverless deployment with GPU support (T4/A10G)
  • ML Pipeline: Complete deep learning workflow for video-to-text summarization
    • Video frame extraction (OpenCV)
    • Feature extraction (GoogLeNet)
    • Key frame selection (R(2+1)D video model)
    • Caption generation (BLIP-2)
    • Optional API summarization (OpenRouter)

🛠️ Technology Stack

Core Framework

  • Python 3.11+
  • FastAPI - Modern, fast web framework
  • Streamlit - Rapid web app development

Deployment & Infrastructure

  • Modal - Serverless GPU platform for the backend
  • Streamlit Cloud - Hosting for the frontend

Deep Learning & ML

  • PyTorch - Deep learning framework
  • Torchvision - Pre-trained models (GoogLeNet, R(2+1)D)
  • Transformers (HuggingFace) - BLIP-2 model

Computer Vision

  • OpenCV - Video processing and frame extraction
  • Pillow (PIL) - Image processing

Utilities

  • Uvicorn - ASGI server
  • python-dotenv - Environment variable management
  • Requests - HTTP client

📋 Requirements

  • Python: 3.11+
  • GPU: Optional but recommended for faster processing (T4 available on Modal)
  • Memory: Minimum 4GB RAM (8GB+ recommended for local development)
  • Disk Space: For video storage and model cache (~5-10GB for models)

🚀 Quick Start

Local Development Setup

1. Clone the Repository

git clone <repository-url>
cd Video-Summary-Generator

2. Create Virtual Environment

python -m venv .venv

# On Windows
.venv\Scripts\activate

# On Linux/Mac
source .venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Run Locally

Backend (local):

cd backend
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Frontend (local):

cd app
# Set backend URL (for local backend)
export BACKEND_API_URL=http://localhost:8000  # Linux/Mac
# OR
set BACKEND_API_URL=http://localhost:8000  # Windows CMD
# OR
$env:BACKEND_API_URL="http://localhost:8000"  # Windows PowerShell

streamlit run main.py

☁️ Cloud Deployment

Deploy Backend to Modal

  1. Install Modal CLI:

pip install modal
modal token new  # Authenticate

  2. Deploy Backend:

modal deploy backend/modal_app.py

  3. Get Your Modal URL: After deployment, Modal provides a URL like:

https://your-username--video-summary-generator-fastapi-app.modal.run

  4. Configure Secrets (Optional):

modal secret create env \
  API_URL=https://your-api-url.com \
  API_KEY=your_api_key \
  MODEL_NAME=tngtech/deepseek-r1t2-chimera:free

Deploy Frontend to Streamlit Cloud

  1. Push code to GitHub

  2. Configure Streamlit Cloud:

    • Go to streamlit.io/cloud
    • Click "New app"
    • Connect your GitHub repository
    • Set Main file path: app/main.py
  3. Add Secrets: In Streamlit Cloud settings, add:

    BACKEND_API_URL = "https://your-modal-url.modal.run"
  4. Deploy!

Test Modal Backend Locally

You can run the frontend locally and connect it to your Modal backend:

# Set Modal backend URL
export BACKEND_API_URL="https://your-username--video-summary-generator-fastapi-app.modal.run"

# Run frontend
cd app
streamlit run main.py

📖 Usage

Using the Application

  1. Upload Video: Use the Streamlit frontend to upload a video file (.mp4, .mkv, or .mov)
  2. Process: Click the "Process Video" button
  3. Wait: Processing can take 5-30 minutes depending on video length (first request loads models)
  4. View Results: The generated summary will be displayed

API Usage (Direct)

You can also call the Modal API directly:

curl -X POST "https://your-modal-url.modal.run/process_upload" \
  -H "accept: application/json" \
  -F "file=@your_video.mp4"

Response:

{
  "status": "success",
  "run_id": "uuid-here",
  "summary": "Generated summary text...",
  "original_filename": "your_video.mp4",
  "file_size_bytes": 12345678
}
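
The same call from Python, as a minimal sketch using the Requests library (replace the placeholder URL with your Modal deployment URL):

import requests

# Placeholder URL: substitute your own Modal deployment
API_URL = "https://your-modal-url.modal.run/process_upload"

# Upload the video as multipart/form-data, mirroring the curl example above
with open("your_video.mp4", "rb") as f:
    response = requests.post(API_URL, files={"file": f}, timeout=1800)

response.raise_for_status()
print(response.json()["summary"])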

📁 Project Structure

Video-Summary-Generator/
│
├── app/                        # Streamlit Frontend
│   ├── main.py                 # Main Streamlit application
│   ├── uploads/                # User uploads (auto-created, gitignored)
│   └── .streamlit/             # Streamlit configuration
│       └── secrets.toml        # Secrets for local use (gitignored)
│
├── backend/                    # FastAPI Backend
│   ├── main.py                 # FastAPI application
│   ├── pipeline.py             # ML pipeline implementation
│   ├── modal_app.py            # Modal deployment configuration
│   ├── uploads/                # Processed videos (gitignored)
│   └── .env                    # Environment variables (gitignored)
│
├── notebooks/                  # Development notebooks
│   └── notebook1.ipynb         # Original pipeline development
│
├── requirements.txt            # Root-level dependencies
├── .gitignore                  # Git ignore rules
├── LICENSE                     # License file
└── README.md                   # This file

🔧 Configuration

Frontend Configuration

Local (Environment Variable):

export BACKEND_API_URL="https://your-modal-url.modal.run"

Local (Streamlit secrets): Create app/.streamlit/secrets.toml:

BACKEND_API_URL = "https://your-modal-url.modal.run"

Streamlit Cloud: Add secret in dashboard with key BACKEND_API_URL
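
As a rough sketch of how the frontend can resolve this setting (the exact lookup logic lives in app/main.py and may differ), preferring Streamlit secrets with an environment-variable fallback:

import os
import streamlit as st

# Illustrative helper: check Streamlit secrets first, then the environment
def get_backend_url() -> str:
    if "BACKEND_API_URL" in st.secrets:
        return st.secrets["BACKEND_API_URL"]
    return os.environ.get("BACKEND_API_URL", "http://localhost:8000")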

Backend Configuration (Modal)

Modal Secrets: Set via Modal dashboard or CLI:

modal secret create env \
  API_URL=https://your-api-url.com \
  API_KEY=your_api_key \
  MODEL_NAME=tngtech/deepseek-r1t2-chimera:free

Note: API configuration is optional. If not provided, the pipeline returns combined frame captions without final API summarization.

Pipeline Parameters

Customize in backend/pipeline.py:

  • frame_skip: Extract every Nth frame (default: 30)
  • importance_threshold: Threshold for frame selection (default: 0.5)
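
For example, a hypothetical invocation showing how these parameters might be passed (the actual function name and signature in backend/pipeline.py may differ):

# Hypothetical call; check backend/pipeline.py for the real entry point
summary = run_pipeline(
    video_path="uploads/video.mp4",
    frame_skip=30,              # extract every 30th frame
    importance_threshold=0.5,   # minimum score for key frame selection
)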

Modal Configuration

Edit backend/modal_app.py:

  • gpu: GPU type ("T4", "A10G", "A100", or None for CPU)
  • timeout: Request timeout in seconds (default: 1800)
  • memory: Memory allocation in MB (default: 8192)
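
These options map onto Modal's function decorator. A minimal sketch, assuming the standard Modal ASGI pattern (names and image contents are illustrative, not the exact contents of backend/modal_app.py):

import modal

app = modal.App("video-summary-generator")

# Illustrative image; the real one installs the full requirements.txt
image = modal.Image.debian_slim().pip_install("fastapi", "torch", "opencv-python-headless")

@app.function(image=image, gpu="T4", timeout=1800, memory=8192)
@modal.asgi_app()
def fastapi_app():
    from main import app as api  # the FastAPI instance from backend/main.py
    return api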

🔌 API Endpoints

GET /

Health check endpoint.

Response:

{
  "status": "ok",
  "message": "Video Summary Generator API is running",
  "version": "1.0.0"
}

GET /health

Detailed health check.

Response:

{
  "status": "healthy",
  "uploads_directory": "/path/to/uploads",
  "uploads_directory_exists": true
}

POST /process_upload

Process an uploaded video and generate a summary.

Request:

  • Method: POST
  • Content-Type: multipart/form-data
  • Body: Video file (.mp4, .mkv, or .mov)
  • Max file size: 500 MB

Response:

{
  "status": "success",
  "run_id": "uuid-here",
  "summary": "Generated summary text...",
  "original_filename": "video.mp4",
  "file_size_bytes": 12345678
}

Error Responses:

  • 400: Unsupported file format or invalid request
  • 413: File size exceeds limit
  • 500: Processing error
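
A condensed sketch of how the upload endpoint might be wired up in FastAPI (simplified; the real backend/main.py adds file saving, cleanup, and the ML pipeline call):

from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()
ALLOWED_EXTENSIONS = (".mp4", ".mkv", ".mov")

@app.post("/process_upload")
async def process_upload(file: UploadFile = File(...)):
    # Reject unsupported formats with a 400, as documented above
    if not file.filename.lower().endswith(ALLOWED_EXTENSIONS):
        raise HTTPException(status_code=400, detail="Unsupported file format")
    # Placeholder: save the upload, run the pipeline, build the real response
    return {"status": "success", "summary": "..."}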

🧠 ML Pipeline Details

The pipeline consists of several stages:

  1. Frame Extraction: Uses OpenCV to extract frames at configurable intervals (every 30th frame by default)
  2. Feature Extraction: GoogLeNet extracts 1024-dimensional feature vectors from each frame
  3. Key Frame Selection: R(2+1)D video model analyzes temporal clips to select ~25 most important frames
  4. Caption Generation: BLIP-2 (2.7B parameter model) generates detailed natural language captions for key frames
  5. Summarization: Optional OpenRouter API call for final summary refinement and narrative coherence
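
To make the first two stages concrete, here is a minimal sketch of frame extraction and GoogLeNet feature extraction (simplified relative to backend/pipeline.py):

import cv2
import torch
from torchvision import models, transforms

def extract_frames(video_path, frame_skip=30):
    """Grab every Nth frame from the video with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_skip == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

# GoogLeNet as a 1024-dim feature extractor: replace the classifier with identity
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)
googlenet.fc = torch.nn.Identity()
googlenet.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    batch = torch.stack([preprocess(f) for f in frames])
    return googlenet(batch)  # shape: (num_frames, 1024)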

🧹 Cleanup

The application includes automatic cleanup features:

  • Frontend: Old uploads (24+ hours) are automatically removed on app start
  • Backend: Temporary processing files are cleaned up after each run
  • Manual Cleanup: Use the sidebar in the frontend to manually clean uploads
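
The 24-hour cleanup rule amounts to a few lines of standard-library Python; a sketch of the idea (the actual implementation in app/main.py may differ):

import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60  # 24 hours

def cleanup_uploads(upload_dir="uploads"):
    """Delete uploaded files older than 24 hours."""
    now = time.time()
    for path in Path(upload_dir).glob("*"):
        if path.is_file() and now - path.stat().st_mtime > MAX_AGE_SECONDS:
            path.unlink()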

πŸ› Troubleshooting

Modal Deployment Issues

Server not responding:

# Check logs
modal app logs video-summary-generator

# Test health endpoint
curl https://your-url.modal.run/health

# Redeploy
modal deploy backend/modal_app.py

Cold start delays:

  • First request after inactivity may take 30-60 seconds (model loading)
  • This is normal for serverless deployments

GPU not available:

  • Edit backend/modal_app.py and set gpu=None for CPU-only

Frontend Connection Issues

Backend not connecting:

  • Verify BACKEND_API_URL is set correctly
  • Test Modal URL: curl https://your-url.modal.run/health
  • Check CORS (Modal handles this automatically)

📊 Performance Expectations

  • Frame Extraction: ~1-2 seconds per minute of video
  • Feature Extraction: ~2-5 seconds per frame (CPU) / ~0.5-1 second (GPU)
  • Caption Generation: ~1-2 seconds per key frame
  • Total Time:
    • CPU: ~5-10 minutes for a 1-minute video
    • GPU (T4): ~2-5 minutes for a 1-minute video

📝 License

See LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📞 Support

For issues and questions, please open an issue on the GitHub repository.

🙏 Acknowledgments

  • GoogLeNet: Pre-trained model for feature extraction
  • R(2+1)D: Video understanding model for temporal analysis
  • BLIP-2: Salesforce's advanced image captioning model
  • Modal: Serverless GPU infrastructure
  • Streamlit: Rapid web app framework
  • OpenRouter: API gateway for LLM summarization

Note: This project is under active development. The pipeline may take several minutes to process videos depending on length and hardware capabilities. First request on Modal may experience cold start delays while models are loaded.
