A deep learning-based application that generates text summaries from video inputs. The project features a Streamlit frontend for video upload and a FastAPI backend deployed on Modal that processes videos through a sophisticated ML pipeline to generate summaries.
- Video Upload Interface: User-friendly Streamlit frontend supporting `.mp4`, `.mkv`, and `.mov` formats
- Intelligent Frame Extraction: Automatically extracts and selects key frames from videos
- Deep Learning Pipeline:
  - Frame feature extraction using GoogLeNet
  - Key frame selection using the R(2+1)D video model
  - Caption generation using BLIP-2 (Salesforce/blip2-opt-2.7b)
  - Optional API-based final summarization with OpenRouter
- Cloud Deployment: Backend hosted on Modal (serverless with GPU support), frontend on Streamlit Cloud
- Automatic Cleanup: Temporary files and old uploads are automatically managed
- Secure Configuration: Environment-based configuration for API keys and secrets
```
┌──────────────────┐        ┌───────────────────┐        ┌──────────────────┐
│    Streamlit     │        │       Modal       │        │   ML Pipeline    │
│    Frontend      │───────▶│      Backend      │───────▶│    (PyTorch)     │
│  (Cloud/Local)   │  HTTP  │     (FastAPI)     │        │  • GoogLeNet     │
│                  │        │  • GPU Support    │        │  • R(2+1)D       │
│                  │        │  • Auto-scaling   │        │  • BLIP-2        │
└──────────────────┘        └───────────────────┘        └──────────────────┘
```
- Streamlit Application: Web-based UI for video upload and result display (a minimal sketch follows this list)
  - Handles file upload, displays processing status, and manages user sessions
  - Automatic cleanup of old uploads (24+ hours)
  - Configurable backend API URL (Modal deployment)
- FastAPI Server: RESTful API for video processing
- Modal Deployment: Serverless deployment with GPU support (T4/A10G)
- ML Pipeline: Complete deep learning workflow for video-to-text summarization
  - Video frame extraction (OpenCV)
  - Feature extraction (GoogLeNet)
  - Key frame selection (R(2+1)D video model)
  - Caption generation (BLIP-2)
  - Optional API summarization (OpenRouter)
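The snippet below is a minimal sketch of the frontend-to-backend hand-off described above, not the project's actual `app/main.py`. It assumes `BACKEND_API_URL` is provided via the environment or Streamlit secrets, and that the backend exposes the `POST /process_upload` endpoint documented later in this README.

```python
# Minimal sketch of the upload hand-off (illustrative, not the actual app/main.py).
import os

import requests
import streamlit as st

# Backend URL from environment variable or Streamlit secrets (see Configuration).
backend_url = os.environ.get("BACKEND_API_URL") or st.secrets.get("BACKEND_API_URL")

uploaded = st.file_uploader("Upload a video", type=["mp4", "mkv", "mov"])
if uploaded is not None and st.button("Process Video"):
    with st.spinner("Processing video... this can take several minutes"):
        resp = requests.post(
            f"{backend_url}/process_upload",
            files={"file": (uploaded.name, uploaded.getvalue(), "video/mp4")},
            timeout=1800,  # generous timeout: the pipeline can run for many minutes
        )
    if resp.ok:
        st.subheader("Summary")
        st.write(resp.json()["summary"])
    else:
        st.error(f"Backend error {resp.status_code}: {resp.text}")
```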
- PyTorch - Deep learning framework
- Torchvision - Pre-trained models (GoogLeNet, R(2+1)D)
- Transformers (HuggingFace) - BLIP-2 model
- Python: 3.8+
- GPU: Optional but recommended for faster processing (T4 available on Modal)
- Memory: Minimum 4GB RAM (8GB+ recommended for local development)
- Disk Space: For video storage and model cache (~5-10GB for models)
```bash
git clone <repository-url>
cd Video-Summary-Generator

python -m venv .venv

# On Windows
.venv\Scripts\activate

# On Linux/Mac
source .venv/bin/activate

pip install -r requirements.txt
```

Backend (local):
```bash
cd backend
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

Frontend (local):
```bash
cd app

# Set backend URL (for local backend)
export BACKEND_API_URL=http://localhost:8000          # Linux/Mac
# OR
set BACKEND_API_URL=http://localhost:8000             # Windows CMD
# OR
$env:BACKEND_API_URL="http://localhost:8000"          # Windows PowerShell

streamlit run main.py
```

- Install Modal CLI:
```bash
pip install modal
modal token new   # Authenticate
```

- Deploy Backend:

```bash
modal deploy backend/modal_app.py
```

- Get Your Modal URL: After deployment, Modal provides a URL like:

```
https://your-username--video-summary-generator-fastapi-app.modal.run
```
- Configure Secrets (Optional):

```bash
modal secret create env \
  API_URL=https://your-api-url.com \
  API_KEY=your_api_key \
  MODEL_NAME=tngtech/deepseek-r1t2-chimera:free
```
- Push code to GitHub
- Configure Streamlit Cloud:
  - Go to streamlit.io/cloud
  - Click "New app"
  - Connect your GitHub repository
  - Set Main file path: `app/main.py`
- Add Secrets: In Streamlit Cloud settings, add:

```toml
BACKEND_API_URL = "https://your-modal-url.modal.run"
```

- Deploy!
You can run the frontend locally and connect it to your Modal backend:
```bash
# Set Modal backend URL
export BACKEND_API_URL="https://your-username--video-summary-generator-fastapi-app.modal.run"

# Run frontend
cd app
streamlit run main.py
```

- Upload Video: Use the Streamlit frontend to upload a video file (`.mp4`, `.mkv`, or `.mov`)
- Process: Click "Process Video" button
- Wait: Processing can take 5-30 minutes depending on video length (first request loads models)
- View Results: The generated summary will be displayed
You can also call the Modal API directly:
```bash
curl -X POST "https://your-modal-url.modal.run/process_upload" \
  -H "accept: application/json" \
  -F "file=@your_video.mp4"
```

Response:

```json
{
  "status": "success",
  "run_id": "uuid-here",
  "summary": "Generated summary text...",
  "original_filename": "your_video.mp4",
  "file_size_bytes": 12345678
}
```
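The same request can be scripted from Python; this is a sketch that assumes the response shape shown above.

```python
# Call the deployed endpoint from Python (sketch; response fields as documented above).
import requests

MODAL_URL = "https://your-modal-url.modal.run"  # replace with your deployment URL

with open("your_video.mp4", "rb") as f:
    resp = requests.post(
        f"{MODAL_URL}/process_upload",
        headers={"accept": "application/json"},
        files={"file": ("your_video.mp4", f, "video/mp4")},
        timeout=1800,  # processing can take many minutes
    )

resp.raise_for_status()
result = resp.json()
print(result["run_id"])
print(result["summary"])
```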
```
Video-Summary-Generator/
│
├── app/                      # Streamlit Frontend
│   ├── main.py               # Main Streamlit application
│   ├── uploads/              # User uploads (auto-created, gitignored)
│   └── .streamlit/           # Streamlit configuration
│       └── secrets.toml      # Secrets for local (gitignored)
│
├── backend/                  # FastAPI Backend
│   ├── main.py               # FastAPI application
│   ├── pipeline.py           # ML pipeline implementation
│   ├── modal_app.py          # Modal deployment configuration
│   ├── uploads/              # Processed videos (gitignored)
│   └── .env                  # Environment variables (gitignored)
│
├── notebooks/                # Development notebooks
│   └── notebook1.ipynb       # Original pipeline development
│
├── requirements.txt          # Root-level dependencies
├── .gitignore                # Git ignore rules
├── LICENSE                   # License file
└── README.md                 # This file
```
Local (Environment Variable):

```bash
export BACKEND_API_URL="https://your-modal-url.modal.run"
```

Local (Streamlit secrets):
Create `app/.streamlit/secrets.toml`:

```toml
BACKEND_API_URL = "https://your-modal-url.modal.run"
```

Streamlit Cloud:
Add a secret in the dashboard with the key `BACKEND_API_URL`.
Modal Secrets: Set via Modal dashboard or CLI:
```bash
modal secret create env \
  API_URL=https://your-api-url.com \
  API_KEY=your_api_key \
  MODEL_NAME=tngtech/deepseek-r1t2-chimera:free
```

Note: API configuration is optional. If not provided, the pipeline returns combined frame captions without final API summarization.
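For illustration, one plausible way the pipeline can guard this optional step (the real logic lives in `backend/pipeline.py`): it reuses the env var names from the secret above and assumes `API_URL` points at an OpenAI-compatible API base such as OpenRouter's.

```python
# Illustrative guard for the optional summarization step (not the actual pipeline.py).
# Assumes API_URL is an OpenAI-compatible chat completions base URL (e.g. OpenRouter).
import os
from typing import List

import requests

def summarize_captions(captions: List[str]) -> str:
    api_url = os.environ.get("API_URL")
    api_key = os.environ.get("API_KEY")
    model = os.environ.get("MODEL_NAME")
    combined = " ".join(captions)

    # Without API configuration, fall back to the combined frame captions.
    if not (api_url and api_key and model):
        return combined

    resp = requests.post(
        f"{api_url}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,
            "messages": [
                {"role": "user", "content": f"Summarize this video description: {combined}"}
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```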
Customize in `backend/pipeline.py`:

- `frame_skip`: Extract every Nth frame (default: 30)
- `importance_threshold`: Threshold for key frame selection (default: 0.5)
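As a rough illustration of how these two knobs are typically applied (the actual implementation in `pipeline.py` may differ):

```python
# Illustrative use of frame_skip and importance_threshold.
import cv2  # OpenCV

def extract_frames(video_path: str, frame_skip: int = 30):
    """Keep every Nth frame from the video."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_skip == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

def select_key_frames(frames, scores, importance_threshold: float = 0.5):
    """Keep frames whose importance score clears the threshold."""
    return [f for f, s in zip(frames, scores) if s >= importance_threshold]
```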
Edit `backend/modal_app.py`:

- `gpu`: GPU type (`"T4"`, `"A10G"`, `"A100"`, or `None` for CPU)
- `timeout`: Request timeout in seconds (default: 1800)
- `memory`: Memory allocation in MB (default: 8192)
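These options are normally passed to the Modal function that serves the FastAPI app. The following is a schematic excerpt, not the project's actual `modal_app.py`; the image definition and the import inside the function are assumptions.

```python
# Schematic excerpt (not the project's actual modal_app.py).
import modal

app = modal.App("video-summary-generator")

# Assumed image: the real file may pin versions or add system packages.
image = modal.Image.debian_slim().pip_install_from_requirements("requirements.txt")

@app.function(
    image=image,
    gpu="T4",        # "T4", "A10G", "A100", or None for CPU-only
    timeout=1800,    # request timeout in seconds
    memory=8192,     # memory allocation in MB
)
@modal.asgi_app()
def fastapi_app():
    # Assumes backend/main.py (the FastAPI application) is importable as `main`.
    from main import app as api
    return api
```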
Health check endpoint.
Response:
```json
{
  "status": "ok",
  "message": "Video Summary Generator API is running",
  "version": "1.0.0"
}
```

Detailed health check.
Response:
```json
{
  "status": "healthy",
  "uploads_directory": "/path/to/uploads",
  "uploads_directory_exists": true
}
```

Process an uploaded video and generate a summary (`POST /process_upload`).
Request:
- Method: `POST`
- Content-Type: `multipart/form-data`
- Body: Video file (`.mp4`, `.mkv`, or `.mov`)
- Max file size: 500 MB
Response:
```json
{
  "status": "success",
  "run_id": "uuid-here",
  "summary": "Generated summary text...",
  "original_filename": "video.mp4",
  "file_size_bytes": 12345678
}
```

Error Responses:

- `400`: Unsupported file format or invalid request
- `413`: File size exceeds limit
- `500`: Processing error
The pipeline consists of several stages:
- Frame Extraction: Uses OpenCV to extract frames at configurable intervals (every 30th frame by default)
- Feature Extraction: GoogLeNet extracts 1024-dimensional feature vectors from each frame
- Key Frame Selection: The R(2+1)D video model analyzes temporal clips to select the ~25 most important frames
- Caption Generation: BLIP-2 (2.7B parameter model) generates detailed natural language captions for key frames
- Summarization: Optional OpenRouter API call for final summary refinement and narrative coherence
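A condensed, illustrative sketch of the feature-extraction and captioning stages follows (key frame selection with R(2+1)D is omitted here, and the real `backend/pipeline.py` differs in detail):

```python
# Illustrative sketch of stages 2 and 4; not the actual backend/pipeline.py.
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stage 2: GoogLeNet with its classifier removed yields 1024-d features per frame.
googlenet = models.googlenet(weights="IMAGENET1K_V1").eval().to(device)
googlenet.fc = torch.nn.Identity()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Stage 4: BLIP-2 turns each selected key frame into a natural-language caption.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

@torch.no_grad()
def frame_features(frame: Image.Image) -> torch.Tensor:
    """1024-dimensional GoogLeNet feature vector for one frame."""
    return googlenet(preprocess(frame).unsqueeze(0).to(device)).squeeze(0)

@torch.no_grad()
def caption_frame(frame: Image.Image) -> str:
    """BLIP-2 caption for one key frame."""
    inputs = processor(images=frame, return_tensors="pt").to(device, dtype)
    out = blip2.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True).strip()
```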
The application includes automatic cleanup features:
- Frontend: Old uploads (24+ hours) are automatically removed on app start
- Backend: Temporary processing files are cleaned up after each run
- Manual Cleanup: Use the sidebar in the frontend to manually clean uploads
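The 24-hour rule can be as simple as a modification-time check on startup; the sketch below is illustrative, not the project's exact cleanup code, and assumes uploads live in the `uploads/` directories shown in the project structure.

```python
# Sketch of the 24-hour upload cleanup (illustrative only).
import time
from pathlib import Path

def cleanup_old_uploads(upload_dir: str = "uploads", max_age_hours: float = 24) -> int:
    """Delete files older than max_age_hours; returns how many were removed."""
    cutoff = time.time() - max_age_hours * 3600
    removed = 0
    for path in Path(upload_dir).glob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```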
Server not responding:
```bash
# Check logs
modal app logs video-summary-generator

# Test health endpoint
curl https://your-url.modal.run/health

# Redeploy
modal deploy backend/modal_app.py
```

Cold start delays:
- First request after inactivity may take 30-60 seconds (model loading)
- This is normal for serverless deployments
GPU not available:
- Edit `backend/modal_app.py` and set `gpu=None` for CPU-only
Backend not connecting:
- Verify `BACKEND_API_URL` is set correctly
- Test the Modal URL: `curl https://your-url.modal.run/health`
- Check CORS (Modal handles this automatically)
- Frame Extraction: ~1-2 seconds per minute of video
- Feature Extraction: ~2-5 seconds per frame (CPU) / ~0.5-1 second (GPU)
- Caption Generation: ~1-2 seconds per key frame
- Total Time:
  - CPU: ~5-10 minutes for a 1-minute video
  - GPU (T4): ~2-5 minutes for a 1-minute video
See LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues and questions:
- Open an issue on the repository
- Check Modal documentation: modal.com/docs
- Check Streamlit documentation: docs.streamlit.io
- GoogLeNet: Pre-trained model for feature extraction
- R(2+1)D: Video understanding model for temporal analysis
- BLIP-2: Salesforce's advanced image captioning model
- Modal: Serverless GPU infrastructure
- Streamlit: Rapid web app framework
- OpenRouter: API gateway for LLM summarization
Note: This project is under active development. The pipeline may take several minutes to process videos depending on length and hardware capabilities. First request on Modal may experience cold start delays while models are loaded.