The primary objective of this project is to create a system that can understand and respond to natural language queries about video content, enabling users to find precise moments within videos without having to manually search through hours of footage.
Our work is divided into three phases:
- In Phase 1, we establish the foundation by building an index of video moments using dual encoders and OpenSearch (see the indexing and search sketch after this list).
- Phase 2 extends the system with multimodal encoders and decoders, enabling it to handle cross-modal queries and answer natural language questions about video content.
- Phase 3 focuses on moment detection: the user provides the URL of a video, and the system returns a description of each detected moment together with its start and end timestamps (a sketch of this output format also follows the list).
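To make the Phase 1 and Phase 2 ideas concrete, here is a minimal sketch of embedding moments and queries into a shared space with a dual encoder, indexing the moment embeddings in OpenSearch, and answering a text query with a k-NN search. The encoder checkpoint (sentence-transformers' CLIP model), the index name, and the field names are illustrative assumptions, not the project's exact configuration; the actual implementation lives in the notebooks referenced below.

```python
# Sketch only: assumes a local OpenSearch instance with the k-NN plugin and a
# CLIP-style dual encoder; the project's actual encoders/config may differ.
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
model = SentenceTransformer("clip-ViT-B-32")  # text/image dual encoder, 512-dim

INDEX = "video-moments"  # hypothetical index name
client.indices.create(
    index=INDEX,
    body={
        "settings": {"index": {"knn": True}},  # enable k-NN search
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 512},
                "video_id": {"type": "keyword"},
                "start": {"type": "float"},  # moment start time (seconds)
                "end": {"type": "float"},    # moment end time (seconds)
            }
        },
    },
)

# Index one moment; in practice the embedding would come from encoding
# the moment's video frames, not from text as in this toy example.
client.index(index=INDEX, body={
    "embedding": model.encode("a dog catching a frisbee").tolist(),
    "video_id": "demo", "start": 12.0, "end": 18.5,
})
client.indices.refresh(index=INDEX)

# Cross-modal query: encode the text and retrieve the nearest moments.
query_vec = model.encode("dog playing in the park").tolist()
hits = client.search(index=INDEX, body={
    "size": 3,
    "query": {"knn": {"embedding": {"vector": query_vec, "k": 3}}},
})
for h in hits["hits"]["hits"]:
    s = h["_source"]
    print(f"{s['video_id']}: {s['start']}s-{s['end']}s (score={h['_score']:.3f})")
```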
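For Phase 3, the sketch below illustrates only the shape of the expected output: one timestamped description per detected moment. The `Moment` structure and the rendering helper are hypothetical names introduced for illustration; the real pipeline and return format are in project-phase-3.ipynb.

```python
# Placeholder data only: no real detection is run here.
from typing import TypedDict

class Moment(TypedDict):
    description: str
    start: float  # seconds
    end: float    # seconds

def print_moments(moments: list[Moment]) -> None:
    """Render detected moments in the timestamped format described above."""
    for m in moments:
        print(f"[{m['start']:7.1f}s - {m['end']:7.1f}s] {m['description']}")

print_moments([
    {"description": "A dog catches a frisbee", "start": 12.0, "end": 18.5},
    {"description": "The crowd applauds", "start": 18.5, "end": 24.0},
])
```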
Throughout the report (/Reports), we document our methodology, implementation
details, challenges encountered, and the results of our experiments.
We also provide a critical analysis of our approach, discussing both its strengths
and limitations, and suggesting possible improvements for future work.
The source code is available in the Python notebooks project-phase-1.ipynb, project-phase-2.ipynb, and project-phase-3.ipynb.