This project implements an Image Captioning model using Deep Learning techniques. The model takes an image as input and generates a meaningful caption describing its content. It was built as part of my Machine Learning internship.
The project was developed from scratch, covering dataset collection, preprocessing, feature extraction, model training, and finally testing and evaluation. All code was written in VS Code and executed in Google Colab to take advantage of GPU acceleration.
The dataset used is Flickr8k, which contains 8,000 images with five captions per image. I downloaded it from Kaggle, a popular platform for datasets and machine learning competitions. The dataset consists of:
- Flickr8k_Dataset (Images)
- Flickr8k_text (Captions file)
To keep the project well structured, I created and managed the following files in VS Code:
- `train.py` → Contains the model training code.
- `load_model_test.py` → Loads the trained model and tests it on new images.
- `extract_features.py` → Extracts image features using a pretrained CNN.
- `requirements.txt` → Lists all dependencies required to run the project.
- `README.md` → This file, which documents the project.
- `data/` → Stores the Flickr8k dataset and extracted features.
- `models/` → Stores the trained model files (`.keras` and `.h5`).
- `scripts/` → Stores helper scripts used during development.
- Loaded image captions from the text file.
- Performed text preprocessing: lowercasing, removing punctuation, and tokenizing.
- Created a vocabulary and mapped each word to an index.
- Used a pretrained CNN (InceptionV3) to extract image features.
- Saved the extracted features in a pickle file for faster processing.
- Used a Tokenizer to encode captions into integer sequences.
- Applied padding and created input-output sequences (see the sketch after this list).
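A minimal sketch of these feature-extraction and caption-encoding steps, assuming TensorFlow/Keras. The file names, paths, and the example caption below are illustrative, not the project's actual ones:

```python
# Sketch of feature extraction and caption encoding.
# Assumes TensorFlow/Keras is installed; paths and sample data are illustrative.
import pickle
import string
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model

# 1) Image features: InceptionV3 without its classification head (2048-d vectors).
base = InceptionV3(weights="imagenet")
encoder = Model(base.input, base.layers[-2].output)

def extract_feature(path):
    img = load_img(path, target_size=(299, 299))              # InceptionV3 input size
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0)[0]                    # shape: (2048,)

features = {"example.jpg": extract_feature("example.jpg")}     # illustrative image
with open("features.pkl", "wb") as f:
    pickle.dump(features, f)

# 2) Caption cleaning: lowercase, strip punctuation, add start/end tokens.
def clean(caption):
    caption = caption.lower().translate(str.maketrans("", "", string.punctuation))
    return "startseq " + " ".join(caption.split()) + " endseq"

captions = [clean("A dog runs through the grass.")]            # illustrative caption

# 3) Encode captions as padded integer sequences.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
seqs = tokenizer.texts_to_sequences(captions)
max_len = max(len(s) for s in seqs)
padded = pad_sequences(seqs, maxlen=max_len, padding="post")
print(padded.shape)
```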
The model architecture consists of:
- CNN (InceptionV3): Extracts image features.
- LSTM (Long Short-Term Memory): Generates captions based on extracted features.
- Embedding Layer: Converts words into dense vectors.
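A hedged sketch of this CNN + LSTM setup as a Keras "merge" model; the vocabulary size, sequence length, and layer widths below are placeholder values, not the trained model's actual configuration:

```python
# Sketch of the CNN-feature + LSTM caption decoder ("merge" architecture).
# vocab_size, max_len, and layer sizes are placeholder values.
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 35   # placeholders

# Image branch: the 2048-d InceptionV3 feature vector, projected to 256 dims.
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: caption-so-far as an integer sequence -> embedding -> LSTM.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both branches and predict the next word.
merged = add([img_vec, txt_vec])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.summary()
```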
- Used categorical cross-entropy loss and the Adam optimizer.
- Trained for multiple epochs, monitoring loss improvement.
- Saved the model in both `.keras` and `.h5` formats.
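A brief sketch of the compile/train/save step, assuming the `model` from the architecture sketch above and pre-built training arrays; the array names here are illustrative:

```python
# Compile, train, and save - assumes `model` from the architecture sketch and
# pre-built arrays X_img (image features), X_seq (padded input sequences), and
# y (one-hot next-word targets); these names are illustrative.
model.compile(loss="categorical_crossentropy", optimizer="adam")

model.fit([X_img, X_seq], y, epochs=20, batch_size=64, verbose=1)

# Save in both the Keras-native and legacy HDF5 formats.
model.save("models/caption_model.keras")
model.save("models/caption_model.h5")
```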
- Loaded the trained model from saved files.
- Given an input image, generated a caption using Beam Search.
- Compared predicted captions with ground-truth captions.
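A simplified beam-search decoding sketch, assuming the trained `model`, the fitted `tokenizer`, the `max_len` used during training, and a `(1, 2048)` image feature vector `photo`; the function name and beam width are illustrative:

```python
# Simplified beam-search caption generation.
# Assumes: trained `model`, fitted `tokenizer`, `max_len`, and a (1, 2048)
# image feature vector `photo`; names and beam width are illustrative.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, tokenizer, photo, max_len, beam_width=3):
    start = [tokenizer.word_index["startseq"]]
    # Each beam is (token_id_sequence, cumulative log-probability).
    beams = [(start, 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if tokenizer.index_word.get(seq[-1]) == "endseq":
                candidates.append((seq, score))    # finished caption, keep as-is
                continue
            padded = pad_sequences([seq], maxlen=max_len, padding="post")
            probs = model.predict([photo, padded], verbose=0)[0]
            # Expand this beam with the top-k next words.
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-12)))
        # Keep only the best beams.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best = beams[0][0]
    words = [tokenizer.index_word[i] for i in best]
    return " ".join(w for w in words if w not in ("startseq", "endseq"))
```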
- Clone the repository:
  ```powershell
  git clone https://github.com/DataProjectHub/CODSOFT_Image_Captioning.git
  cd CODSOFT_Image_Captioning
  ```
- Install dependencies:
  ```powershell
  pip install -r requirements.txt
  ```
- Extract image features:
  ```powershell
  python extract_features.py
  ```
- Train the model:
  ```powershell
  python train.py
  ```
- Test the model:
  ```powershell
  python load_model_test.py --image "sample.jpg"
  ```
- Handling large dataset sizes on local machines.
- Fine-tuning the LSTM model for better caption generation.
- Experimenting with different hyperparameters for better accuracy.
- Implement Transformer-based models like BLIP or ViT-GPT.
- Use larger datasets for more diverse captions.
- Deploy the model as a web application.
This project was a great learning experience in Computer Vision and Natural Language Processing (NLP). It helped me understand how to integrate CNNs and LSTMs for sequential data processing. I look forward to improving this model further!