This project implements an Image Captioning model using Deep Learning techniques. The model takes an image as input and generates a meaningful caption describing its content. It was built as part of my Machine Learning internship.
The project was developed from scratch, covering dataset collection, preprocessing, feature extraction, model training, and finally testing and evaluation. All code was written in VS Code and executed in Google Colab to take advantage of GPU acceleration.
The dataset used is Flickr8k, which contains 8,000 images with five captions per image. I downloaded it from Kaggle, a popular platform for datasets and machine learning competitions. The dataset consists of:
- Flickr8k_Dataset (Images)
- Flickr8k_text (Captions file)
To keep the project well structured, I created and managed the following files in VS Code:
- `train.py` → Contains the model training code.
- `load_model_test.py` → Loads the trained model and tests it on new images.
- `extract_features.py` → Extracts image features using a pretrained CNN.
- `requirements.txt` → Lists all dependencies required to run the project.
- `README.md` → This file, which documents the project.
- `data/` → Stores the Flickr8k dataset and extracted features.
- `models/` → Stores the trained model files (`.keras` and `.h5`).
- `scripts/` → Stores helper scripts used during development.
- Loaded image captions from the text file.
- Performed text preprocessing: lowercasing, removing punctuation, and tokenizing.
- Created a vocabulary and mapped each word to an index.
- Used a pretrained CNN (InceptionV3) to extract image features.
- Saved the extracted features in a pickle file for faster processing.
- Used a Tokenizer to encode captions into integer sequences.
- Applied padding and created input-output sequences (see the sketch after this list).
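A minimal sketch of these feature-extraction and caption-encoding steps, assuming TensorFlow/Keras. The file names, paths, and the example caption below are illustrative, not the project's actual ones:

```python
# Sketch of feature extraction and caption encoding.
# Assumes TensorFlow/Keras is installed; paths and sample data are illustrative.
import pickle
import string
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model

# 1) Image features: InceptionV3 without its classification head (2048-d vectors).
base = InceptionV3(weights="imagenet")
encoder = Model(base.input, base.layers[-2].output)

def extract_feature(path):
    img = load_img(path, target_size=(299, 299))              # InceptionV3 input size
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0)[0]                    # shape: (2048,)

features = {"example.jpg": extract_feature("example.jpg")}     # illustrative image
with open("features.pkl", "wb") as f:
    pickle.dump(features, f)

# 2) Caption cleaning: lowercase, strip punctuation, add start/end tokens.
def clean(caption):
    caption = caption.lower().translate(str.maketrans("", "", string.punctuation))
    return "startseq " + " ".join(caption.split()) + " endseq"

captions = [clean("A dog runs through the grass.")]            # illustrative caption

# 3) Encode captions as padded integer sequences.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
seqs = tokenizer.texts_to_sequences(captions)
max_len = max(len(s) for s in seqs)
padded = pad_sequences(seqs, maxlen=max_len, padding="post")
print(padded.shape)
```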
The model architecture consists of:
- CNN (InceptionV3): Extracts image features.
- LSTM (Long Short-Term Memory): Generates captions based on extracted features.
- Embedding Layer: Converts words into dense vectors.
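A hedged sketch of this CNN + LSTM setup as a Keras "merge" model; the vocabulary size, sequence length, and layer widths below are placeholder values, not the trained model's actual configuration:

```python
# Sketch of the CNN-feature + LSTM caption decoder ("merge" architecture).
# vocab_size, max_len, and layer sizes are placeholder values.
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 35   # placeholders

# Image branch: the 2048-d InceptionV3 feature vector, projected to 256 dims.
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: caption-so-far as an integer sequence -> embedding -> LSTM.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both branches and predict the next word.
merged = add([img_vec, txt_vec])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.summary()
```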
- Used categorical cross-entropy loss and the Adam optimizer.
- Trained for multiple epochs, monitoring loss improvement.
- Saved the model in both `.keras` and `.h5` formats.
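A brief sketch of the compile/train/save step, assuming the `model` from the architecture sketch above and pre-built training arrays; the array names here are illustrative:

```python
# Compile, train, and save - assumes `model` from the architecture sketch and
# pre-built arrays X_img (image features), X_seq (padded input sequences), and
# y (one-hot next-word targets); these names are illustrative.
model.compile(loss="categorical_crossentropy", optimizer="adam")

model.fit([X_img, X_seq], y, epochs=20, batch_size=64, verbose=1)

# Save in both the Keras-native and legacy HDF5 formats.
model.save("models/caption_model.keras")
model.save("models/caption_model.h5")
```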
- Loaded the trained model from saved files.
- Given an input image, generated a caption using Beam Search.
- Compared predicted captions with ground-truth captions.
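A simplified beam-search decoding sketch, assuming the trained `model`, the fitted `tokenizer`, the `max_len` used during training, and a `(1, 2048)` image feature vector `photo`; the function name and beam width are illustrative:

```python
# Simplified beam-search caption generation.
# Assumes: trained `model`, fitted `tokenizer`, `max_len`, and a (1, 2048)
# image feature vector `photo`; names and beam width are illustrative.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, tokenizer, photo, max_len, beam_width=3):
    start = [tokenizer.word_index["startseq"]]
    # Each beam is (token_id_sequence, cumulative log-probability).
    beams = [(start, 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if tokenizer.index_word.get(seq[-1]) == "endseq":
                candidates.append((seq, score))    # finished caption, keep as-is
                continue
            padded = pad_sequences([seq], maxlen=max_len, padding="post")
            probs = model.predict([photo, padded], verbose=0)[0]
            # Expand this beam with the top-k next words.
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-12)))
        # Keep only the best beams.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best = beams[0][0]
    words = [tokenizer.index_word[i] for i in best]
    return " ".join(w for w in words if w not in ("startseq", "endseq"))
```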
- Clone the repository:
  ```powershell
  git clone https://github.com/DataProjectHub/CODSOFT_Image_Captioning.git
  cd CODSOFT_Image_Captioning
  ```
- Install dependencies:
  ```powershell
  pip install -r requirements.txt
  ```
- Extract image features:
  ```powershell
  python extract_features.py
  ```
- Train the model:
  ```powershell
  python train.py
  ```
- Test the model:
  ```powershell
  python load_model_test.py --image "sample.jpg"
  ```
- Handling large dataset sizes on local machines.
- Fine-tuning the LSTM model for better caption generation.
- Experimenting with different hyperparameters for better accuracy.
- Implement Transformer-based models like BLIP or ViT-GPT.
- Use larger datasets for more diverse captions.
- Deploy the model as a web application.
This project was a great learning experience in Computer Vision and Natural Language Processing (NLP). It helped me understand how to integrate CNNs and LSTMs for sequential data processing. I look forward to improving this model further!