Fake Review Detection Project

Welcome to the comprehensive documentation for our Fake Review Detection project. This project is designed to distinguish between computer-generated (fake) reviews and original (human-written) reviews by leveraging a variety of natural language processing (NLP) techniques, traditional machine learning models, and deep learning methods. In this document, we detail the project background, data processing, feature engineering, model experimentation, experiment tracking using MLflow, and finally, the deployment of our application.


Table of Contents

  1. Project Overview
  2. Data Collection & Preprocessing
  3. Feature Engineering & Embeddings
  4. Modeling Approaches
  5. Experiment Tracking with MLflow
  6. Deployment
  7. Docker Image
  8. Conclusion

Project Overview

Fake reviews are a growing problem in the online ecosystem, impacting consumer trust and business reputations. Our project aims to automatically detect and flag these computer-generated reviews by analyzing textual content. To achieve this, we have employed various techniques including:

  • Text Preprocessing: Cleaning and normalizing text data.
  • Feature Engineering: Extracting useful metrics such as lexical diversity, sentiment scores, and syntactic patterns.
  • Embeddings: Generating different embeddings using methods such as TF-IDF, Count Vectorization, and precomputed embeddings from models like BERT and GloVe.
  • Model Training: Experimenting with a range of traditional machine learning models (Logistic Regression, Random Forest, SVC) and a more complex deep learning model using a two-layer LSTM.
  • Experiment Tracking: Logging every experiment with detailed metrics, hyperparameters, and artifacts using MLflow.

Data Collection & Preprocessing

Our data consists of review texts along with corresponding labels indicating whether a review is computer-generated or original. The preprocessing pipeline includes:

  • Text Cleaning: Removing unwanted characters, punctuation, and noise.
  • Normalization: Converting text to lower case, tokenizing sentences and words, and applying techniques such as stemming, lemmatization, and stop word removal.
  • Feature Extraction: Calculating metrics like lexical diversity, average word length, sentiment polarity, subjectivity, Flesch Reading Ease, sentence length, and various part-of-speech counts.

Processed data files are stored in our ../Data/Feature-Engineered/ folder.
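
For illustration, here is a minimal sketch of the cleaning, normalization, and feature-extraction steps using NLTK, TextBlob, and textstat; the repository's exact cleaning rules and feature names may differ.

    import re

    import nltk
    import textstat
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from textblob import TextBlob

    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")

    STOP_WORDS = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(text):
        # Clean: keep letters only, lower-case, tokenize
        text = re.sub(r"[^A-Za-z\s]", " ", text).lower()
        tokens = nltk.word_tokenize(text)
        # Normalize: drop stop words, lemmatize the rest
        return [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP_WORDS]

    def extract_features(text):
        tokens = preprocess(text)
        blob = TextBlob(text)
        n = max(len(tokens), 1)
        return {
            "lexical_diversity": len(set(tokens)) / n,   # unique / total tokens
            "avg_word_length": sum(map(len, tokens)) / n,
            "sentiment_polarity": blob.sentiment.polarity,
            "subjectivity": blob.sentiment.subjectivity,
            "flesch_reading_ease": textstat.flesch_reading_ease(text),
            "sentence_count": len(nltk.sent_tokenize(text)),
        }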


Feature Engineering & Embeddings

To capture the nuances of textual data, we explored several embedding techniques:

  • TF-IDF Embeddings: Transforming text into weighted term-frequency representations.
  • Count Vectorization: Creating basic term-frequency vectors.
  • Precomputed Embeddings: Using models such as BERT and GloVe to generate embeddings, which are stored as CSV files in the ../../embeddings/ folder.

These methods provide diverse representations of the data, enabling our models to learn both syntactic and semantic patterns.
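
As a sketch (the vectorizer settings and the embedding file name below are assumptions), generating the scikit-learn representations and loading a precomputed embedding might look like this:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    texts = ["great product, arrived on time", "best deal ever buy now!!!"]

    # Weighted term-frequency representation
    tfidf_matrix = TfidfVectorizer(max_features=5000).fit_transform(texts)

    # Plain term-frequency counts
    count_matrix = CountVectorizer(max_features=5000).fit_transform(texts)

    # Precomputed embeddings stored as CSV, one row per review
    # (file name is hypothetical; see the ../../embeddings/ folder)
    bert_df = pd.read_csv("../../embeddings/bert_embeddings.csv")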

You can view all the datasets created throughout the process here:
Drive link for all the datasets used


Modeling Approaches

Traditional Machine Learning Models

We experimented with several scikit-learn models:

  • Logistic Regression
  • Random Forest Classifier
  • Support Vector Classifier (SVC)

For each model, we performed hyperparameter tuning with GridSearchCV over practical parameter grids. Experiment results, including confusion matrices and metrics (accuracy, precision, recall, and F1 score), are logged to MLflow.
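
A representative tuning loop is sketched below; the parameter grids and the synthetic stand-in data are illustrative, not the project's exact configuration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # Stand-in for the real embedding matrix and fake/original labels
    X, y = make_classification(n_samples=500, n_features=50, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
        "random_forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
        "svc": (SVC(), {"C": [0.1, 1], "kernel": ["linear", "rbf"]}),
    }

    for name, (model, grid) in models.items():
        search = GridSearchCV(model, grid, cv=5, scoring="f1")
        search.fit(X_train, y_train)
        preds = search.best_estimator_.predict(X_test)
        prec, rec, f1, _ = precision_recall_fscore_support(y_test, preds, average="binary")
        print(name, accuracy_score(y_test, preds), prec, rec, f1)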

Deep Learning Models

To capture complex patterns, we built a deep learning model using TensorFlow Keras:

  • Two-Layer LSTM: The model includes an embedding layer, two LSTM layers, and dense layers with dropout for regularization.
  • Text Tokenization & Padding: We convert raw text into sequences using Keras’ Tokenizer and pad them to a uniform length.
  • Evaluation: Model performance is evaluated on standard metrics and confusion matrices are logged.
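
A minimal sketch of this architecture follows; the vocabulary size, sequence length, and layer widths are assumptions, not the trained configuration.

    from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import Tokenizer

    train_texts = ["sample review one", "another sample review"]  # placeholder corpus

    MAX_WORDS, MAX_LEN = 20000, 200
    tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<unk>")
    tokenizer.fit_on_texts(train_texts)
    X = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=MAX_LEN)

    model = Sequential([
        Embedding(MAX_WORDS, 128),          # learned word embeddings
        LSTM(64, return_sequences=True),    # first LSTM layer passes full sequences on
        LSTM(32),                           # second LSTM layer returns the final state
        Dense(64, activation="relu"),
        Dropout(0.5),                       # dropout for regularization
        Dense(1, activation="sigmoid"),     # fake (1) vs. original (0)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])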

Experiment Tracking with MLflow

Our experiments are fully tracked using MLflow. For every run, we log:

  • Parameters: File names, model types, hyperparameters, and embedding types.
  • Metrics: Accuracy, precision, recall, and F1 score.
  • Artifacts: Confusion matrices (as PNG images) and model artifacts.
  • Datasets: Using the mlflow.data API, our dataset information is logged and appears under the MLflow UI's "Datasets" tab (for MLflow ≥ 2.4). When unavailable, the CSV files are logged as artifacts.
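
Condensed into a sketch, a single run's logging might look like the following; the run name, file names, and metric values are placeholders.

    import mlflow
    import mlflow.data
    import pandas as pd

    mlflow.set_experiment("fake-review-detection")

    with mlflow.start_run(run_name="logreg_tfidf"):
        mlflow.log_params({"file": "tfidf.csv", "model": "LogisticRegression", "embedding": "tfidf", "C": 1.0})
        # Metrics computed in the evaluation step (placeholder values here)
        mlflow.log_metrics({"accuracy": 0.91, "precision": 0.90, "recall": 0.92, "f1": 0.91})
        mlflow.log_artifact("confusion_matrix.png")

        # Dataset logging requires MLflow >= 2.4; otherwise log the CSV as a plain artifact
        df = pd.read_csv("../Data/Feature-Engineered/tfidf.csv")  # hypothetical file name
        dataset = mlflow.data.from_pandas(df, name="feature-engineered-reviews")
        mlflow.log_input(dataset, context="training")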

You can view all our experiments on Dagshub through MLflow here:
View MLflow Experiments on Dagshub

A progress log (progress_log.csv) is maintained to ensure experiments are not re-run unnecessarily.
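
A minimal sketch of that skip logic, assuming hypothetical dataset/model/embedding columns:

    import pandas as pd

    def already_run(log_path, dataset, model, embedding):
        # Returns True if this (dataset, model, embedding) combination is logged
        try:
            log = pd.read_csv(log_path)
        except FileNotFoundError:
            return False
        done = (log["dataset"] == dataset) & (log["model"] == model) & (log["embedding"] == embedding)
        return bool(done.any())

    if not already_run("progress_log.csv", "tfidf.csv", "SVC", "tfidf"):
        pass  # run the experiment, then append a row to progress_log.csv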


Deployment

The Fake Review Detection web application is deployed and accessible online. Users can enter review text to receive predictions on whether the review is computer-generated or original. The application also provides various text analytics and visualizations for better interpretability.
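
Reduced to a sketch, the Streamlit interface might look like the following; the saved-model paths and file names are hypothetical.

    import joblib
    import streamlit as st

    st.title("Fake Review Detection")

    # Hypothetical paths: a fitted vectorizer and classifier saved with joblib
    vectorizer = joblib.load("models/vectorizer.joblib")
    model = joblib.load("models/classifier.joblib")

    review = st.text_area("Paste a review to analyze")
    if st.button("Predict") and review:
        features = vectorizer.transform([review])
        label = model.predict(features)[0]
        st.write("Computer-generated" if label == 1 else "Original (human-written)")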

Access the deployed web app here:
Fake Review Detection Web App


Docker Image

We provide a Docker image for easy deployment of the project. The image includes all necessary code and dependencies.

To download and run the Docker image:

  1. Pull the Docker Image:
    docker pull malhar2460/fake_review_detection:latest
    
  2. Run the Docker Container:
    docker run -p 8501:8501 malhar2460/fake_review_detection:latest
    

Your application will then be accessible at http://localhost:8501.

For more details, refer to our Docker Hub repository: Docker Hub: malhar2460/fake_review_detection

Conclusion

Our project integrates advanced NLP techniques, comprehensive feature engineering, rigorous experimentation, and robust MLflow-based tracking to build a reproducible system for fake review detection. This documentation provides an end-to-end overview of our process, from data preprocessing and model training to deployment. We encourage you to explore the code repository and MLflow experiment dashboard for more details.

Thank you for your interest in our project!
