This project aims to automate customer support ticket classification for a financial company using NLP techniques. The objective is to classify unstructured customer complaints into predefined categories to streamline the support process. The categories include:
- Credit card/Prepaid card
- Bank account services
- Theft/Dispute reporting
- Mortgages/loans
- Others
The notebook follows a structured workflow, encompassing the following major stages:
- Data Reading and Understanding
- Data Cleaning
- Text Preprocessing
- Exploratory Data Analysis (EDA)
- Feature Extraction
- Topic Modeling
- Model Building
- Best Model
- Model Inference
The data is provided in JSON format and is loaded into a Pandas DataFrame for further processing.
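
As a rough illustration of this loading step, the sketch below flattens a small nested JSON sample with `pd.json_normalize`. The field names (`_source`, `complaint_what_happened`, `product`) and the sample records are assumptions made for the example; the actual schema of the project's data file may differ.

```python
import json

import pandas as pd

# Hypothetical sample mimicking a nested complaints export.
# The real file's field names may differ.
sample = """[
  {"_id": "1", "_source": {"complaint_what_happened": "My card was charged twice.",
                           "product": "Credit card"}},
  {"_id": "2", "_source": {"complaint_what_happened": "",
                           "product": "Mortgage"}}
]"""

records = json.loads(sample)

# Flatten nested keys into dotted column names, e.g. "_source.product".
df = pd.json_normalize(records)

# Shorten the flattened names so "_source.product" becomes "product".
df.columns = [c.replace("_source.", "") for c in df.columns]

# Drop rows whose complaint text is empty, since they cannot be classified.
has_text = df["complaint_what_happened"].str.strip() != ""
df = df[has_text].reset_index(drop=True)

print(df.shape)
```

For a file on disk, the same approach would apply after reading the file contents with `json.load`.
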
| Topic | Count | Label | Top Words |
|---|---|---|---|
| 0 | 5787 | Bank account services | account, check, bank, chase, money, deposit, fund, close, tell, branch, open, say, chase bank, day, transaction, checking, transfer, claim, close account, number |
| 1 | 6134 | Credit card / Prepaid card | card, charge, credit card, credit, chase, dispute, purchase, use, receive, merchant, chase credit card, transaction, refund, service, chase credit card, company, fraud, tell, say, pay |
| 2 | 4132 | Mortgages / Loans | loan, mortgage, chase, home, modification, property, letter, send, document, year, foreclosure, request, time, sale, receive, house, rate, tell, pay, loan modification |
| 3 | 2535 | Theft / Dispute reporting | report, credit, inquiry, credit report, hard, remove, hard inquiry, inquiry credit, account, inquiry credit report, bureau, debt, credit bureau, reporting, score, card, information, hard inquiry credit, hard inquiry credit report, identity |
| 4 | 2564 | Others | payment, pay, late, make, fee, balance, make payment, month, late payment, late fee, statement, payment make, account, monthly, chase, credit, payment chase, time, day, auto |
The following models were trained to classify the customer complaints:
- Logistic Regression
- Random Forest
- Naive Bayes
- Decision Tree
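
One way to train these four classifiers on complaint text is to wrap each in a scikit-learn pipeline with a TF-IDF vectorizer. The sketch below assumes that setup; the toy texts, the hyperparameters, and the variable names are placeholders rather than the notebook's actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative training set; labels reuse the project's category names.
texts = [
    "unauthorized charge on my credit card",
    "credit card dispute with merchant refund",
    "bank account closed without notice",
    "deposit missing from checking account",
    "mortgage loan modification denied",
    "home loan foreclosure letter received",
]
labels = (
    ["Credit card/Prepaid card"] * 2
    + ["Bank account services"] * 2
    + ["Mortgages/loans"] * 2
)

# Hyperparameters here are assumptions, not the notebook's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=40),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=40),
}

# Fit one TF-IDF + classifier pipeline per model.
fitted = {}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    fitted[name] = pipe.fit(texts, labels)

# Inference on an unseen complaint.
query = ["there is a late fee on my mortgage loan"]
for name, pipe in fitted.items():
    print(name, "->", pipe.predict(query)[0])
```

Keeping the vectorizer inside the pipeline means the same text transformation is applied at training and inference time.
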
Logistic Regression outperformed the other models on accuracy, precision, recall, F1-score, and ROC AUC. Ranked by F1-score, the models are:
- Logistic Regression - Best performing model, with the highest accuracy (0.9618), precision (0.9624), recall (0.9618), and F1-score (0.9617).
- Random Forest - Second best, but roughly 8 percentage points behind Logistic Regression on accuracy (0.8849) and F1-score (0.8831).
- Naive Bayes - Moderate performance (F1-score 0.8685), with precision (0.8712) and recall (0.8705) closely balanced.
- Decision Tree - Lowest performance (F1-score 0.8203, ROC AUC 0.8831); prone to overfitting, which reduces generalization.
| Model | Accuracy | Precision | Recall | F1-score* | ROC AUC | Confidence |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.9618 | 0.9624 | 0.9618 | 0.9617 | 0.9987 | 0.8623 |
| Random Forest | 0.8849 | 0.8894 | 0.8849 | 0.8831 | 0.9899 | 0.6782 |
| Naive Bayes | 0.8705 | 0.8712 | 0.8705 | 0.8685 | 0.9830 | 0.7967 |
| Decision Tree | 0.8204 | 0.8204 | 0.8204 | 0.8203 | 0.8831 | 1.0000 |
*Rows are sorted by F1-score in descending order.
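
The sketch below shows one way such metrics could be computed for a multi-class classifier with scikit-learn. It assumes weighted averaging for precision, recall, and F1-score, which would be consistent with the Accuracy and Recall columns being identical, because weighted recall always equals accuracy. It also assumes that "Confidence" means the average of the highest predicted class probability; the source does not define that column. The labels and probabilities are made up for the example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Made-up true labels and predicted class probabilities for 3 classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
    [0.1, 0.8, 0.1],
])
y_pred = proba.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)

# Assumption: class-frequency-weighted averages across classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

# One-vs-rest ROC AUC computed from the probability matrix.
roc_auc = roc_auc_score(y_true, proba, multi_class="ovr", average="weighted")

# Assumption: "confidence" as the mean of the top predicted probability.
confidence = proba.max(axis=1).mean()

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} roc_auc={roc_auc:.4f} confidence={confidence:.4f}")
```

Under this reading, a confidence of 1.0000 for the Decision Tree would be expected from an unpruned tree, whose leaves usually predict a single class with probability 1, so the value says little about how well the model generalizes.
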
The Logistic Regression model was selected as the final model because of its high accuracy (0.9618) and strong generalization. Future work could include hyperparameter tuning and deep learning approaches.
Install the required dependencies using:
pip install -r requirements.txt
