Analysis and value prediction for the Jeopardy dataset
Data source: https://www.kaggle.com/tunguz/200000-jeopardy-questions
- Go to the base directory and locate `requirements.txt`.
- Run the command: `pip install -r requirements.txt`
- Read through `notebooks/eda.ipynb` for the feature engineering and transformations.
- Change directory to `src`: `cd src`
- Clean and transform the data by running the script `clean_transform_data.py` with the appropriate arguments: `python3 clean_transform_data.py <input_csv_file> <out_csv_file>`
- The `Air Date` feature is encoded using binary encoding, with 01/01/2000 as the breakpoint. The reason for this is theorized and verified in `notebooks/eda.ipynb`.
- The text features, i.e. `Category`, `Question`, and `Answer`, are cleaned of punctuation and stopwords.
- Design matrix brief: the design matrix (the final feature matrix) is generated by concatenating the encoded `Air Date` and `Round` features to the appropriate text vectors; a sketch of these steps follows this list.
After cleaning the data, we can move on to training the models. We tried three different models, each improving on the error of the previous one.
Train a baseline linear regression model by following the steps below:
- Move to the `src` directory: `cd src`
- Train the linear regression: `python3.8 train_linear_regression.py <input_filepath>`
- We were able to reduce the RMSE to `332.76789076927827` and `806.6202793687579` for the training and test data respectively.
Now that we have our baseline, we move on to more complex models. The gap between the training and test errors reported above tells us the model is overfitting. We will try to mitigate this in our pursuit of the best model.
Train a random forest model by following the steps below:
- Move to the `src` directory: `cd src`
- Train the random forest: `python3.8 train_random_forest.py <input_filepath>`
- We were able to reduce the RMSE to `526.7433033098621` and `538.565862801748` for the training and test data respectively.
As we can see, there is a definite improvement on the test set over linear regression. The model isn't overfitting, but can we reduce the error further? We will try fine-tuning a Hugging Face pretrained transformer in the next step.
Fine-tune BERT by following the steps below:
- Move to the `src` directory: `cd src`
- Fine-tune BERT: `python3.8 finetune.py <input_filepath> --epochs <num_epochs> --data_frac <fraction of data>`
- I used 5 epochs and a data fraction of 1.0 (the complete dataset).
- We were able to reduce the RMSE to `46.65` and `46.16` for the training and test data respectively.
We see a big improvement with BERT. However, it is a very large model and requires significant resources to train.