Coronavirus Tweets NLP Text Classification

Contents

  • Libraries Used
  • Context
  • Content
  • Approaches Used
  • Accuracy Comparison

Libraries Used

  • Pandas
  • TensorFlow 2.x
  • NumPy
  • Plotly

CONTEXT
This dataset contains the Tweets of users who have applied the following hashtags: #coronavirus, #coronavirusoutbreak, #coronavirusPandemic, #covid19, #covid_19

From about 17 March, the dataset also included the following additional hashtags: #epitwitter, #ihavecorona

This is the first dataset in the series, as the Data tab only displays 20 files at a time and I have been uploading files with a single day's worth of data. To ensure that all files remain visible to users and no file becomes too large, it seems prudent to create a second dataset and split the files into manageable groups of approximately half a month. This dataset also contains a file that matches each country with its country_code, which may be useful for users.

CONTENT

The dataset contains variables associated with Twitter: the text of various tweets and the accounts that tweeted them, the hashtags used and the locations of the accounts.

Note that due to the large volume of Tweets, there may be gaps for some hashtags (not all Tweets with a given hashtag may be captured). Because some hashtags are used less frequently than others, the less frequently used hashtags may span a longer period of time (going back earlier) than the more frequently used ones. The hashtag "#coronavirus" seems to be the most frequently used: despite scraping 500,000 Tweets, there was no overlap between Tweets with this hashtag in version 1 and version 5, so gaps remain.

The retweets argument has been set to FALSE, so this dataset does not include retweets (although a count of retweets is provided as a variable).


Approaches Used

  • Embedding
    Embedding is probably the most basic approach to classifying text. Wikipedia defines word embedding as:

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.

I think of it as assigning various properties to a word. The actual properties the model learns are not directly interpretable, but just for the sake of explanation, let's define 3 categories, e.g., cute, funny and toxic. Then, to each word, we assign a vector with 3 values which tell us how cute, funny and toxic that word is. So, for example, "mask" can be [0.5, 0.3, 0.002], "covid" might be [0.001, 0.02, 0.98] and "distancing" could be [0.3, 0.45, 0.0001].
The result is that similar words end up with similar vectors, while words with opposite meanings end up with wildly different ones.
If a sentence mostly consists of funny/cute words, it is classified as good; otherwise, as bad. A minimal code sketch of this approach is included after this list.
Link to the code with the explanation of each line is here

  • 1-D Convolution
    1-D Convolution builds on the embedding matrix: a filter is passed over the embedding matrix of the entire sentence, extracting the essential features from the larger embedding matrix and condensing them into a smaller matrix that is faster to train on and often more accurate. A minimal sketch of this approach is included after this list.
    Detailed code here

  • Bi-Directional LSTM
    Bi-Directional LSTMs are essentially advanced RNNs in which the "memory" of both the previous words and the upcoming words is retained and passed on to the recurrent network. This provides additional context (compared to a normal RNN) and results in more accurate classification than regular RNNs, bidirectional RNNs, GRUs, plain embeddings and 1-D Convolutions. A minimal sketch of this approach is included after this list.
    Link to the code is here

  • DistilBERT (using ktrain)
    BERT stands for Bidirectional Encoder Representations from Transformers.
    DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
    This is run through ktrain, a lightweight wrapper that helps build, train, and deploy neural networks and other machine learning models (DistilBERT in this case). A minimal sketch of this approach is included after this list.
    Link to the code is here
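
To make the embedding approach concrete, here is a minimal TensorFlow 2.x sketch of an embedding-based classifier. The vocabulary size, embedding dimension, layer sizes and the five sentiment classes are assumptions for illustration, not the repository's exact settings; the tweets are assumed to be already tokenized and padded into train_padded / val_padded with integer labels.

```python
# Minimal sketch of the plain-embedding classifier (assumed hyperparameters,
# not the repository's exact code).
import tensorflow as tf

VOCAB_SIZE = 10_000   # assumed vocabulary size
EMBED_DIM = 16        # size of the learned word vectors
NUM_CLASSES = 5       # assumed number of sentiment labels

embedding_model = tf.keras.Sequential([
    # Map each word id to a trainable EMBED_DIM-dimensional vector
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Average the word vectors so every tweet becomes one fixed-size vector
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
embedding_model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
# embedding_model.fit(train_padded, train_labels,
#                     validation_data=(val_padded, val_labels), epochs=10)
```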
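
For the 1-D Convolution approach, the sketch below adds a Conv1D layer on top of the same kind of embedding layer: 128 filters of width 5 slide over the sequence of word vectors, and global max-pooling keeps the strongest response of each filter. All hyperparameters are again illustrative assumptions.

```python
# Minimal sketch of the 1-D Convolution classifier (assumed hyperparameters).
import tensorflow as tf

VOCAB_SIZE = 10_000
EMBED_DIM = 16
NUM_CLASSES = 5

conv_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # 128 filters, each scanning a window of 5 consecutive word vectors
    tf.keras.layers.Conv1D(128, 5, activation="relu"),
    # Keep only the strongest response of each filter across the tweet
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
conv_model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```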
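
For the Bi-Directional LSTM, a sketch along the same lines: one LSTM reads the tweet left-to-right and another right-to-left, and their outputs are concatenated before classification. Hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the Bi-Directional LSTM classifier (assumed hyperparameters).
import tensorflow as tf

VOCAB_SIZE = 10_000
EMBED_DIM = 16
NUM_CLASSES = 5

bilstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # One LSTM reads the sequence forwards, the other backwards; their
    # outputs are concatenated, so context from both directions is retained
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
bilstm_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
```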
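
For DistilBERT via ktrain, the usual ktrain Transformer workflow looks roughly like the sketch below. The class names, maxlen, batch size, learning rate, epoch count and the tiny placeholder data are assumptions for illustration; in the project the texts and labels come from the tweet CSVs.

```python
# Minimal sketch of DistilBERT fine-tuning with ktrain (assumed data and
# hyperparameters).
import ktrain
from ktrain import text

# Assumed sentiment labels and tiny placeholder data; labels are integer
# indices into class_names.
class_names = ["Extremely Negative", "Negative", "Neutral",
               "Positive", "Extremely Positive"]
x_train = ["Stay home and stay safe #covid19", "The lockdown is unbearable"]
y_train = [3, 1]
x_val = ["Masks are sold out everywhere"]
y_val = [2]

t = text.Transformer("distilbert-base-uncased", maxlen=64, class_names=class_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
learner.fit_onecycle(5e-5, 3)   # one-cycle learning-rate schedule for 3 epochs

# predictor = ktrain.get_predictor(learner.model, preproc=t)
# predictor.predict("Stay home and stay safe #covid19")
```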

Accuracy Comparison

(Chart: Accuracy Comparison)

(Chart: Accuracy on Validation Data)
