A year ago I wrote a paper for a Neural Networks class that I never got around to publishing. I decided to take a small break from my usual hacking posts to talk a bit about Machine Learning. This paper was a continuation of some previous work of mine (outlined in this past post) on Sentiment Analysis of Twitter data. (I recommend taking a look at that post if you are new to Neural Networks.)
This is a shorter version of the research paper I wrote, so feel free to check that out if you want to go into more detail. Also, if you only care about the implementation, check out my GitHub project.
Twitter is now a platform that hosts about 350 million active users, who post around 500 million tweets per day! It has become a direct link between companies/organizations and their customers, and as such it is being used to build branding, understand customer demands, and communicate better with them. From a data scientist's point of view, Twitter is a gold mine that can be used, among a million other interesting things, for gauging customer sentiment towards a brand.
My personal stake in this project stemmed from my curiosity to better understand Neural Networks, particularly CNNs and LSTMs. In a previous class, I had created simple Feed-Forward Neural Networks to solve this very problem. However, I knew that my results could be substantially better when harnessing the power of more specialized networks.
Furthermore, I wanted to do something I hadn't really seen other people do and I was curious about the results of combining these two networks.
Intuition: Why CNNs and LSTMs?
Before starting, let's give a brief introduction to these networks along with a short analysis of why I thought they would benefit my sentiment analysis task.
Convolutional Neural Networks (CNNs) are networks initially created for image-related tasks that can learn to capture specific features regardless of locality.
For a more concrete example of that, imagine we use CNNs to distinguish pictures of Cars vs. pictures of Dogs. Since CNNs learn to capture features regardless of where these might be, the CNN will learn that cars have wheels, and every time it sees a wheel, regardless of where it is on the picture, that feature will activate.
In our particular case, it could capture a negative phrase such as "don't like" regardless of where it happens in the tweet.
- I don't like watching those types of films
- That's the one thing I really don't like.
- I saw the movie, and I don't like how it ended.
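To make the "regardless of where it happens" idea concrete, here is a toy sketch of a single convolution filter followed by max-over-time pooling. The two-dimensional embeddings and the filter weights are hand-picked assumptions for illustration; a real model would learn both from data.

```python
# Toy illustration only: the embedding values and filter weights below are
# hand-picked assumptions, not learned parameters from the model in this post.
EMBED = {"don't": (1.0, 0.0), "like": (0.0, 1.0)}  # every other word -> (0, 0)

# One convolution filter spanning two consecutive tokens (width 2).
# It responds most strongly to "don't" immediately followed by "like".
FILTER = ((1.0, 0.0), (0.0, 1.0))


def embed(token):
    """Look up a token's vector, ignoring case and trailing punctuation."""
    return EMBED.get(token.lower().strip(".,!?"), (0.0, 0.0))


def conv1d(tokens):
    """Slide the width-2 filter over the tweet; one activation per position."""
    vectors = [embed(t) for t in tokens]
    activations = []
    for i in range(len(vectors) - 1):
        window = vectors[i:i + 2]
        score = sum(
            w * x
            for filter_row, vector in zip(FILTER, window)
            for w, x in zip(filter_row, vector)
        )
        activations.append(max(0.0, score))  # ReLU
    return activations


def max_over_time(activations):
    """Max pooling keeps the strongest response and discards its position."""
    return max(activations) if activations else 0.0


tweets = [
    "I don't like watching those types of films",
    "That's the one thing I really don't like.",
    "I saw the movie, and I don't like how it ended.",
]
pooled = [max_over_time(conv1d(t.split())) for t in tweets]
```

The filter fires at a different position in each tweet, but after pooling all three tweets end up with the same feature value, which is the position-independence described above.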
Long Short-Term Memory networks (LSTMs) are a type of network with a memory that "remembers" previous data from the input and makes decisions based on that knowledge. These networks are more directly suited to written data, since the meaning of each word in a sentence depends on the surrounding words (previous and upcoming words).
In our particular case, it is possible that an LSTM could allow us to capture changing sentiment in a tweet. For example, a sentence such as "At first I loved it, but then I ended up hating it." has words with conflicting sentiments that would end up confusing a simple Feed-Forward network. The LSTM, on the other hand, could learn that sentiments expressed towards the end of a sentence mean more than those expressed at the start.
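As a rough sketch of that intuition, here is a single scalar LSTM cell written out by hand. The parameters are hand-picked assumptions (every gate sits at 0.5) and the inputs are hypothetical per-word sentiment scores; a trained network would learn its own weights and work on word embeddings instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, p):
    """One LSTM step with scalar input x, hidden state h, and cell state c."""
    i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate memory
    c = f * c + i * g      # keep part of the old memory, add part of the new
    h = o * math.tanh(c)   # expose part of the memory as the output
    return h, c


def run_lstm(inputs, p):
    """Feed a sequence through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, p)
    return h


# Hand-picked (not learned) parameters: all gate weights and biases are zero,
# so every gate outputs 0.5; only the candidate memory looks at the input.
PARAMS = dict.fromkeys(
    ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"), 0.0
)
PARAMS["wg"] = 2.0

# Hypothetical per-word sentiment scores: +1 for "loved", -1 for "hating".
loved_then_hated = run_lstm([1.0, -1.0], PARAMS)
hated_then_loved = run_lstm([-1.0, 1.0], PARAMS)
```

Because the forget gate shrinks the earlier memory before the new word is added, the final state leans towards whichever sentiment came last: negative for "loved, then hated" and positive for the reverse order.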
The Twitter data used for this particular experiment was a mix of two datasets:
- The University of Michigan Kaggle competition dataset.
- The Niek Sanders Twitter Sentiment Analysis corpus.
In total these datasets contain 1,578,627 labeled tweets.
The first model I tried was the CNN-LSTM Model. Our CNN-LSTM model combination consists of an initial convolution layer which will receive word embeddings as input. Its output will then be pooled to a smaller dimension which is then fed into an LSTM layer. The intuition behind this model is that the convolution layer will extract local features and the LSTM layer will then be able to use the ordering of said features to learn about the input’s text ordering. In practice, this model is not as powerful as our other proposed model, the LSTM-CNN.
Our LSTM-CNN model consists of an initial LSTM layer which will receive word embeddings for each token in the tweet as inputs. The intuition is that its output tokens will store information not only about the initial token, but also about any previous tokens; in other words, the LSTM layer is generating a new encoding for the original input. The output of the LSTM layer is then fed into a convolution layer which we expect will extract local features. Finally, the convolution layer’s output will be pooled to a smaller dimension and ultimately output as either a positive or negative label.
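To compare the two layer orderings side by side, the sketch below traces the shape of the data as it flows through each model. All sizes (tweet length, embedding size, number of filters, kernel width, pool size, LSTM units) are hypothetical placeholders rather than the settings used in the experiments, and the helper functions only compute shapes; they do not implement the layers themselves.

```python
def conv1d_shape(shape, filters, kernel_size):
    """Unpadded 1-D convolution: (steps, channels) -> (steps - k + 1, filters)."""
    steps, _ = shape
    return (steps - kernel_size + 1, filters)


def max_pool_shape(shape, pool_size):
    """Max pooling over time shrinks the number of steps, not the channels."""
    steps, channels = shape
    return (steps // pool_size, channels)


def lstm_shape(shape, units, return_sequences):
    """An LSTM emits one vector per step, or only the final one."""
    steps, _ = shape
    return (steps, units) if return_sequences else (units,)


def cnn_lstm_shapes(tokens, embed_dim, filters, kernel_size, pool_size, units):
    """Convolution first, then pooling, then an LSTM that reads the features."""
    trace = [("embeddings", (tokens, embed_dim))]
    trace.append(("conv", conv1d_shape(trace[-1][1], filters, kernel_size)))
    trace.append(("max_pool", max_pool_shape(trace[-1][1], pool_size)))
    trace.append(("lstm", lstm_shape(trace[-1][1], units, return_sequences=False)))
    trace.append(("label", (1,)))
    return trace


def lstm_cnn_shapes(tokens, embed_dim, filters, kernel_size, pool_size, units):
    """LSTM first (one output per token), then convolution, then pooling."""
    trace = [("embeddings", (tokens, embed_dim))]
    trace.append(("lstm", lstm_shape(trace[-1][1], units, return_sequences=True)))
    trace.append(("conv", conv1d_shape(trace[-1][1], filters, kernel_size)))
    trace.append(("max_pool", max_pool_shape(trace[-1][1], pool_size)))
    trace.append(("label", (1,)))
    return trace


# Hypothetical sizes, chosen only to make the trace readable.
SIZES = dict(tokens=40, embed_dim=32, filters=64, kernel_size=3, pool_size=2, units=128)
cnn_lstm = cnn_lstm_shapes(**SIZES)
lstm_cnn = lstm_cnn_shapes(**SIZES)
```

The key difference shows up in the second entry of each trace: in the LSTM-CNN ordering the convolution sees one LSTM output per original token, while in the CNN-LSTM ordering the LSTM only sees the shortened, pooled feature sequence.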
We set up the experiment to use training sets of 10,000 labeled tweets and testing sets of 2,500 labeled tweets. These training and testing sets contained equal numbers of negative and positive tweets. We repeated each test 5 times and reported the average results of these tests.
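A minimal sketch of that protocol might look like the following. It assumes tweets are stored as `(text, label)` pairs with labels 0 and 1, and that every run draws a fresh balanced sample (the post does not say whether the five runs shared the same sample). `train_and_score` is a hypothetical callback that trains a model and returns its test accuracy.

```python
import random
from statistics import mean


def balanced_split(tweets, n_train, n_test, rng):
    """Draw disjoint train/test sets with equal negative and positive counts.

    `tweets` is a list of (text, label) pairs with label 0 or 1.
    """
    by_label = {0: [], 1: []}
    for text, label in tweets:
        by_label[label].append((text, label))

    half_train, half_test = n_train // 2, n_test // 2
    train, test = [], []
    for label in (0, 1):
        pool = by_label[label][:]
        if len(pool) < half_train + half_test:
            raise ValueError("not enough tweets with label %d" % label)
        rng.shuffle(pool)
        train.extend(pool[:half_train])
        test.extend(pool[half_train:half_train + half_test])

    rng.shuffle(train)  # avoid feeding all negatives before all positives
    rng.shuffle(test)
    return train, test


def average_accuracy(tweets, train_and_score, runs=5,
                     n_train=10_000, n_test=2_500, seed=0):
    """Repeat the experiment `runs` times and average the reported accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = balanced_split(tweets, n_train, n_test, rng)
        scores.append(train_and_score(train, test))
    return mean(scores)
```

Averaging over several runs smooths out the noise that comes from sampling only 12,500 tweets out of a corpus of more than 1.5 million.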
We used the following parameters which we fine-tuned through manual testing:
The actual results were as follows:
Our CNN-LSTM model achieved an accuracy 3% higher than the CNN model, but 3.2% lower than the LSTM model. Meanwhile, our LSTM-CNN model performed 8.5% better than a CNN model and 2.7% better than an LSTM model.
These results seem to indicate that our initial intuition was correct: by combining CNNs and LSTMs we can exploit both the CNN’s ability to recognize local patterns and the LSTM’s ability to make use of the text’s ordering. However, the ordering of the layers in our models plays a crucial role in how well they perform.
We believe that the 5.5% difference between our models is not coincidental. It seems that the initial convolutional layer of our CNN-LSTM is losing some of the text’s order / sequence information. Thus, if the ordering of the convolutional layer’s output does not really give us any information, the LSTM layer will act as nothing more than a fully connected layer. This model seems to fail to harness the full capabilities of the LSTM layer and thus does not achieve its maximum potential. In fact, it even does worse than a regular LSTM model.
On the other hand, the LSTM-CNN model seems to be the best because its initial LSTM layer seems to act as an encoder, such that for every token in the input there is an output token that contains information not only about the original token, but also about all previous tokens. Afterwards, the CNN layer will find local patterns using this richer representation of the original input, allowing for better accuracy.
Some of the observations I made during testing (which are explained in much more detail in the paper):
CNN & CNN-LSTM models need more epochs to learn and overfit less quickly, as opposed to LSTM & LSTM-CNN models.
This wasn't so much of a surprise, but I did notice that it is very important to add a Dropout layer after any Convolutional layer in both the CNN-LSTM and LSTM-CNN models.
(Note: in this case Dropout Prob. is how likely we are to drop a random input.)
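For readers who have not used dropout before, here is a small sketch of what such a layer does during training. It uses the common "inverted dropout" formulation, in which the surviving inputs are scaled up so that the expected total stays the same; that scaling is an assumption about the variant, not a detail taken from the paper.

```python
import random


def dropout(values, drop_prob, rng, training=True):
    """Zero each input with probability `drop_prob` (inverted dropout).

    Surviving inputs are divided by the keep probability so the expected
    value of each output matches its input. At test time nothing is dropped.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(values)
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() >= drop_prob else 0.0 for v in values]


rng = random.Random(0)
features = [1.0] * 10_000          # stand-in for a convolution layer's output
dropped = dropout(features, 0.5, rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
```

With a drop probability of 0.5, roughly half of the features are zeroed on every training step, so the layers that follow cannot rely on any single convolution feature.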
Pre-Trained Word Embeddings
I attempted to use pre-trained word embeddings, as opposed to having the system learn the word embeddings from our data. Surprisingly, using these pre-trained GloVe word embeddings gave us worse accuracy. I believe this might be due to the fact that Twitter data contains many misspellings, emojis, mentions, and other Twitter-specific text irregularities that weren't taken into consideration when building the GloVe embeddings.
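One way to probe this hypothesis (not something measured in the paper) would be to check what fraction of tweet tokens has no entry in the pre-trained vocabulary. The sketch below uses a naive whitespace tokenizer, and the example tweets and tiny vocabulary are made up for illustration; in practice the vocabulary would be the set of words in the embedding file.

```python
def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def oov_rate(tweets, vocabulary):
    """Fraction of tokens that have no pre-trained embedding."""
    tokens = [tok for tweet in tweets for tok in tokenize(tweet)]
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocabulary)
    return missing / len(tokens)


# Made-up examples: an elongated word, a mention, shorthand, and a hashtag.
example_tweets = ["loooove this movie @bob", "gr8 film #mustsee"]
example_vocabulary = {"love", "great", "this", "movie", "film"}
rate = oov_rate(example_tweets, example_vocabulary)
```

A high out-of-vocabulary rate would support the explanation above, since every missing token has to fall back to a default vector that carries no sentiment information.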
Conclusions & Future Work
In terms of future work, I would like to test other types of LSTMs (for example Bi-LSTMs) and see what effect this has on the accuracy of our systems. It would also be interesting to find a better way to deal with misspellings or other irregularities found in Twitter language. I believe this could be achieved by building Twitter-specific word embeddings. Lastly, it would be interesting to make use of Twitter-specific features, such as the number of retweets, likes, etc., and feed them in along with the text data.
On a personal note, this project was mainly intended as an excuse to further understand CNN and LSTM models, along with experimenting with TensorFlow. Moreover, I was happy to see that these two models did much better than our previous (naive) attempts.