Mine my opinion? Part 4. Automatization!

Now, let’s automate sentiment analysis! In our previous article, we discussed rule-based methods and how to implement them; here we’ll show you a simpler and quicker way. Automatic methods don’t need any handcrafted rules: at their core are machine learning techniques. With them, you can process a large volume of new data without delay and without manual adjustment. Compare that with rule-based systems, where any new data may force you to add or change rules, update sentiment libraries and build ever more complex systems, all of which costs time. Such systems quickly become monstrously complex and end up containing a rule for every word variant. That really isn’t something to be proud of. Use your time smarter. Although, to be fair, rule-based systems that are correctly and frequently updated are less prone to errors.

On the other hand, automatic systems are faster and much simpler. They don’t need much maintenance, and they can take into account how words are combined in a sentence. If you have a lot of new data, you can simply rerun everything. These methods usually treat sentiment analysis as a classification problem: you feed the classifier a new text and it returns the most likely category, such as positive, negative or neutral in the case of polarity analysis. In this article, we will show you how to create and use your own automatic sentiment classifier.

The basic procedure for creating and using an automated sentiment analyzer

Extract features from the text - transform the text into its numeric representation.
Train and test a sentiment classifier - select and tune a machine learning model.
Predict sentiment with your model.

Before we start coding, let’s recall what we already know about feature extraction and text classification. Every machine learning model needs numbers as inputs, so feature extraction, or text vectorization, transforms a text into a vector. Usually, each vector component represents the frequency of a word or expression from a predefined dictionary (e.g. a lexicon of polarized words or just a lexicon of the most used words in the text). This classical approach is called bag-of-words or bag-of-ngrams, and it is the simplest one to use.
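To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. The two sentences and the helper name bag_of_words are made up for illustration; a real project would more likely rely on a library vectorizer.

```python
from collections import Counter

# Toy corpus; the sentences are invented for this illustration.
docs = [
    "good coffee and good price",
    "bad coffee",
]

# The "predefined dictionary": every distinct word in the corpus, sorted for a stable order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['and', 'bad', 'coffee', 'good', 'price']
for doc, vector in zip(docs, vectors):
    print(vector, doc)
```

Each text becomes a vector of the same length as the dictionary, and word order is thrown away. That lost order is what n-grams partly bring back.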

An n-gram is a contiguous sequence of n words or letters from a given sample of text.

Here is an example of 3-grams built with the function ngrams() from nltk:

from nltk import ngrams

sentence = 'Kako je Krtek dobio hlače'
threegrams = ngrams(sentence.split(), 3)
list(threegrams)

Output:

[('Kako', 'je', 'Krtek'),
('je', 'Krtek', 'dobio'),
('Krtek', 'dobio', 'hlače')]
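The example above works on words. Since n-grams can also be built from letters, here is a small sketch of character 3-grams in plain Python. The helper char_ngrams is our own name for illustration, not an nltk function.

```python
def char_ngrams(text, n):
    """Return all contiguous n-character slices of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = char_ngrams("Krtek", 3)
print(trigrams)  # ['Krt', 'rte', 'tek']
```

A text shorter than n simply produces an empty list.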

Other text vectorization techniques are based on word embeddings, i.e. word vectors, where words of similar meaning have similar vector representations (more details in our next article). Once you have transformed your text into numbers, the second step is classification. Classification involves one or more statistical models such as Naïve Bayes, Logistic Regression, Support Vector Machines or Neural Networks. Always try a few of them and tune their parameters as well as you can.

But enough about theory, let’s start coding!

How to create an automated sentiment analyzer?

Everything always begins with data. As usual, get as much annotated data as you can. Here, we will use the same dataset as last time to have comparable results (see rule-based opinion miner): 20,000 Amazon Fine Food reviews scored from 1 to 5. Again, we will simplify things by annotating them as positive (1), neutral (0) and negative (-1):

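import pandas as pd

# Load the reviews and keep a random sample of 20,000
data = pd.read_csv("./Reviews.csv")
data = data.sample(frac=1)[:20000]

# Lowercase the column names and merge summary and review body into one text field
data.columns = data.columns.str.lower()
# fillna("") keeps reviews with a missing summary from turning into NaN
data["text"] = data["summary"].fillna("") + " " + data["text"]
data = data[["text", "score"]]

# Map 1-2 stars to negative (-1), 3 stars to neutral (0), 4-5 stars to positive (1)
data.loc[data.score < 3, "score"] = -1
data.loc[data.score == 3, "score"] = 0
data.loc[data.score > 3, "score"] = 1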
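The same mapping can be written as one small function, which makes the three cases explicit. This is only a sketch: score_to_polarity is our own name and the star ratings below are made up. With pandas it could presumably be applied as data["score"].map(score_to_polarity).

```python
from collections import Counter

def score_to_polarity(score):
    """Map a 1-5 star score to -1 (negative), 0 (neutral) or 1 (positive)."""
    if score < 3:
        return -1
    if score == 3:
        return 0
    return 1

# Made-up star ratings, only to illustrate the mapping.
example_scores = [5, 4, 1, 3, 5, 2]
polarities = [score_to_polarity(s) for s in example_scores]
class_counts = Counter(polarities)

print(polarities)    # [1, 1, -1, 0, 1, -1]
print(class_counts)  # how many examples fall into each class
```

Counting the classes like this is a cheap way to see how balanced the annotated data is before training.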

Then split the dataset into a training part (80%) and a testing part (20%):

import random

# zip() returns an iterator in Python 3, so build a list that can be shuffled and sliced
sentiment_data = list(zip(data["text"], data["score"]))
random.shuffle(sentiment_data)

# 80% for training
train_X, train_y = zip(*sentiment_data[:16000])

# Keep 20% for testing
test_X, test_y = zip(*sentiment_data[16000:])
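If you want the same split on every run, the shuffle can be seeded. Below is a minimal sketch with a hypothetical helper split_dataset and ten made-up examples standing in for the reviews. scikit-learn's train_test_split from sklearn.model_selection is a ready-made alternative.

```python
import random

def split_dataset(pairs, train_fraction=0.8, seed=42):
    """Shuffle (text, label) pairs reproducibly and split them into train and test lists."""
    pairs = list(pairs)                 # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(pairs)  # private generator with a fixed seed
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]

# Ten made-up examples stand in for the reviews.
examples = [("review %d" % i, i % 3 - 1) for i in range(10)]
train, test = split_dataset(examples)

demo_train_X, demo_train_y = zip(*train)
demo_test_X, demo_test_y = zip(*test)
print(len(demo_train_X), len(demo_test_X))  # 8 2
```

Calling split_dataset again with the same seed returns the same split, which makes accuracy numbers comparable between experiments.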

Great! Now we are ready to create a classifier. You’ll be quite surprised by how short the training/testing code is. We’ll use the nltk package for text preprocessing and scikit-learn for machine learning. scikit-learn has an amazing class called Pipeline which assembles all the steps that we’ll apply to our input data: CountVectorizer() as the vectorizer with word n-grams and LinearSVC() as the classifier.

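# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Step 1: count word unigrams and bigrams, keeping the 10,000 most frequent
# Step 2: classify the resulting vectors with a linear SVM
clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word", ngram_range=(1, 2),
                                   tokenizer=word_tokenize, max_features=10000)),
    ('classifier', LinearSVC()),
])

# Train on the training part, then report accuracy on the held-out test part
clf.fit(train_X, train_y)
clf.score(test_X, test_y)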

Output:

0.86199999999999999

Short, right? And really great accuracy! If you remember, the rule-based approach gave us an accuracy of 65%, and the code was way longer and much more complex (check out our previous article). Here, we didn’t use any preprocessing, such as stemming, removing non-alphabetic characters or stop words, or marking negations (CountVectorizer only lowercases the text by default), and because of that our list of features is dreadful! You can check it out with get_feature_names():

# In scikit-learn 1.0 and later, use get_feature_names_out() instead
clf.named_steps['vectorizer'].get_feature_names()

Output:

[u'kidding',
u'gives you',
u'other treats',
u'kids',
u'recall',
u'where the',
u"'s in",
u'the cheese',
u'years ago',
u'break',
u'lasted',
u'! ?',
u'got these',
u'options',
u"'s just",
u'purchased the',
u'says',
u'tangy',
u'can use',
u'taste good']

Nevertheless, accuracy is great thanks to our linear SVC model, which managed on its own to link these horrible features to our categories. If you’d like, you can spend some time on this problem and apply some of the preprocessing techniques mentioned above or tune the parameters to get better results, but we will move on and show you how to use these systems.
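If you do want to experiment with preprocessing, here is a rough sketch of a few of the techniques mentioned: lowercasing, dropping non-alphabetic characters and marking the word that follows a negation. The function name, the negation list and the not_ prefix are our own assumptions. We have not measured whether it improves the score on this dataset.

```python
import re

# Small, hypothetical list of negation words; extend it as needed.
NEGATIONS = {"not", "no", "never", "cannot"}

def preprocess(text):
    """Lowercase, drop non-alphabetic characters, and mark the word after a negation."""
    tokens = []
    negate = False
    for raw in text.lower().split():
        token = re.sub(r"[^a-z]", "", raw)  # keep letters only
        if not token:
            continue                        # the token was pure punctuation or digits
        tokens.append("not_" + token if negate else token)
        # Flag only the next token; contractions such as "don't" count as negations.
        negate = token in NEGATIONS or "n't" in raw or "n’t" in raw
    return " ".join(tokens)

print(preprocess("I don't like this coffee!!"))  # i dont not_like this coffee
print(preprocess("Not bad at all"))              # not not_bad at all
```

A function like this can be handed to CountVectorizer through its preprocessor argument, so the rest of the pipeline stays unchanged.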

How to use an automated sentiment analyzer?

It’s as simple as can be: just call the predict() function of the model you created:

clf.predict(["I really like this tutorial!!", "I can't wait to learn something more."])

Output:

array([1, 1])
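The model answers with the numeric labels we trained it on. A tiny helper can translate them back into words. This is a sketch: describe and LABEL_NAMES are our own names, and the plain list below stands in for the array returned by predict().

```python
# Mirrors the annotation used for training: -1, 0 and 1.
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive"}

def describe(predictions):
    """Translate numeric class labels into readable names."""
    return [LABEL_NAMES[int(label)] for label in predictions]

print(describe([1, 1]))  # ['positive', 'positive']
```

With the classifier from above, this would be called as describe(clf.predict(["some text"])).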

You can even save your model and use it in other code without knowing anything about its structure; you just need to know the input and output format. The pickle module provides the functions dump() and load(), which you can use like this:

import pickle

# Save classifier
with open('./Sentiment_linearSVC.pkl', 'wb') as file_id:
    pickle.dump(clf, file_id)

# Load classifier
with open('./Sentiment_linearSVC.pkl', 'rb') as file_id:
    classifier = pickle.load(file_id)

So, automation has some great advantages, such as fast prediction and a simple procedure without exhausting and frequent maintenance. The more data you have, the better the results should be, up to a point that depends on the annotation accuracy.

Usually, a hybrid approach works best because it combines the best features of the automatic system and the rule-based one. That means tuning the automatic procedure a bit with a few rules that won’t need updating. Add a bit of preprocessing and you can have quite a good sentiment analyser! But always keep in mind that sentiment analysis is a really difficult task whose accuracy is effectively capped at around 90%, so don’t chase higher numbers. You will never have perfectly annotated data, because in around 10% of cases people cannot agree on someone’s opinion just from reading a text. You cannot know the whole context or the tone of someone’s words (more in the previous article), so it’s not surprising that a computer cannot do it either.
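One possible shape for such a hybrid is sketched below: a handful of fixed phrase rules get the first word, and everything else goes to the trained model. All names, the rules themselves and the stand-in model are hypothetical. This illustrates the idea and is not the approach we benchmarked.

```python
def hybrid_predict(text, model_predict, rules):
    """Return a rule's label if a rule phrase occurs in the text, otherwise ask the model."""
    lowered = text.lower()
    for phrase, label in rules:
        if phrase in lowered:
            return label
    return model_predict(text)

# Hypothetical fixed rules that should not need updating.
RULES = [("waste of money", -1), ("highly recommend", 1)]

def dummy_model(text):
    """Stand-in for a trained classifier; always answers neutral."""
    return 0

print(hybrid_predict("Total waste of money.", dummy_model, RULES))  # -1
print(hybrid_predict("Arrived on Tuesday.", dummy_model, RULES))    # 0
```

With the pipeline from this article, model_predict could be something like lambda t: int(clf.predict([t])[0]).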

That’s it for our sentiment analysis tutorial. We hope you enjoyed it! The previous parts of the Mine my opinion? series can be found on our blog.
