CIO

5 tools and techniques for text analytics

Data mining expert lays out useful tools and techniques, from sentiment analysis to topic modeling and natural language processing

Unstructured data is proliferating on the Internet and flowing into customer call centres. But manually combing through that haystack to find the needle is an unrealistic task.

Speaking at the recent Big Data TechCon event in Boston, data mining expert Dan Sullivan of Cambia Health Solutions discussed several tools and techniques to get you started on mining text data effectively and extracting the rich insights it can bring.

1. Sentiment analysis

Analysing the opinion or tone of what people are saying about your company on social media or through your call centre can help you respond to issues faster, see how your product and service is performing in the market, find out what customers are saying about competitors, and so on.

There are three ways of going about this kind of sentiment analysis, Sullivan said. The first is polarity analysis, where you simply identify whether the tone of a communication is positive or negative. The second is categorisation, where tools get more fine-grained and identify whether someone is confused or angry, for example. The third puts emotion on a scale, for example from ‘sad’ to ‘happy’ or from 0 to 10.

Sullivan said Affective Norms for English Words (ANEW) is useful for emotional ratings; it ranks words in terms of their pleasure, arousal and dominance. That allows communications to be characterised in more detail, such as mildly concerned or somewhat angry.
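To illustrate the idea, here is a minimal sketch of an ANEW-style scorer; the word ratings in the dictionary are illustrative placeholders, not the published ANEW values:

```python
# A toy ANEW-style scorer: average the pleasure/arousal/dominance ratings
# of the words we have ratings for. The scores below are illustrative
# placeholders, not the published ANEW values.
ANEW = {
    "love":   {"pleasure": 8.7, "arousal": 6.4, "dominance": 5.9},
    "watery": {"pleasure": 4.0, "arousal": 3.0, "dominance": 4.2},
    "angry":  {"pleasure": 2.5, "arousal": 7.2, "dominance": 5.5},
}

def score(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    rated = [ANEW[w] for w in words if w in ANEW]
    if not rated:
        return None
    # Average each emotional dimension over the rated words
    return {dim: sum(r[dim] for r in rated) / len(rated)
            for dim in ("pleasure", "arousal", "dominance")}

print(score("I am so angry about the watery coffee"))
```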

WordNet is another tool; it relates similar words to each other, through relations such as synonyms and antonyms, and allows users to build classification schemes using that semantic information to do semantic analysis.

“This is to do with semantic or concept-based classifications and was developed by linguists. It’s basically an ontology of English words,” Sullivan said.
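A small sketch of how WordNet’s relations can be explored from Python, assuming NLTK and its WordNet corpus are installed:

```python
# Exploring WordNet relations with NLTK (pip install nltk, then run
# nltk.download("wordnet") once to fetch the corpus).
from nltk.corpus import wordnet as wn

for syn in wn.synsets("angry"):
    print(syn.name(), "-", syn.definition())
    for lemma in syn.lemmas():
        # Antonyms are attached to individual lemmas, not to the synset
        for ant in lemma.antonyms():
            print("  antonym:", ant.name())
```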


Other tools to get started with include the Natural Language Toolkit (NLTK) and TextBlob in Python, both free to use. Commercial tools for sentiment analysis include RapidMiner and ViralHeat, among many others, Sullivan said.
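As a starting point, here is a minimal polarity-analysis sketch with TextBlob, one of the free Python tools mentioned above; the example sentences are invented:

```python
# Polarity analysis with TextBlob (pip install textblob).
# .sentiment.polarity ranges from -1 (negative) to 1 (positive).
from textblob import TextBlob

for text in ["I really love the new blend at Starbucks",
             "The coffee was watery"]:
    print(text, "->", TextBlob(text).sentiment.polarity)
```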

Picking up on sarcasm and irony in sentiment analysis, however, remains a challenge. “That’s a problem, especially with things like tweets and social media where people are ironic and sarcastic because that’s a way to get a message across,” Sullivan said.

“There can be an opposite sentiment, where one is very negative and the other very positive. That’s usually an indication of sarcasm. For example: ‘coffee was watery, I really love the new blend at Starbucks’.”
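A rough sketch of that opposite-polarity heuristic, again using TextBlob; real sarcasm detection is considerably harder than this:

```python
# Flag possible sarcasm when two clauses of the same message carry strongly
# opposite polarities, as in "coffee was watery, I really love the new
# blend at Starbucks". A crude heuristic, not a real sarcasm detector.
import re
from textblob import TextBlob

def maybe_sarcastic(text, threshold=0.3):
    clauses = [c for c in re.split(r"[,;.!?]", text) if c.strip()]
    polarities = [TextBlob(c).sentiment.polarity for c in clauses]
    return (min(polarities, default=0) < -threshold
            and max(polarities, default=0) > threshold)

print(maybe_sarcastic("coffee was watery, I really love the new blend at Starbucks"))
```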

Companies also need to pay attention to the context of social media posts and other forms of customer communication, Sullivan said.

“You don’t want to, for example, just capture a tweet that just says, ‘I really hate company X and their product sucks.’ Contextual information is really important to help us understand why the tone might be positive or negative,” he said. “So metadata is crucial. It’s not just the physical 140 characters you want to keep track of.

“Was the person replying to another negative tweet? Was this the original composition? What was the geographic location?”

2. Topic modeling

Topic modeling is a useful technique for identifying dominant themes in a vast array of documents and for dealing with a large corpus of text. Legal firms, for example, might have to go through millions of documents used in big litigation cases. This is where topic modeling can come in handy, Sullivan said.

There are a couple of ways to go about topic modeling. One is latent Dirichlet allocation (LDA), where words are automatically clustered into topics and each document is treated as a mixture of topics. The other is probabilistic latent semantic indexing, which models co-occurrence data using probability.

“The basic idea is we have these documents about topics. You can figure out what the topics are based on the words that are used in the document,” Sullivan explained. “Given a particular document, what is the probability that a certain topic is covered in that document? And given a certain topic, what is the likelihood that a particular word would be used about it?

“The way these algorithms work is kind of iterative. There are many iterations of taking guesses about what words were associated with what topics and the algorithms basically hone the best set of combinations of words for topics and topics for documents. It works really well.”

Sullivan used the homepage of the New York Times website on 27 April to show how topic modeling could identify that one article was about student debt, law and graduation; another was about government debt, the EU and the euro; and a third discussed Greece, political negotiation and Greek finance ministers.

Topic modeling can also put a weight on the importance of each topic in each article. For example, the first article might be 50 per cent about student debt, 30 per cent about graduation and 20 per cent about law.

Some useful tools for topic modeling include the Stanford Topic Modeling Toolbox, Mallet (from UMass Amherst), the R package ‘topicmodels’ and the Python package Gensim, Sullivan said.

“The Stanford and Mallet [tools] are Java-based. You don’t have to be a Java programmer; you can use these tools by running the command-line [versions],” he added.
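Here is a minimal sketch with the Gensim package Sullivan mentions; the three tiny ‘documents’ stand in for the New York Times articles above:

```python
# Topic modeling with latent Dirichlet allocation in Gensim
# (pip install gensim). The tiny "corpus" is only for illustration.
from gensim import corpora, models

docs = [
    "student debt law graduation students loans".split(),
    "government debt EU euro budget deficit".split(),
    "Greece negotiation Greek finance minister bailout".split(),
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = models.LdaModel(bow_corpus, num_topics=3, id2word=dictionary,
                      passes=20, random_state=1)

# Each document is a mixture of topics, with a weight per topic
for i, bow in enumerate(bow_corpus):
    print("doc", i, lda.get_document_topics(bow))
```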

One downside of topic modeling is that it’s not easily scalable, Sullivan said.

“If you are doing large document sets, one of the things you might want to do is use topic modeling for subsets or samples that have good representation of the entire set,” he said. “That might give you a sense of the topics, then you can do clustering to break them down into smaller topics and do more detailed analysis on each of the clusters. That’s one way to deal with the scalability issue of topic modeling.”

3. Term frequency–inverse document frequency (TF–IDF)

TF–IDF looks at how frequently a word appears in a document and its importance relative to the whole set of documents.

“Words that appear frequently in a lot of documents may not be very useful, like ‘the’, ‘a’. But if there are words that show up frequently in stories about the Greek debt crisis but not about something else like the elections, for example, then those are useful words to keep track of. And that’s what TF–IDF captures,” Sullivan explained.
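A small sketch of TF–IDF weighting using scikit-learn’s TfidfVectorizer, one common implementation (not a tool Sullivan named); the documents are invented:

```python
# TF-IDF weights with scikit-learn (pip install scikit-learn).
# Words that appear everywhere ("the", "and") get down-weighted;
# words concentrated in a few documents score higher.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the Greek debt crisis and the Greek finance ministry",
    "the election campaign and the election debates",
    "the Greek debt negotiations with the EU",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary

# Show the weights for the first document
terms = vectorizer.get_feature_names_out()
for term, weight in zip(terms, tfidf.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.2f}")
```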

This can be used to build classifiers or predictive models, he said. For example, a company with about 10 years’ worth of customer call centre dialogue transcribed into text could tap into that data to figure out what it all says. To do this, Sullivan said, the calls could be classified into ‘conversations with customers about to leave’, ‘conversations with customers downgrading their service’ and ‘conversations with customers upgrading their service’.

“We adjust weights based on how frequently terms appear in a particular document set. Then we take those features, the words that show up frequently in a certain set, and they will be good indicators,” he said. “These weights for different words are what the machine learning algorithm uses to do the classification.

“We can take our new input and push it through, so we train our machine learning algorithm to classify these calls, then push it through this classifier model so we can identify them.”

Several machine learning algorithms can be used for classification, but Sullivan said ensemble methods, which combine multiple algorithms, are effective.

“The idea there is you train a bunch of classifiers and then essentially they vote on it and you take the majority answer,” he said. “Heterogeneous ensemble algorithms use a combination of different kinds. So you might take an SVM [support vector machine], a perceptron, a nearest centroid and a naïve Bayes [classifier] and combine those results.

“They can give you better results because each of these algorithms has its own strengths and weaknesses. You want algorithms that complement each other: where one is weak, the other is strong.”

Ensembles can also help maintain the quality and accuracy of a classifier and guard against making too many wrong predictions, Sullivan said. As a rough guide, he said, aim for precision and recall scores of 0.8 or above. Below that, the model parameters need tweaking, or feature selection and data quality need to be revisited.
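Here is a sketch of such a heterogeneous voting ensemble over TF–IDF features, with a precision and recall report at the end; the labelled call-centre snippets are invented for illustration:

```python
# A heterogeneous voting ensemble over TF-IDF features, along the lines
# Sullivan describes: SVM, perceptron, nearest centroid and naive Bayes
# voting on each call transcript. The tiny labelled set is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

calls = [
    "I want to cancel my account, your service is terrible",
    "Please downgrade me to the basic plan",
    "I'd like to upgrade to the premium package",
    "Cancel everything, I'm switching providers",
    "Can I move down to a cheaper plan",
    "Add the sports package to my subscription",
]
labels = ["leave", "downgrade", "upgrade", "leave", "downgrade", "upgrade"]

ensemble = VotingClassifier(
    estimators=[("svm", SVC()), ("perceptron", Perceptron()),
                ("centroid", NearestCentroid()), ("nb", MultinomialNB())],
    voting="hard")   # majority vote across the four classifiers

model = make_pipeline(TfidfVectorizer(), ensemble)
model.fit(calls, labels)

# Sullivan's rule of thumb: aim for precision and recall around 0.8 or above
print(classification_report(labels, model.predict(calls)))
```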

Sullivan cautioned that TF–IDF can be a “crude” analytics technique because it throws away a lot of information about syntax and about words that are semantically similar, such as hotel and motel.

4. Named entity recognition

NER is essentially about recognising proper nouns; it can be used to extract persons, organisations, geographic locations, dates, monetary amounts and the like from text. It works by looking at the words that surround them, Sullivan said.

“For example, a sentence might read: ‘Today Senator Rand Paul said X, Y and Z.’ The fact that the term Senator is there, followed by another word and then another, means they are probably names,” he said. “It learns that statistically by looking at enough examples, so you want to be able to train it on enough examples.”

For each word in a sentence, the technique records properties such as upper-case letters, lower-case letters and the punctuation that appears after the word. It also looks for patterns that indicate what kinds of words should follow other words.

“Statistically, we can do pattern recognition. This is really useful if you have your own specialised set of terms you want to look for and they are varied in structure and you are able to train it,” Sullivan said. “One of the best algorithms for this is called conditional random fields.”
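The sketch below shows the kind of per-word features such a model is trained on, matching the properties Sullivan lists; a CRF library would consume feature dictionaries of roughly this shape:

```python
# The kind of per-token features a conditional random field is trained on:
# capitalisation, digits, and the neighbouring words, as in
# "Today Senator Rand Paul said ...".
def word_features(tokens, i):
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalised": word[0].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = "Today Senator Rand Paul said the bill will pass".split()
features = [word_features(tokens, i) for i in range(len(tokens))]
print(features[2])   # features for "Rand": prev_word is "senator"
```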

Normalisation of words can also be useful with NER, Sullivan said. For instance, a document might switch between Senator and its abbreviation. To be consistent, and to make analysis easier and the words more recognisable, the abbreviations can be replaced with the full word Senator.
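A minimal normalisation sketch; the abbreviation list here is assumed for illustration:

```python
# Normalising abbreviations so that, for example, "Sen." and "Senator" are
# treated as the same token. The abbreviation list is illustrative.
import re

ABBREVIATIONS = {r"\bSen\.": "Senator", r"\bGov\.": "Governor"}

def normalise(text):
    for pattern, full in ABBREVIATIONS.items():
        text = re.sub(pattern, full, text)
    return text

print(normalise("Sen. Rand Paul met Gov. Smith"))
```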

However, abbreviations can be notoriously ambiguous, which makes normalisation difficult, Sullivan said. The whole exercise of using the NER technique involves significant data preparation and training, which can be time consuming, he said.

“Stanford Core NLP, OpenNLP and Mallet are all open source. But if you can use a commercial tool, I would suggest doing it,” Sullivan advised. “It’s hard to train these things and it can be very time consuming.”

5. Event extraction

Event extraction is a step beyond NER but harder to do, Sullivan said. It looks not only at the nouns, or what is being talked about, but also at the relationships between them and the kinds of inferences that can be drawn from the incidents referred to in the text.

Say you want to know which company is acquiring which other company. In that case, it’s not just the structure of the words that matters, but what is actually being said, Sullivan said.

“Are we saying company A is about to acquire company B?” he asked. “Or it could be a particular politician made an announcement about their run for the presidency in 2016 on Twitter on a particular date.”

Sullivan used the analogy of filling in a row on a relational table, where all the key pieces of information and how they relate to each other are revealed. Event extraction can assign roles to entities, assign subtypes and link to semantic data.
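The toy sketch below fills in one such ‘row’ for acquisition events using a single pattern; real event extraction systems rely on parsers and trained models rather than a regular expression:

```python
# A toy pattern-based event extractor that fills one "row in a relational
# table" for acquisition events. Real systems use parsing and trained
# models rather than a single regex.
import re

ACQUISITION = re.compile(
    r"(?P<acquirer>[A-Z][\w&]*(?: [A-Z][\w&]*)*) (?:is acquiring|acquired|will acquire) "
    r"(?P<target>[A-Z][\w&]*(?: [A-Z][\w&]*)*)")

def extract_acquisition(sentence):
    match = ACQUISITION.search(sentence)
    if not match:
        return None
    return {
        "event_type": "acquisition",          # the relation
        "acquirer": match.group("acquirer"),  # role: buyer
        "target": match.group("target"),      # role: company being bought
        "source_text": sentence,
    }

print(extract_acquisition("Company A is acquiring Company B for $2 billion"))
```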

Alan Ritter and Sam Clark have built event extraction tools for Twitter. There are also biomedical tools for this, such as the Stanford Biomedical Event Parser and Turku BioNLP, as well as the commercial tool AlchemyAPI, Sullivan said.