An Overview of Language Processing

Natural Language Processing (NLP) is a branch of artificial intelligence that reads text and extracts the important information from it, organizing unstructured language data so that other systems can process it. It is one of the most promising avenues for processing social media data. The challenge lies in building algorithms powerful enough to extract valuable information from massive volumes of data that are collected from many sources, in many languages and in different formats. The Walt Disney Company is one multinational that has used NLP to extract information from social media.

How can language processing help in evaluating social media data?

Disney has more than 200,000 employees and, like most other companies, has an interest in knowing how its customers feel about its products and services. This information has traditionally been gathered through phone calls, emails, in-person and mail-in surveys, questionnaires, and so on. The exercise is not cheap, and it takes a lot of time to create, distribute, collect and then evaluate the surveys. This is where NLP can play a role: Disney wanted to use language processing tools to understand public opinion about the brand in real time.

To begin with, the data has to be collected as text, and a platform like Twitter is well suited to this. Using a script, it was possible to download every tweet containing the hashtag “Disney” as soon as it was posted. Along with the tweet text, you also get a large amount of metadata about the tweet and about the users who tweeted or retweeted it. All the tweets were then stored in the cloud for a month. The roughly 500,000 Disney tweets were trimmed down to 320,000 by keeping only those in English, and these were run through TextBlob, a Python library for text processing.

The Disney brand includes many other brands, such as Lucasfilm, Marvel, ABC, ESPN, and Disney Parks and Resorts. The idea was to learn about opinion on the specific brands and intellectual properties. This is where data cleaning, vectorization and clustering become important. One problem in working with tweets is the large number of duplicates and retweets. So the symbols, URLs and retweet tags were stripped from the tweets, and identical texts were then collapsed to get a unique set of data. After cleaning, the number of tweets came down to about 98,000.
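A minimal sketch of this kind of cleaning, using only the Python standard library, might look like the following. The regular expressions and the function names clean_tweet and unique_cleaned are illustrative assumptions, not the rules that were actually used.

```python
import re

# Illustrative patterns (assumptions): a leading retweet tag such as "RT @user:",
# any http/https URL, and every character that is not a letter, digit or space.
RETWEET_RE = re.compile(r"^RT\s+@\w+:?\s*")
URL_RE = re.compile(r"https?://\S+")
SYMBOL_RE = re.compile(r"[^A-Za-z0-9\s]")


def clean_tweet(text):
    """Strip the retweet tag, URLs and symbols, then normalize case and spacing."""
    text = RETWEET_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = SYMBOL_RE.sub("", text)
    return " ".join(text.lower().split())


def unique_cleaned(texts):
    """Clean every tweet and keep the first occurrence of each distinct text."""
    seen = set()
    result = []
    for text in texts:
        cleaned = clean_tweet(text)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


raw = [
    "Loved #Disney! https://t.co/abc",
    "RT @fan: Loved #Disney! https://t.co/xyz",
    "New Marvel trailer",
]
print(unique_cleaned(raw))  # ['loved disney', 'new marvel trailer']
```

The original tweet and its retweet differ only in the retweet tag and the link, so once both are stripped they collapse into a single row.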

An algorithm takes numbers as input, and there are multiple ways to convert text into numbers. First the tweet is split into tokens, a step known as tokenization; a token can be a single word or a combination of letters or words. The tweet is then turned into a count for every token it contains, which is known as vectorization: when you vectorize text, you basically assign a value to every token. With 98,000 tweets, the number of distinct tokens, and so the number of counts stored for every tweet, is going to be huge.
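To make the idea concrete, here is a small bag-of-words sketch using only the standard library. It assumes the simplest possible tokenizer, splitting on whitespace, and the function names tokenize and vectorize are made up for this example.

```python
from collections import Counter


def tokenize(text):
    """Split a tweet into lowercase word tokens (whitespace tokenizer, an assumption)."""
    return text.lower().split()


def vectorize(texts):
    """Return the sorted vocabulary and one row of token counts per text."""
    token_counts = [Counter(tokenize(text)) for text in texts]
    vocabulary = sorted(set().union(*token_counts))
    # A Counter returns 0 for tokens that do not occur in a given tweet.
    matrix = [[counts[token] for token in vocabulary] for counts in token_counts]
    return vocabulary, matrix


vocabulary, matrix = vectorize(["marvel movie marvel", "disney parks"])
print(vocabulary)  # ['disney', 'marvel', 'movie', 'parks']
print(matrix)      # [[0, 2, 1, 0], [1, 0, 0, 1]]
```

Each tweet becomes one row and each distinct token one column, which is why the table grows so wide once tens of thousands of tweets are involved.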

For an average person looking at a table of 98,000 rows of tweets, it is very hard to decide whether the tweets are alike or different, since they all look the same. The idea is to make the dataset easier to work with, because having almost 10,000 columns, one per token, might be accurate but is challenging for analysis. To reduce these columns, Singular Value Decomposition (SVD) was used. It takes the information from the tweets and distills it into a small number of new columns. The process is quite complex and involves linear algebra, but the main idea is to compress all that information into a few dozen columns. The drawback is that the distilled columns are difficult to interpret directly. This is why a clustering algorithm is then used: it assigns a cluster number to every data point so that tweets with similar text are grouped together.
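The reduce-then-cluster step could be sketched with scikit-learn as shown below. The article does not name the library or the clustering algorithm, so scikit-learn, TruncatedSVD and k-means are assumptions here, as are the toy tweets, the parameter values and the function name cluster_tweets.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy, already-cleaned tweets (made up for illustration).
tweets = [
    "new marvel movie trailer",
    "marvel movie tickets on sale",
    "watching the marvel movie tonight",
    "espn game highlights tonight",
    "espn game score update",
    "live espn game coverage",
]


def cluster_tweets(texts, n_components=2, n_clusters=2, seed=0):
    """Vectorize texts, compress the count columns with SVD, then cluster the rows."""
    counts = CountVectorizer().fit_transform(texts)  # one column per token
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    reduced = svd.fit_transform(counts)  # a few dense columns per tweet
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(reduced)  # one cluster number per tweet
    return reduced, labels


reduced, labels = cluster_tweets(tweets)
for label, text in zip(labels, tweets):
    print(label, text)
```

On the real dataset the same pattern would use a few dozen components rather than two, and the number of clusters would have to be chosen by inspecting the results.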