Big Data and Natural Language Processing – Part 1

We hope you enjoyed the introduction to this series; part 1 is below.

“The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called “grammar” was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.” -Ferdinand de Saussure

Language is dynamic: trying to write rules that capture its full scope (e.g., grammars) fails because language changes so rapidly. It is far more tractable to learn from as many examples as possible and estimate the likely meaning of an utterance; this, after all, is what humans do. Natural Language Processing and Computational Linguistics are therefore stochastic methodologies, a subset of artificial intelligence that benefits greatly from Machine Learning techniques.
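To make that concrete, here is a minimal sketch (assuming NLTK and its Brown corpus are installed) of what "estimating the likelihood" looks like in code: no grammar rules, just counts of what actually follows what in observed text. The corpus and the word pair are arbitrary choices for illustration, not anything prescribed by the talk.

```python
import nltk
from nltk.corpus import brown

# Requires: pip install nltk, plus the Brown corpus data.
nltk.download('brown', quiet=True)

words = [w.lower() for w in brown.words()]

# Count which words follow which; no grammar rules at all,
# just observed bigram frequencies.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

# Relative frequency of "states" immediately following "united".
print(cfd['united'].freq('states'))
```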

Machine Learning has many flavors, and most of them attempt to get at the long tail, i.e. the low-frequency events where the most relevant analysis occurs. To capture these events without resorting to some sort of comprehensive smoothing, more data is required; indeed, the more data the better. I have yet to observe a machine learning discipline that complained of having too much data (generally speaking, they complain of having too much modeling, i.e. overfitting). The stochastic approach of NLP therefore needs Big Data.
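Here is a hedged sketch of the smoothing point, again assuming NLTK and the Brown corpus: on a small sample, a raw maximum-likelihood estimate assigns zero probability to any word it has not seen, while Lidstone smoothing spreads a little probability mass over the long tail. More data shrinks the set of events that need this treatment. The sample size and probe word are illustrative assumptions.

```python
import nltk
from nltk.corpus import brown
from nltk.probability import MLEProbDist, LidstoneProbDist

nltk.download('brown', quiet=True)

# A deliberately small sample: long-tail words are likely missing.
sample = [w.lower() for w in brown.words()[:5000]]
fd = nltk.FreqDist(sample)

mle = MLEProbDist(fd)
# Lidstone smoothing adds a small pseudo-count (gamma) to every bin,
# so unseen events get non-zero probability.
smoothed = LidstoneProbDist(fd, gamma=0.1, bins=fd.B() + 1)

print(mle.prob('jurisprudence'))       # 0.0 if unseen in the sample
print(smoothed.prob('jurisprudence'))  # small but non-zero
```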

[Slide 6: NLP of Big Data using NLTK and Hadoop]

The flip side of the coin is not as straightforward. We know there are many massive natural language data sets on the web and elsewhere: consider tweets, reviews, job listings, emails, and so on. These data sets fulfill the three V's of Big Data: velocity, variety, and volume. But do they require comprehensive natural language processing to produce interesting data products?

[Slide 7: NLP of Big Data using NLTK and Hadoop]

The answer is: not yet. Hadoop and other tools already have built-in text-processing support. Many approaches are being applied to these data sets, particularly inverted indices, collocation scores, and even n-gram modeling. However, these approaches are not true NLP; they are essentially search. They leverage string manipulation and lightweight syntactic analysis to perform frequency analyses.
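As an illustration of how lightweight this kind of processing is, here is a sketch of a word-frequency job written for Hadoop Streaming. The script names and the naive lowercase-and-split tokenization are assumptions for the example; the point is that nothing here requires linguistic knowledge beyond string handling.

```python
#!/usr/bin/env python
# mapper.py: emit (token, 1) for every whitespace-delimited token on stdin.
import sys

for line in sys.stdin:
    for token in line.lower().split():
        print("%s\t%d" % (token, 1))
```

```python
#!/usr/bin/env python
# reducer.py: sum counts per token. Hadoop Streaming sorts mapper output
# by key, so all lines for a given token arrive consecutively.
import sys

current, total = None, 0
for line in sys.stdin:
    token, count = line.rstrip("\n").split("\t")
    if token != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = token, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```

A job like this would be launched with the streaming jar's -mapper, -reducer, -input, and -output options; it scales to arbitrarily large corpora, but it only ever counts strings.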

[Slide 8: NLP of Big Data using NLTK and Hadoop]

We have not yet exhausted the opportunities these frequency analyses offer; many interesting results, particularly in clustering, classification, and authorship forensics, have been explored. However, these approaches will soon stop producing the more interesting results that users are coming to expect. Products like machine translation, sentence generation, text summarization, and more meaningful text recommendation will require strong semantic methodologies. Eventually Big Data will come to require NLP; it's just not there yet.
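For a sense of how far frequency alone can be pushed, here is a hedged sketch of a bag-of-words classifier using NLTK's Naive Bayes implementation; the labels and sentences are toy data invented for the example, standing in for a real classification or authorship task.

```python
import nltk

def bag_of_words(text):
    # Binary bag-of-words features: purely string/frequency based, no semantics.
    return {token: True for token in text.lower().split()}

# Invented toy data standing in for a real labeled corpus.
train = [
    (bag_of_words("the shipment arrived two days late"), "negative"),
    (bag_of_words("late delivery and a damaged box"), "negative"),
    (bag_of_words("fast shipping and great packaging"), "positive"),
    (bag_of_words("great product, arrived early"), "positive"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(bag_of_words("the box arrived damaged and late")))
print(classifier.classify(bag_of_words("great product with fast shipping")))
```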

Benjamin Bengfort

Chief Data Scientist at Cobrain
Benjamin is a data scientist with a passion for massive machine learning involving gigantic natural language corpora, and has been leveraging that passion to develop a keen understanding of recommendation algorithms at Cobrain in Bethesda, MD, where he serves as the Chief Data Scientist. With a professional background in military and intelligence and an academic background in economics and computer science, he brings a unique set of skills and insights to his work. Ben believes that data is a currency that can pave the way to discovering insights and solving complex problems. He is also currently pursuing a PhD in Computer Science at the University of Maryland.
