Natural Language Processing and Big Data: Using NLTK and Hadoop – Talk Overview


My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children's literature. Our approach started as a natural language processing problem, designed to pull out language features to train our algorithms, and quickly became a big data problem when we realized how much literature we had to process in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk given at the April Data Science DC meetup.
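To make "pulling out language features" concrete, here is a hypothetical sketch of the kind of surface-level features one might extract to estimate textual complexity. The function name and feature set are illustrative, not Unbound Concepts' actual algorithm; plain string splitting stands in for NLTK's `sent_tokenize` and `word_tokenize` so the sketch is self-contained.

```python
# Hypothetical sketch: surface features often correlated with reading level.
# In practice, nltk.sent_tokenize / nltk.word_tokenize would replace the
# naive splits below.

def text_features(text):
    """Return simple complexity features for a passage of text."""
    # Naive sentence split on terminal punctuation (stand-in for sent_tokenize).
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    # Naive word split on whitespace (stand-in for word_tokenize).
    words = text.split()
    return {
        "sentences": len(sentences),
        "words": len(words),
        # Longer sentences and longer words both push reading level up.
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w.strip(".,!?")) for w in words)
                           / max(len(words), 1),
    }

features = text_features("The cat sat. The cat saw a bird. It flew away.")
```

Features like these become the columns of a training set; the big data problem arrives when you compute them over an entire corpus of books.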

Think of this post as the CliffsNotes of the talk and the upcoming series of posts, so you don't have to read every word ... but trust me, it's worth it.

Related to the interaction between Big Data and NLP:

  • Natural Language Processing needs Big Data
  • Big Data doesn’t need NLP… yet.

Related to using Hadoop and NLTK:

  • The combination of NLTK and Hadoop is perfect for preprocessing raw text.
  • Deeper semantic analyses tend to be graph problems that MapReduce isn't well suited to compute.
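As an illustration of the preprocessing point above, here is a minimal sketch of a Hadoop Streaming-style mapper. This is not code from the talk; a lowercase whitespace split stands in for `nltk.word_tokenize` (which assumes NLTK and its tokenizer data are installed on every worker node), and the emitted records follow the classic tab-separated (token, count) shape.

```python
# Minimal Hadoop Streaming mapper sketch for text preprocessing.

def tokenize(line):
    """Stand-in tokenizer; swap in nltk.word_tokenize in practice."""
    return line.lower().split()

def mapper(lines):
    """Yield tab-separated (token, 1) records for downstream reducers."""
    for line in lines:
        for token in tokenize(line):
            yield f"{token}\t1"

# Hadoop Streaming pipes each input split to stdin and collects stdout:
#   import sys
#   for record in mapper(sys.stdin):
#       print(record)
```

Because streaming mappers see each line independently, per-document tokenization parallelizes trivially; it is the graph-shaped semantic analyses, with their cross-record dependencies, that fight the MapReduce model.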

About data products in general:

  • The foo of Big Data is the ability to take domain knowledge and a data set (or sets) and iterate quickly through hypotheses using available tools (like NLP).
  • The magic of Big Data is that there is currently a surplus of both data and knowledge, and our tools are working, so it's easy to come up with a data product (until demand meets supply).

I'll go over each of these points in detail as I did in my presentation, so stay tuned for the longer version. [Editor: so long that it has been broken up into multiple posts.]




Benjamin Bengfort

Chief Data Scientist at Cobrain
Benjamin is a data scientist with a passion for massive machine learning involving gigantic natural language corpora, and has been leveraging that passion to develop a keen understanding of recommendation algorithms at Cobrain in Bethesda, MD, where he serves as the Chief Data Scientist. With a professional background in military and intelligence, and an academic background in economics and computer science, he brings a unique set of skills and insights to his work. Ben believes that data is a currency that can pave the way to discovering insights and solving complex problems. He is also currently pursuing a PhD in Computer Science at the University of Maryland.


This entry was posted in Data Science DC, Events, Reviews, Tutorials. Bookmark the permalink.