The “Foo” of Big Data – Part 2

Welcome to Part 2 of this epic Big Data and Natural Language Processing perspective series. Here is the intro and part one if you missed any of them.

NLP of Big Data using NLTK and Hadoop9

Domain knowledge is incredibly important, particularly in the context of stochastic methodologies, and particularly in NLP. Not all language, text, or domains have the same requirements, and there is no way to make a universal model for it. Consider how the language of doctors and lawyers may be outside our experience in the language of computer science or data science. Even more generally, regions tends to have specialized dialects or phrases even within the same language. As an anthropological rule, groups tend to specialize particular language features to communicate more effectively, and attempting to capture all of these specializations leads to ambiguity.

This leads to an interesting hypothesis: the foo of big data is to combine specialized domain knowledge with an interesting data set.

NLP of Big Data using NLTK and Hadoop11

Further, given that domain knowledge and an interesting data set or sets:

  1. Form hypothesis (a possible data product)
  2. Mix in NLP techniques and machine learning tools
  3. Perform computation and test hypothesis
  4. Add to data set and domain knowledge
  5. Iterate

If this sounds like the scientific method, you’re right! This is why it’s cool to hire PhDs again; in the context of Big Data, science is needed to create products. We’re not building bridges, we’re testing hypotheses, and this is the future of business.

But this alone is not why Big Data is a buzzword. The magic of big data is that we can iterate through the foo extremely rapidly and  multiple hypotheses can be tested via distributed computing techniques in weeks instead of years or ever shorter time periods. There is currently a surplus of data and domain knowledge, so everyone is interested in getting their own piece of data real estate, that is, getting their domain knowledge and their data set. The demand is rising to meet the surplus, and as a result we’re making a lot of money. Money means buzzwords. This is the magic of Big Data!


The following two tabs change content below.

Benjamin Bengfort

Chief Data Scientist at Cobrain
Benjamin is a data scientist with a passion for massive machine learning involving gigantic natural language corpora, and has been leveraging that passion to develop a keen understanding of recommendation algorithms at Cobrain in Bethesda, MD where he serves as the Chief Data Scientist. With a professional background in military and intelligence, and an academic background in economics and computer science, he brings a unique set of skills and insights to his work. Ben believes that data is a currency that can pave the way to discovering insights and solve complex problems. He is also currently pursuing a PhD in Computer Science at the University of Maryland.

Latest posts by Benjamin Bengfort (see all)

This entry was posted in Data Science DC, Events, Reviews, Tutorials and tagged , , , , , , . Bookmark the permalink.