Data Community DC Video Series Kicks Off: Dr. Jesse English Talks NLP and Text Processing

We are excited to announce the first in a new series of posts and a brand new initiative: Data Community DC Videos! We are going to film and publish online videos (and separate audio, resources permitting) of as many talks from Data Community DC meetups as possible. Yes, we want you to experience the events in person, but we realize that not everyone who wants to be a part of our community can attend every single event.

To kick this off, we have a fantastic video of Dr. Jesse English passionately discussing a brand new, open source framework, WIMs (Weakly Inferred Meanings), a novel approach to creating structured meaning representations for semantic analyses. Whereas a TMR (text meaning representation) requires a large, domain-specific knowledge base and significant computation time, WIMs cover a limited scope of possible relationships. The limitation is intentional and allows for better performance, but it still carries enough relationships for most applications. Additionally, the creation of a bespoke knowledge base and microtheory is not required: the novel pattern matching technique means that available ontologies like WordNet provide enough coverage. WIMs are open source and available now, and are truly a breakthrough in semantic processing.

Posted in Community, Data Science MD, Python, Tutorials, Videos

Evidence from Google IO: Recommendation Engines are not MVPs

My co-editor’s earlier post today about recommendation engines is spot on, and I want to add not only my strong agreement but also some anecdotal support for her conclusions.

Google IO 2013, which concluded last week, was filled with developer-oriented announcements. As a result, some of the more consumer-focused announcements and their ramifications were glossed over. Google just announced that they are now recommending books, apps, and music via Google Play. In other words, Google just launched their own recommendation engine. When I brought this point up with a few Googlers, I was told that quite a few teams at Google had attempted to build such a recommendation engine before and met with less than stellar success. And, remember, this is Google. There have been 48 billion app installs. They are indexing the world’s data. They have the Knowledge Graph. They probably have your email. Yet, with more data than anyone else on the planet, ridiculous computing super-infrastructure, and immense pools of elite talent, Google Play is just now getting its own recommendation engine in 2013. Recommendation engines are NOT minimum viable products, full stop.

Posted in Micro, Rant

Why You Should Not Build a Recommendation Engine

One does not simply build an MVP with a recommendation engine

Recommendation engines are arguably one of the trendiest uses of data science in startups today. How many new apps have you heard of that claim to “learn your tastes”? However, recommendation engines are widely misunderstood, both in terms of what is involved in building one and in terms of what problems they actually solve. A true recommender system involves some fairly hefty data science; it’s not something you can build by simply installing a plugin without writing code. With the exception of very rare cases, it is not the killer feature of your minimum viable product (MVP) that will make users flock to you, especially since there are so many fake and poorly performing recommender systems out there.

Posted in Commentary, Data Science DC, Methods, Rant

Weekly Round-Up: Google’s Quantum Computer, Data Science vs. Statistics & BI, Business Computing, and Detecting Terrorism Networks

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have four fascinating articles ranging in topic from Google’s new quantum computer to detecting terrorist networks.

In this week’s round-up:

  • Google Buys a Quantum Computer
  • Statistics vs. Data Science vs. BI
  • Could Business Computing Be Done by Users Without Technical Experience?
  • Can Math Models Be Used to Detect Terrorism Networks?

Posted in Round-Ups

A Revolution in Cloud Pricing: Minute By Minute Cloud Billing for Everyone

Google IO wrapped up last week with a tremendous number of data-related announcements. Today’s post is going to focus on Google Compute Engine (GCE), Google’s answer to Amazon’s Elastic Compute Cloud (EC2) that allows you to create and run virtual compute instances within Google’s cloud. We have spent a good amount of time talking about GCE in the past, in particular, benchmarking it against EC2 here, here, here, and here.

The main GCE announcement at IO was, of course, the fact that now **anyone** and **everyone** can try out and use GCE. Yes, GCE instances now support up to 10 terabytes per disk volume, which is a BIG deal. However, the fact that GCE will use minute-by-minute pricing, which might not seem incredibly significant on the surface, is an absolute game changer.
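
To see why the billing increment matters, here is a minimal sketch of the arithmetic. The hourly rate and the 61-minute job are hypothetical, chosen only for illustration, and the sketch ignores any minimum charge a provider might apply.

```python
import math


def billed_cost(runtime_minutes, hourly_rate, increment_minutes):
    """Cost when usage is rounded up to the next billing increment."""
    increments = math.ceil(runtime_minutes / increment_minutes)
    return increments * increment_minutes * hourly_rate / 60


HOURLY_RATE = 0.60  # hypothetical price in dollars per instance-hour

# A job that runs for 61 minutes on a single instance.
per_hour = billed_cost(61, HOURLY_RATE, increment_minutes=60)   # rounded up to 120 minutes
per_minute = billed_cost(61, HOURLY_RATE, increment_minutes=1)  # billed for 61 minutes

print(f"hourly billing:     ${per_hour:.2f}")
print(f"per-minute billing: ${per_minute:.2f}")
```

Under these assumptions, the job that spills one minute past the hour is charged for two full hours with hourly billing, but only for the 61 minutes it actually ran with per-minute billing. The gap grows with the number of short-lived instances you launch.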

Posted in Commentary

Beyond Preprocessing – Weakly Inferred Meanings – Part 5

Congrats! This is the final post in our six-part series! Just in case you have missed any parts, click through to the introduction, part 1, part 2, part 3, and part 4.

NLP of Big Data using NLTK and Hadoop (slide 31)

After you have treebanks, then what? The answer is that syntactic guessing is not the final frontier of NLP; we must go beyond it to something more semantic. The idea is to determine the meaning of text in a machine-tractable way by creating a TMR, a text meaning representation (or thematic meaning representation). This, however, is not a trivial task, and now you’re at the frontier of the science.

Posted in Data Science DC, Events, Resources, Tutorials

Hadoop for Preprocessing Language – Part 4

We are glad that you have stuck around for this long and, just in case you have missed any parts, click through to the introduction, part 1, part 2, and part 3.

NLP of Big Data using NLTK and Hadoop (slide 21)

You might ask me, doesn’t Hadoop do text processing extremely well? After all, the first Hadoop jobs we learn are word count and inverted index!
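
For readers who have not seen those two first jobs, here is a rough sketch of both in plain Python. It does not use Hadoop at all: `run_mapreduce` is a hypothetical in-memory stand-in for the map, shuffle/sort, and reduce phases, and the two-document corpus is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for the map, shuffle/sort, and reduce phases."""
    pairs = [kv for record in records for kv in mapper(record)]
    pairs.sort(key=itemgetter(0))  # the "shuffle": bring equal keys together
    return {key: reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))}


# Hypothetical toy corpus: (document id, text) records.
docs = [("d1", "the cat sat"), ("d2", "the dog sat down")]


def wc_map(record):
    """Word count: emit (word, 1) for every token."""
    _, text = record
    for word in text.split():
        yield word, 1


def wc_reduce(word, counts):
    return sum(counts)


def index_map(record):
    """Inverted index: emit (word, document id) for every token."""
    doc_id, text = record
    for word in text.split():
        yield word, doc_id


def index_reduce(word, doc_ids):
    return sorted(set(doc_ids))


word_counts = run_mapreduce(docs, wc_map, wc_reduce)
inverted_index = run_mapreduce(docs, index_map, index_reduce)
```

Both jobs treat text as a bag of whitespace-separated tokens, which is exactly why they are easy: neither needs any knowledge of the language being processed.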

Posted in Data Science DC, Events, Reviews, Tutorials

Python’s Natural Language Toolkit (NLTK) and Hadoop – Part 3

Welcome back to part 3 of Ben’s talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

NLP of Big Data using NLTK and Hadoop (slide 12)

Posted in Data Science DC, Events, Reviews, Tutorials

Weekly Round-Up: Open Data Order, Data Discovery, Andrew Ng, and Connected Devices

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have four fascinating articles ranging in topic from Open Data to connected devices.

In this week’s round-up:

  • Open Data Order Could Save Lives, Energy Costs And Make Cool Apps
  • Four Types of Discovery Technology
  • Andrew Ng and the Quest for the New AI
  • Our Connected Future

Posted in Round-Ups

Event Review: Analyzing Twitter: An End-to-End Data Pipeline Recap

Data Science MD once again returned to the wonderful Ad.com in Baltimore to discuss Analyzing Twitter: An End-to-End Data Pipeline with two experts from Cloudera.

Posted in Data Science MD