The Rise of Data Products

by Sean Murphy

I had the great opportunity to present at the kick-off event for the Mid-Maryland Data Science Meetup on “The Rise of Data Products”. Below is the talk captured in images and text.

Update: You can also download the audio here and follow along.

 

Slide01

Tonight’s talk focuses on what I see as a new (or continuing) Gold Rush, one that I could not be more excited about.

Before we can talk about the rise of data products, we need to define a data product. Hilary Mason provides the following definition: “a data product is a product that is based on the combination of data and algorithms.” To flesh this definition out a bit, here are some examples.

1) LinkedIn has a well-known data science team, and highlighted below is one of its data products: a vanity metric indicating how many times you have appeared in searches and how many times people have viewed your profile. While some may argue that this is more of a data feature than a product, I am sure it drives revenue, as you have to pay to find out who is viewing your profile.

Slide04

2) Google’s search is the consummate data product. Take one part giant web index (the data), add the PageRank algorithm (the algorithms), and you have a ubiquitous data product.

Slide05

3) Last, but not least, is Hipmunk. This company allows users to search flight data and visualize the results in an easy-to-understand fashion. Additionally, Hipmunk attempts to quantify the pain entailed by different flights (those 3 layovers add up) in an “agony” metric.

 

Slide06
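To make the “data plus algorithms” idea from the Google example concrete, here is a minimal sketch in Python. It is not Google’s implementation: it is a simplified, textbook-style PageRank run on a made-up four-page link graph, and the function name, page names, and parameter values are all assumptions chosen for illustration.

```python
# A toy "data product": a tiny link graph (the data) ranked by a
# simplified PageRank power iteration (the algorithm).
# The page names and link structure below are made up for illustration.

def pagerank(links, damping=0.85, iterations=50):
    """Return an approximate PageRank score for each page in `links`.

    `links` maps each page to the list of pages it links to.
    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank everywhere.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The data alone (who links to whom) is just a table; the ranking only appears once the algorithm is applied to it, which is the sense in which the two together make a product.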

So let’s try a slightly different definition: a data product is the combination of data and algorithms that creates value (social, financial, and/or environmental in nature) for one or more individuals.

Slide07

One can argue that data products have been around for some time, and I would completely agree. However, the question behind this talk is: why are they exploding now?

Slide08

I would argue that it is all about supply and demand. For this brief 15-minute talk (a distillation of a much longer talk), I am going to constrain the data product supply issue to the availability and cost of the tools required to explore data and the infrastructure required to deliver data products. On the demand side, I am going to do a “proof by example,” complete with much arm waving, to show that today’s mass market consumers want data.

Slide09

On the demand side, let’s start with something humans have been doing ever since they came down from the trees: running.

With a small sensor embedded in the shoe (not the only way these days), Nike+ collects detailed information about runners and simply cannot give enough data back to its customers. As evidence of general demand for data products, Nike+ users had logged over 2 billion miles as of 1/29/2013.

Slide10

As further evidence of mass market data desire, 23andMe has convinced nearly a quarter million people to spit into a little plastic cup, seal it up, mail it off, and get their DNA sequenced. 23andMe then gives the data back to the user in the form of a genetic profile, complete with relative genetic disease risks and clear, detailed explanations of those numbers.

Slide11

And finally, there is Google Maps, or GPS navigation in general: merging complex GIS data with sophisticated algorithms to compute optimal paths and estimated times of arrival. Who doesn’t use this data product?

Slide12

In closing, the case for overwhelming data product demand is strong ::insert waving arms:: and made stronger by the fact that our very language has become sprinkled with quasi-statistical and mathematical terms. Who would ever have thought that pre-teens would talk about something trending?

Slide13

Let’s talk about the supply side of the equation now, starting with the tools required to explore data.

Slide14

Then: Everyone’s “favorite” old-school tool, Excel, costs a few hundred dollars depending on many factors.

Now: Google Docs has a spreadsheet where 100 of your closest friends can simultaneously edit your data while you watch in real time.

And the cost? FREE.

Slide17

Let’s take a step past spreadsheets and rapidly prototype some custom algorithms using MATLAB (yes, some would do it in C, but I would argue that most can do it faster in MATLAB). The only problem here is that MATLAB ain’t cheap. Beware when a login is required to get even individual license pricing.

Now, you have Python and a million different modules to support your data diving and scientific needs. Or, for the really adventurous, you can jump to the very forward-looking, wickedly fast, big-data-ready Julia. If a scientific/numeric programming language can be sexy, it would be Julia.

And the cost? FREE.

Slide20

Let’s say you want to work with data frames and run some hardcore statistical analyses. For a number of years, you have had SAS, Stata, and SPSS, but these tools come at an extremely high cost. Now, you have R. And it’s FREE.

Slide23

Yes, an amazing set of robust and flexible tools for exploring data and prototyping data products can now be had for the low, low price of free, which is a radical departure from the days of yore.

Now that you have an amazing result from using your free tools, it is time to tell the world.

Back in the day (think Copernicus and Galileo), you would write a letter containing your amazing results (your data product), which would then take a few months to reach a colleague (your market). This was not a scalable infrastructure.

Slide24

Contemporary researchers push their findings out through the twisted world of peer-reviewed publications … where the content producers (researchers) often have to pay to get published while someone else makes money off of the work. Curious. More troubling is the fact that these articles are static.

Now, if you want to reach a global audience, you can pick up a CMS like WordPress or a web framework such as Rails or Django and build an interactive application.  Oh yeah, these tools are free.

Slide27

So the tools are free, and now the question of infrastructure must be addressed. But before we hit infrastructure, I need to at least mention that overused buzzword, “big data.”

In terms of data products, “big data” is interesting for one simple reason: having more data increases the odds of having something valuable to at least someone.

Think of it this way: if Google had only indexed a handful of pages, “Google” would never have become the verb that it is today.

Slide28

If you noticed the pattern of tools getting cheaper, you will see the exact same trend with data stores. Whether your choice is relational or NoSQL, big or little data, you can have your pick for FREE.

Slide31 Slide34

With data stores available for the low cost of nothing, we need actual computers to run everything. Traditionally, you bought servers, which cost an arm and a leg, and that is before siting requirements, maintenance, and other costs. Now Amazon’s EC2 and Google Compute Engine allow you to spin up a cluster of 100 instances in a few minutes. Even better, with Heroku, which sits on top of Amazon’s infrastructure, you can stand up any number of different data stores in minutes.

Slide37

 

Why should you be excited? Because the entire tool set and the infrastructure required to build and offer world-changing data products is now either free or incredibly low cost.

Let me put it another way. Imagine if Ford started giving away car factories, complete with all required car parts, to anyone with the time to make cars!!!!!

Slide38

Luckily, there are individuals who will put this free factory to work. These “data scientists” understand the entire data science stack, or pipeline. By themselves, they can take raw data all the way to a product ready to be consumed globally (or at least make a pretty impressive prototype). While these individuals are relatively rare now, that will change. Such an opportunity will draw a flood of people, and the flow will only increase as the tools become simpler to use.

Slide39

Let’s make the excitement a bit more personal and go back to that company with a lovable logo, Hipmunk.

If I remember the story correctly, two guys at the end of 2010 taught themselves Ruby on Rails and built what would become the Hipmunk we know and love today.

Two guys.

Three months.

Learned to code.

And, by the way, Hipmunk had raised $20.2 million in funding two years later!

Slide40

It is a great time to work with data.

Slide41 Slide42


Sean Murphy

Senior Scientist and Data Science Consultant at JHU
Sean Patrick Murphy, with degrees in math, electrical engineering, and biomedical engineering and an MBA from Oxford, has served as a senior scientist at Johns Hopkins University for over a decade, advises several startups, and provides learning analytics consulting for EverFi. Previously, he served as the Chief Data Scientist at a Series A-funded health care analytics firm and as the Director of Research at a boutique graduate educational company. He has also cofounded a big data startup and Data Community DC, a 2,000-member organization of data professionals. Find him on LinkedIn and Twitter.