by Sean Murphy
I had the great opportunity to present at the kick-off event for the Mid-Maryland Data Science Meetup on “The Rise of Data Products”. Below is the talk captured in images and text.
Update: You can also download the audio here and follow along.
Tonight’s talk is focused on capturing what I see as a new (or continuing) Gold Rush, one that I could not be more excited about.
Before we can talk about the Rise of Data products, we need to define a Data Product. Hilary Mason provides the following definition: “a data product is a product that is based on the combination of data and algorithms.” To flesh this definition out a bit, here are some examples.
1) LinkedIn has a well-known data science team, and highlighted below is one such data product – a vanity metric indicating how many times you have appeared in searches and how many times people have viewed your profile. While some may argue that this is more of a data feature than a product, I am sure it drives revenue, as you have to pay to find out who is viewing your profile.
2) Google’s search is the consummate data product. Take one part giant web index (the data), add the PageRank algorithm (the algorithms), and you have a ubiquitous data product.
3) Last, and not least, is Hipmunk. This company allows users to search flight data and visualize the results in an easy-to-understand fashion. Additionally, Hipmunk attempts to quantify the pain entailed by different flights (those 3 layovers add up) into an “agony” metric.
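Hipmunk’s actual agony formula is proprietary, but the idea is easy to sketch: collapse price, duration, and layovers into a single score you can sort on. Everything below (the weights, the field names, the sample flights) is invented for illustration.

```python
# Hypothetical "agony" score: fold price, duration, and layover count
# into one number so flights can be ranked on a single axis.
# The weights are made up; Hipmunk's real formula is not public.

def agony(price_usd, duration_hr, layovers):
    return price_usd * 1.0 + duration_hr * 25.0 + layovers * 75.0

flights = [
    {"id": "nonstop", "price_usd": 420, "duration_hr": 5.5, "layovers": 0},
    {"id": "cheap", "price_usd": 250, "duration_hr": 14.0, "layovers": 3},
]

# Sort ascending: the least agonizing flight comes first.
for f in sorted(flights, key=lambda f: agony(f["price_usd"], f["duration_hr"], f["layovers"])):
    print(f["id"], agony(f["price_usd"], f["duration_hr"], f["layovers"]))
```

Note that with these invented weights, the pricier nonstop flight wins: three layovers and nine extra hours cost more “agony” than $170 of savings buys back.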
So let’s try a slightly different definition – a data product is the combination of data and algorithms that creates value–social, financial, and/or environmental in nature–for one or more individuals.
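To make the data-plus-algorithms recipe concrete, here is a toy sketch of the PageRank idea from the Google example above: importance is repeatedly redistributed across a link graph until the ranks settle. The three-page graph, damping factor, and iteration count are all illustrative choices, not Google’s.

```python
# Toy PageRank: data (a link graph) + an algorithm (iterative rank
# redistribution) = a ranking you could build a product on.
# The graph and parameters below are invented for illustration.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)  # split rank among out-links
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "c" — it collects links from both "a" and "b"
```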
One can argue that data products have been around for some time, and I would completely agree. The question this talk asks, however, is why they are exploding now.
I would argue that it is all about supply and demand. And, for this brief 15 minute talk (a distillation of a much longer talk), I am going to constrain the data product supply issue to the availability and cost of the tools required to explore data and the infrastructure required to deliver data products. On the demand side, I am going to do a “proof by example,” complete with much arm waving, to show that today’s mass market consumers want data.
On the demand side, let’s start with something humans have been doing ever since they came down from the trees: running.
With a small sensor embedded in the shoe (not the only way these days), Nike+ collects detailed information about runners and simply cannot give enough data back to its customers. In terms of this specific success as evidence of general data product demand, Nike+ users have logged over 2 billion miles as of 1/29/2013.
As further evidence of mass market data desire, 23andMe has convinced nearly a quarter million people to spit into a little plastic cup, seal it up, mail it off, and get their DNA genotyped. 23andMe then gives the data back to the user in the form of a genetic profile, complete with relative genetic disease risks and clear, detailed explanations of those numbers.
And finally, there is Google Maps, or GPS navigation in general, which merges complex GIS data with sophisticated algorithms to compute optimal routes and estimated times of arrival. Who doesn’t use this data product?
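Under the hood of any routing product is some variant of a shortest-path search. Here is a minimal sketch using Dijkstra’s algorithm over a tiny, invented road graph with edge weights in minutes; real mapping systems use vastly larger graphs, live traffic data, and heavier heuristics.

```python
import heapq

# Dijkstra's shortest-path search: the algorithmic core of "what is my
# fastest route and ETA?". The road graph and travel times (minutes)
# below are invented for illustration.

def shortest_time(graph, start, goal):
    queue = [(0.0, start)]          # (cost so far, node), a min-heap
    best = {start: 0.0}
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue                # stale queue entry; skip it
        for neighbor, minutes in graph[node]:
            new_cost = cost + minutes
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return float("inf")             # goal unreachable

roads = {
    "home": [("highway", 5), ("backroad", 2)],
    "highway": [("office", 10)],
    "backroad": [("highway", 4), ("office", 15)],
    "office": [],
}
print(shortest_time(roads, "home", "office"))  # 15, via home -> highway -> office
```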
In closing, the case for overwhelming data product demand is strong :::insert waving arms::: and made stronger by the fact that our very language has become sprinkled with quasi-stat/math terms. Who would ever have thought that pre-teens would talk about something “trending”?
Let’s talk about the supply side of the equation now, starting with the tools required to explore data.
Then: Everyone’s “favorite” old-school tool, Excel, costs a few hundred dollars depending on many factors.
Now: Google Docs offers a spreadsheet where 100 of your closest friends can simultaneously edit your data while you watch in real time.
And the cost, FREE.
Let’s take a step past spreadsheets and rapidly prototype some custom algorithms using Matlab (Yes, some would do it in C but I would argue that most can do it faster in Matlab). The only problem here is that Matlab ain’t cheap. Beware when a login is required to get even individual license pricing.
Now, you have Python and a million different modules to support your data diving and scientific needs. Or, for the really adventurous, you can jump to the forward-looking, wickedly fast, big-data-ready Julia. If a scientific/numeric programming language can be sexy, it would be Julia.
And the cost, FREE.
Let’s just say you want to work with data frames and some hardcore statistical analyses. For a number of years, you have had SAS, Stata, and SPSS, but these tools come at an extremely high cost. Now, you have R. And it’s FREE.
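As a taste of what “free” buys you, even Python’s standard library can handle the quick summary-and-correlation pass you might once have launched SPSS for. The data below is made up for illustration.

```python
import statistics

# A quick, free statistical pass using only Python's standard library.
# The ad-spend and revenue figures are invented sample data.
ad_spend = [10, 20, 30, 40, 50]
revenue = [12, 24, 33, 41, 55]

mean_x, mean_y = statistics.mean(ad_spend), statistics.mean(revenue)
sx, sy = statistics.stdev(ad_spend), statistics.stdev(revenue)

# Pearson correlation, computed by hand from the summary statistics.
r = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, revenue)) / (
    (len(ad_spend) - 1) * sx * sy
)
print(round(mean_y, 1), round(r, 3))  # 33 0.996 — a near-perfect linear relationship
```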
Yes, an amazing set of robust and flexible tools for exploring data and prototyping data products can now be had for the low, low price of free, which is a radical departure from the days of yore.
Now that you have an amazing result from using your free tools, it is time to tell the world.
Back in the day (think Copernicus and Galileo), you would write a letter containing your amazing results (your data product), which would then take a few months to reach a colleague (your market). This was not a scalable infrastructure.
Contemporary researchers push their findings out through the twisted world of peer-reviewed publications … where the content producers (researchers) often have to pay to get published while someone else makes money off of the work. Curious. More troubling is the fact that these articles are static.
Now, if you want to reach a global audience, you can pick up a CMS like WordPress or a web framework such as Rails or Django and build an interactive application. Oh yeah, these tools are free.
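To illustrate just how low the barrier is, here is a hypothetical sketch that serves a small JSON “data product” over HTTP using nothing but Python’s standard library; think of it as a stand-in for the fuller Rails or Django applications mentioned above. The endpoint path and payload are invented.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A minimal "data product" server: one GET endpoint returning JSON.
# The miles-logged data below is invented sample data.
MILES = {"alice": 9.3, "bob": 13.1}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(MILES).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral port and serve on a background thread.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/miles") as resp:
    payload = resp.read().decode()
print(payload)
server.shutdown()
```

A real deployment would of course add routing, templates, and persistence, which is exactly what the free frameworks above provide.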
So the tools are free, and now the question of infrastructure must be addressed. But before we hit infrastructure, I need to at least mention that overused buzzword, “big data.”
In terms of data products, “big data” is interesting for at least the simple reason that having more data increases the odds of having something valuable to at least someone.
Think of it this way, if Google only indexed a handful of pages, “Google” would never have become the verb that it is today.
If you noticed the pattern of tools getting cheaper, we see the exact same trend with data stores. Whether your choice is relational or NoSQL, big or little data, you can have your pick for FREE.
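For example, SQLite, a perfectly capable relational store, ships free inside Python’s standard library. The table and rows below are invented sample data.

```python
import sqlite3

# A free relational data store, zero installation required: SQLite,
# bundled with Python. Table and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (user TEXT, miles REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("alice", 3.1), ("alice", 6.2), ("bob", 13.1)],
)

# Aggregate miles per user, Nike+-style.
results = [
    (user, round(total, 1))
    for user, total in conn.execute(
        "SELECT user, SUM(miles) FROM runs GROUP BY user ORDER BY user"
    )
]
print(results)  # [('alice', 9.3), ('bob', 13.1)]
```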
With data stores available for the low cost of nothing, we need actual computers to run everything. Traditionally, one bought servers that cost an arm and a leg, to say nothing of siting requirements, maintenance, and other costs. Now, Amazon’s EC2 and Google Compute Engine allow you to spin up a cluster of 100 instances in a few minutes. Even better, with Heroku, sitting on top of Amazon, you can stand up any number of different data stores in minutes.
Why should you be excited? Because the entire tool set and the infrastructure required to build and offer world-changing data products is now either free or incredibly low cost.
Let me put it another way. Imagine if Ford started giving away car factories, complete with all required car parts, to anyone with the time to make cars!
Luckily, there are such individuals who will put this free factory to work. These “data scientists” understand the entire data science stack or pipeline. They can, by themselves, take raw data all the way to a product ready to be consumed globally (or at least build a pretty impressive prototype). While these individuals are relatively rare now, this will change. Such an opportunity will draw a flood of individuals, and that rate will only increase as the tools become simpler to use.
Let’s make the excitement a bit more personal and go back to that company with a lovable logo, Hipmunk.
If I remember the story correctly, two guys at the end of 2010 taught themselves Ruby On Rails and built what would become the Hipmunk we know and love today.
Learned to Code.
And, by the way, Hipmunk has $20.2 million in funding 2 years later!
It is a great time to work with data.