Toward A Better Definition of “Big Data”

While many have tried, the term “big data” lacks a true consensus definition. At the moment, the most popular definitions coalesce around the idea that big data is one or more data sets so large and complex that they are challenging to process using traditional databases and tools. Often associated with this concept are the characteristic “three V’s” of big data: volume (the amount of data), velocity (the speed of data in and out), and variety (the range of data types and sources). Some enterprising companies and consultants throw in a fourth “V” for veracity, or some other “V” word.

Regardless, these definitions miss a key aspect of the term. To put it into hyperbolic language, “Big Data” isn’t about the size of data at all. Instead, it is the simple yet seemingly revolutionary belief that data is valuable.

While “big data” does often happen to be large in size (although this is always relative to the available tool set), I believe that “big” actually means important (think “big deal”). Scientists have long known that data could create new knowledge, but now the rest of the world, government and management in particular, has realized that data can create value: principally financial, but also environmental and social. And if data is valuable, then more data is more valuable, and who doesn’t want “big” (i.e., large) value?

Thoughts? Tomatoes to throw? Take aim below in the comments section.


Sean Murphy

Senior Scientist and Data Science Consultant at JHU
Sean Patrick Murphy holds degrees in math, electrical engineering, and biomedical engineering, as well as an MBA from Oxford. He has served as a senior scientist at Johns Hopkins University for over a decade, advises several startups, and provides learning analytics consulting for EverFi. Previously, he served as the Chief Data Scientist at a Series A-funded health care analytics firm and as the Director of Research at a boutique graduate educational company. He has also cofounded a big data startup and Data Community DC, a 2,000-member organization of data professionals. Find him on LinkedIn and Twitter.
  • Benjamin Bengfort (http://twitter.com/bbengfort)

    I think that large as a definition of “big” should not be devalued in this discussion, but relegated to relativity: a second is large when you’re talking about GHz, and so too are small data volumes when you’re talking about scarce data.

    Largeness is related to capturing the long tail in the frequency of events, and that’s where the most interesting stuff is. In machine learning methodologies we need “lots” (see: large) of data to have enough examples to make probabilistic assumptions about the world.

    Otherwise, we have to create our own volume, usually through smoothing (see the sketch below).

    Because we can do these things — it is a big deal.
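
    A minimal sketch in Python of the smoothing idea described above; the word counts and vocabulary size are invented purely for illustration:

    ```python
    # Add-one (Laplace) smoothing: manufacture a little "volume" for
    # events we have never observed, so their estimated probability
    # is small but nonzero instead of zero.
    from collections import Counter

    def smoothed_probability(word, counts, vocab_size, alpha=1.0):
        """Estimate P(word) with add-alpha smoothing."""
        total = sum(counts.values())
        return (counts[word] + alpha) / (total + alpha * vocab_size)

    counts = Counter({"data": 7, "big": 2, "value": 1})
    vocab_size = 10  # pretend 10-word vocabulary; 7 words never seen

    print(smoothed_probability("data", counts, vocab_size))      # 0.40
    print(smoothed_probability("velocity", counts, vocab_size))  # 0.05, not 0
    ```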

  • Tony Ojeda

    If you use the term Big Data to describe the importance/value of data, then what would you use to describe data sets that are large enough to pose an engineering problem? I think the reason Big Data is such a nebulous term is because “big” can be interpreted in many different ways. Perhaps the answer is simply to use more descriptive words.

  • Jim Hancock

    Though I like the premise of this post, I think it has been known for a long time that data is important. In econometrics, for example, economists dream up models and try to determine whether the data corroborates their theories, and have done so for decades. Most of this data is government data, though, extracted from surveys, businesses, etc., and available to everyone.

    Big data, though, is business data, and it is accumulated from direct human interaction (i.e., Google, Facebook, Netflix, Amazon, etc.), so it is much more granular. This adds many more possible dimensions to models and removes statistical issues such as heteroscedasticity (though it adds significant complexity).

    This should allow certain businesses to reach a tipping point where their data becomes a distinct asset providing them with a large competitive advantage beyond their core business model, whether it be selling books, ads or TV shows. They will have a better view of the world than their competitors.

    Anyway, I don’t work in this space, so it’s just my 2 cents! Interesting post …thanks!

    • dominick

      Yes, a lamentable name. Some time ago, I read a long article by the folks in the private research wing of Microsoft that used the words “intensive data,” but they were really talking about everything we consider part of the “Big Data” phenomenon today. The article dated from about 2006, probably before the term “Big Data” was in wide usage.

      I liked it because it doesn’t get stuck in the relativity of size to the available tool set, i.e., today’s BIG data could be very small tomorrow. “Intensive data” also works well as an adjective: data-intensive businesses, data-intensive science, data-intensive medicine.

      I like it. But I’ll bet the crowd at McKinsey didn’t.

      Thanks for an interesting post.

      dominick

      • datacommunitydc

        Love the feedback, and I completely agree that there are some circles, such as academia and research, that have long recognized the intrinsic value of data. I believe the key change that “big data” represents is that the value of data has now been generally recognized not only by the masses, but also by the decision makers and those in power. The critical expectation now is that data will be used to make things better, just as organizations must track money through proper accounting techniques. Data is becoming an intrinsic, dare I say “big,” part of everyone’s playbook.