A Better Definition of Big Data

As I have noted, Big Data is called that because of the problems it causes – it’s too big for this or that kind of processing or usefulness. (Ever tried looking through a million rows of data for something interesting?) The conventional definition of big data is slightly tautological: data that is big (or “too big”) in volume, velocity or variety for the current generation of hands-on data tools.

The kinds of data driving this big data wave are, as I hinted previously, high-granularity data from systems and sensors recording the behavior of (and inside) environments, people, markets and machines. This is in contrast to previous generations of data processing, which were designed to record and analyze the results of behavior, such as events and transactions, but not the behavior itself. In this era of big data, we have reached a new limit – something not usually considered a limit at all. That limit is Moore’s law.

Moore’s law is the observation that the number of transistors on a chip, and with it roughly the computer power available for a given price, doubles approximately every two years. For a long time, Moore’s law ensured that regular general-purpose computers improved fast enough to keep the databases humming, with no need for (more expensive) special-purpose products.

Then something happened. IT departments in large corporations started experiencing problems that outstripped the general improvements in computer power being delivered by Moore’s law. New technologies became commercial successes because these IT departments started to spend money on solving such problems. In 1996 TimesTen spun out of HP with a new approach to solve one relational database performance problem and became very successful in its particular corner of the market. Just as TimesTen was becoming successful, Boston-based Netezza launched its own very successful product that was, effectively, a special purpose computer for another key part of the relational database market.

In retrospect, these commercial successes heralded the start of the big data era. They illustrated, with the power of the market (including a great IPO for Netezza, and later acquisitions of both), that data growth was outstripping Moore’s law and new approaches were in demand. Both these companies relied on new arrangements of hardware (with proprietary software) to achieve new levels of performance. However, the big data wave quickly swung round to clever new software designs and algorithms, and clever new ways to parcel out problems to lots of regular computers working in parallel. These new software approaches are just as powerful as new hardware in coping with the big data that is outstripping Moore’s law. They include column stores (e.g., Vertica), Google’s famous MapReduce approach (implemented in open source by, e.g., Hadoop) and now many more.

All this leads to my definition of big data:

data with velocity, volume and/or variety growing faster than Moore’s law
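
To make the comparison in this definition concrete, here is a rough sketch in Python. It assumes Moore’s law means doubling every two years, as described above, and turns that into a yearly growth factor to compare against a data growth rate. The function names and the growth figures used in the note after the code are hypothetical, chosen only for illustration; they are not measurements.

```python
# Illustrative sketch only: names and example growth figures are hypothetical.

MOORE_DOUBLING_YEARS = 2.0  # "doubles approximately every two years"


def annual_growth_factor(doubling_years: float) -> float:
    """Yearly multiplier implied by a given doubling period."""
    return 2.0 ** (1.0 / doubling_years)


def outgrows_moore(data_annual_growth: float) -> bool:
    """True if the data grows faster per year than Moore's law delivers."""
    return data_annual_growth > annual_growth_factor(MOORE_DOUBLING_YEARS)


def capacity_gap(data_annual_growth: float, years: int) -> float:
    """Factor by which data outpaces (or trails) compute after `years`."""
    moore = annual_growth_factor(MOORE_DOUBLING_YEARS)
    return (data_annual_growth / moore) ** years
```

Doubling every two years works out to roughly 41% growth per year. Under this sketch, a data set growing 60% a year (a factor of 1.6) would count as big data by the definition above, while one growing 20% a year would not. The gap compounds: data that doubles every year gets twice as far ahead of the hardware every two years.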

As a footnote, the recent announcements I covered in May, headlined by Intel’s massive grant to MIT for big data research, bring this full circle. Gordon Moore made the observation that became Moore’s law just a few years before co-founding Intel.

1 comment:

Doug Laney said...

Cool idea Richard. Since I first articulated the 3Vs over 11 years ago in a Gartner (then META) research note (link below), it's been great to see them catch on. Similar to you we realized there's more to a practicable defn of Big Data than just growth in volume-velocity-variety, so our recently updated defn acknowledges that: "Big Data are high-volume, high-velocity and high-variety information assets that require new innovative forms of processing for enhanced decision making, insight discovery and process optimization." A mouthful, but it recognizes that merely scaling won't handle it and that there are distinct purposes for it. Still, I like the tightness of your definition! --Doug Laney, VP Research, Gartner, @doug_laney

Original "3Vs" piece from 2001: