Big Data, “Mass” Data

Just a few days after I started my Big Data blog series, yesterday was Big Data in Massachusetts day. I was fortunate to be in the crowd at the Stata Center when MIT, Intel and the Commonwealth of Massachusetts launched a whole series of Big Data initiatives in Boston and Massachusetts. The MassTLC has a nice summary blog post here.

As a region, we are staking a claim to be the place for big data innovation. I think this is a well-founded claim and I think Boston will rise to the opportunity.

My first big data post talked about the rise of behavioral data as the driver of big data. I also alluded to systems being observed in finer detail, and instrumented in real time. A broader look at this deluge includes sources that are not necessarily based on behavior, although finer granularity does make system behavior much easier to see. One example is data generated by genetic sequencing and other life science research such as high-throughput screening. Another example is medical imaging, where image file sizes are now huge because of resolution improvements and the massive number of frames (rather than a single X-ray) per study. This crosses over into behavioral data of living systems when you think of those hi-res videos of a beating heart, or a digital “movie” of radiology-guided surgery. In another industry, I was told by someone working in oil and gas seismology that similar digital imaging technology is used on drill cores, and that each cubic centimeter produces hundreds of gigabytes of image data. A drill core is 45 meters long, and apparently the total amount of data for a single core can reach up to an exabyte. Talk about big data!
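For the curious, here is a rough back-of-envelope check of that drill core figure. Only the 45 meter length and the "hundreds of gigabytes per cubic centimeter" come from what I was told; the core diameters, the specific per-centimeter values, and the function name `core_data_bytes` are my own assumptions for illustration, so treat this as a sketch rather than a real measurement.

```python
import math

def core_data_bytes(length_m, diameter_cm, gb_per_cm3):
    """Estimate total image data (in bytes) for a cylindrical drill core."""
    length_cm = length_m * 100
    # Volume of a cylinder: pi * r^2 * length
    volume_cm3 = math.pi * (diameter_cm / 2) ** 2 * length_cm
    # Decimal gigabytes: 1 GB = 1e9 bytes
    return volume_cm3 * gb_per_cm3 * 1e9

# 45 m is the length quoted in the post.
# The diameters and GB-per-cm^3 values below are assumed, not sourced.
for diameter_cm in (5, 10, 15):
    for gb_per_cm3 in (100, 500, 900):
        total = core_data_bytes(45, diameter_cm, gb_per_cm3)
        print(f"{diameter_cm} cm core, {gb_per_cm3} GB/cm^3: {total / 1e18:.3f} EB")
```

Under these assumed values the totals run from several petabytes to several hundred petabytes, so an exabyte looks plausible only near the upper end of the range. Either way, it is a lot of data for one core.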

On a final note, I received an email from my brother-in-law who said he read the blog post yesterday and had never heard of big data before. He went on:

I didn't understand what you meant when you wrote “expect to read or hear about it 3 more times in the next few days”, but as I write this I guess you're referring to the predictive ability of big data. Anyway, having never heard the term I was just reading up on <another company>, when bingo came across “big data overview”.

What is Big Data?

This is the first post of a series on Big Data. Watch for more!

The success of information technology until now has been built on our ability to comprehensively record transactions, events, or changes of state. We have made great use of this transactional data, optimizing inventory, streamlining processes, automating activity. Now, however, we can track and record the behavior that leads up to, or follows from, these transactions. People use computers, phones and the internet to do more and more. Each click and call is recorded and makes up a web of behavioral patterns. Computers are used for designing, making, selling, buying, trading – each step along the way is recorded and adds to that web. Markets are now computerized and each bid, each offer, each trade is recorded. Each item of news about a market, a company, a financial asset is also recorded and cross-correlated to the market activity. Systems are observed in finer detail, and instrumented in real time, not just when a transaction occurs or a state changes. By analyzing all this behavior we hope to be able to diagnose, to predict, to intervene; we hope to sell more, or price better, to make more efficiently, to diagnose disease and design treatments. We want this behavioral data because it promises to unlock value commensurate with its volume, velocity and variety, and this behavioral data is, um, big.

This data is so big, in fact, that it is causing problems in the technology world. That’s why it has this name: big data. You might not have heard this term until now, but now that you have read it here, expect to read or hear about it three more times in the next few days. What exactly is big data? All the definitions seem to rest on the notion that the problems of size are what make it noteworthy. Wikipedia offers “In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.”

This doesn’t tell us why we have so much data or really why we should care. That is why I started this series on Big Data with this observation: we used to track transactions, and now we are tracking behavior.

Make no mistake, Big Data is about behavior – of people, systems, markets and machines.

At some stage, technology companies will solve the problems that make this data hard to ingest, handle, process, analyze, understand. It will no longer be big data, because it won’t be too big to manage.

However, without any doubt, behavioral data is here to stay!