Dr. Gautam Shroff
VP & Chief Scientist, TCS Innovation Labs
The term ‘big data’ has generated a lot of attention in the past eighteen months or so, to the point that it has been overused and oversold at times – clearly a candidate for the peak of the ‘hype cycle’. Many people ask me whether their data is ‘big enough’ to qualify as ‘big data’. Are petabytes a must, or will terabytes or even gigabytes qualify?
I tell them that this is the wrong question to ask. Two different examples serve to illustrate. First, consider basic census data about all 7 billion people; is this ‘big’? Well, with minimal effort it will fit in memory on most high-end servers. So is it big? No? Well, try loading it into a traditional database – I bet it takes more than a day merely to get it in. Oh, so it is big after all … Well, not so fast. A simple C program that scans all this data and computes, say, the median age for each gender runs in minutes. So it’s not big after all?
Second example: Think of a few hundred individuals along with a small sample of their genetic information, which might be a few hundred thousand features per person. Big? Not in size – a few dozen megabytes at best. But try to slice and dice this data using a traditional OLAP tool. Many lifetimes are not enough to view all slices.
Lessons? First, traditional technology makes small amounts of data appear big for no reason. So new technology is needed. Second, even small data sets that are ‘wide’ appear big when it comes to analysis. So statistics, machine learning and data mining must be used rather than traditional slice and dice.
Big data is about counting, not queries. And having ‘wide’ data, rather than merely lots of data, can also make for a ‘big data’ problem.