Learning from Leading Big Data Experts about the Technologies of Big Data
What companies are doing today with Big Data is limited not only by the acute shortage
of analytical manpower – that rare beast now known as the “data scientist” – but also
by the capabilities of today’s technologies for collecting, organizing, making
sense of, processing, and presenting digital data in its many forms.
Nonetheless, billions of dollars in venture capital have been flowing to Big Data start-ups. That funding can be found in nascent companies like Cloudera, Palantir, Mu Sigma, Opera Solutions, VoltDB, and 10gen. A number of large, established technology companies have also turned the spending spigot on Big Data wide open: IBM, Oracle, SAS, and SAP, to name just a few. Companies like these are investing deeply in Big Data technologies. Their initiatives are aimed at enabling their existing software and hardware to take on the industrial-strength duties of Big Data and analytics, and at creating new software and hardware.
Thus, the technology of Big Data is evolving rapidly. To get some insights into what the
technology makes possible today and what it may make possible in the near future, TCS
interviewed two leading pioneers of Big Data technologies: Joseph Hellerstein of the
University of California at Berkeley, and VS Subrahmanian of the University of Maryland.
Here are the highlights of those discussions.
“We’re in the Early Days of Big Data – Like the Early 1900s’ Era Before Washing Machines.”
Joseph Hellerstein, Chancellor’s Professor of Computer Science, UC Berkeley, EECS Computer Science Division
Joseph Hellerstein likens the current state of Big Data to the early 1900s, before the advent of the washing machine. (The first electric washing machines began appearing in the first decade of that century.) Back then, women spent an average of 60 hours a week manually washing clothes.
Cleansing Big Data is in a similar state, Hellerstein believes. He and several colleagues interviewed 35 analysts in companies across industries. The analysts said they spent 60% to 80% of their time on data preparation. “We’re getting data from all over the place and it’s not prepared for analysis or to be integrated with other data and analysis tools,” he says. “The tools available are not designed for analysts.”
Hellerstein sees a big opportunity in bringing data cleansing into the modern-day
equivalent of the electric washing machine. He is founder and CEO of a data analysis
tools start-up called Trifacta.
“The Amount of Unstructured Data you will Need to have will be Vastly Larger than your Structured Data.”
VS Subrahmanian, Professor of Computer Science and Director, Center for Digital International Government, University of Maryland
Subrahmanian has done extensive research and developed technology in databases, artificial intelligence, and optimization methods to track and forecast the behavior of terrorist groups and socio-cultural groups, as well as in health care and other areas. Much of this data is unstructured. He believes such unstructured data will be equally important to business – in fact, more important in the future than structured data. “The amount of unstructured data you will need to have will be vastly larger than your structured data,” he says.
To make good decisions, managers will need both unstructured and structured data. But the problem today is that software makes sense of unstructured data such as text far less accurately than it does structured data such as point-of-sale information. “People shoot for about 80% accuracy in text analytics. To go from 80% to 90% is a very steep curve,” he says. “And you can spend a lot of time and money in trying to get there, but you might not.”
Despite that, he believes the ability to make predictions from text with 80%
accuracy is very good – especially if such data comes from several sources about the
same phenomenon. For example, if a company knows a prospect likes jogging with
80% certainty, biking with 80% certainty, and soccer with 70% certainty, it can be pretty
confident that the person likes sports. “That person is a good prospect for Nike,” he says.
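The arithmetic behind that confidence can be sketched in a few lines of Python. This is an illustration only, not a method described by Subrahmanian: it assumes the three signals are statistically independent (which real text-analytics outputs may not be), and the function name `prob_at_least_one` is hypothetical.

```python
from math import prod


def prob_at_least_one(confidences):
    """Chance that at least one signal is right, ASSUMING the signals are independent."""
    if not all(0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("each confidence must be between 0 and 1")
    # P(at least one is right) = 1 - P(all are wrong)
    return 1.0 - prod(1.0 - c for c in confidences)


# Confidence levels taken from the example in the text.
signals = {"jogging": 0.80, "biking": 0.80, "soccer": 0.70}
likes_sports = prob_at_least_one(signals.values())
print(f"{likes_sports:.3f}")  # prints 0.988
```

Under the independence assumption, three individually imperfect signals combine into roughly 99% confidence that the prospect likes at least one sport. If the signals are correlated, for example because they come from the same source, the combined confidence would be lower.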
In the next 2-3 years, Subrahmanian sees companies having technology for measuring
the effectiveness of social media: software that will tie it “somewhat
closely to ROI.” He says search engines already have good technology to discern the
impact of advertising on their own revenue. “They are less forthcoming in providing
companies with data on the impact of search engine advertising on their revenue.”
Nonetheless, he predicts this will change over the next two years. “By then, you will
have the ability to see the impact of search engine and social media marketing on your