One way that Big Data experts such as Tom Davenport distinguish between the eras of ‘big’ and ‘little’ data is by the type of data companies use. Big Data is more closely associated with unstructured and external data. But what does this mean? While there are many ways to classify such data, the two most common are:
- The degree to which the data is ‘structured’. Data that is numerical (financial, order, and other data) is regarded as structured – it fits neatly into the columns and rows of database management software. ‘Unstructured’ data cannot so easily be compiled into such database formats. This data could be digital video, text (increasingly coming from comments on social media sites such as Twitter, Facebook and LinkedIn), digitized audio and other types. To analyze this data, the technology needs to process it in some manner. (‘Sentiment analysis’ is a hot trend in how to treat social media data – e.g., determining people’s sentiments about a company and its products and practices.)
- Whether the data is ‘internal’ or ‘external’. Is the data generated by the company or brought in from outside? For example, an increasing number of companies (particularly retailers and restaurant chains) are seeking external data from telecommunications firms that can track customers’ locations through their mobile devices. The value of this data to retailers is the ability to reach potential customers who are in the vicinity of their stores with targeted marketing offers that may convince them to walk in.
Defining Types and Sources of Digital Data
In our research, we defined data along two dimensions: structured versus unstructured and internal versus external. Given below are the definitions we used.
On the dimension of data structure:
- Structured – Data that resides in fixed fields (for example, data in relational databases or in spreadsheets)
- Unstructured – Data that does not reside in fixed fields (for example, free-form text from articles, email messages, untagged audio and video data, etc.)
- Semi-structured – Data that does not reside in fixed fields but uses tags or other markers to capture elements of the data (for example, XML, HTML-tagged text)
On the dimension of data source:
- Internal – from a company’s sales, customer service, manufacturing, and employee records; from visits to the company’s website, etc.
- External – from sources outside a company such as third-party data providers, public social media sites such as Facebook, Twitter and Google+, etc.
Classifying Big Data along these two dimensions, we then wanted to know how much of companies’ data was structured versus unstructured, as well as how much was generated internally versus externally. We were surprised by the combined results across all four regions of the world that we surveyed:
- 51% of data is structured
- 27% of data is unstructured
- 21% of data is semi-structured
A much higher than anticipated percentage of data was not structured – either unstructured or ‘semi-structured’ (when combined, 48%, or about half). (See Exhibit II-7)
Exhibit II-7: Percentage of Data that is Structured versus Unstructured
Q8: Mean Estimated Percentage of Structured, Unstructured and Semi-Structured data, across all of the Company’s Big Data Initiatives
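The "about half" figure follows directly from the three reported means. The short check below is only an illustrative sketch: the variable names are ours, and the three values are the rounded percentages quoted above, which is presumably why they add up to 99 rather than 100.

```python
# Mean shares reported in the survey (percent of data, by structure type).
shares = {"structured": 51, "unstructured": 27, "semi-structured": 21}

# Everything that is not fully structured.
not_structured = shares["unstructured"] + shares["semi-structured"]

# The reported values are rounded, so they need not sum to exactly 100.
total = sum(shares.values())

print(f"Not structured: {not_structured}%")
print(f"Sum of reported shares: {total}%")
```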
And a little less than a quarter of the data was external. (See Exhibit II-8)
Exhibit II-8: Percentage of Data That is Internal versus External
Q9: Mean Estimated Percentage of Data that comes from Internal or External sources, across all of the Company’s Big Data Initiatives
North American companies had the highest percentage of structured data; Asia-Pacific companies had the most unstructured data. North American companies also had the highest percentage of internal data; Asia-Pacific companies had the lowest.
To discover new patterns in Big Data, companies need highly efficient ways to aggregate data across data warehouses and other data stores. Since most data in these stores is structured, it is far easier for analysts to explore it. It is also not difficult to create structured data out of semi-structured data such as web activity.
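To illustrate why semi-structured data is comparatively easy to turn into structured data, here is a minimal sketch in Python. The XML snippet, its tag and attribute names, and the `visits_to_rows` helper are all hypothetical; they stand in for whatever web-activity format a company actually records.

```python
import xml.etree.ElementTree as ET

# Hypothetical web-activity log: tags and attributes mark up each element,
# but the data does not yet sit in the fixed fields of a table.
SAMPLE_LOG = """
<visits>
  <visit user="u1" page="/home" seconds="12"/>
  <visit user="u2" page="/products" seconds="47"/>
</visits>
"""

def visits_to_rows(xml_text):
    """Flatten each <visit> element into a row with fixed fields."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for visit in root.iter("visit"):
        rows.append({
            "user": visit.get("user"),
            "page": visit.get("page"),
            "seconds": int(visit.get("seconds")),
        })
    return rows

for row in visits_to_rows(SAMPLE_LOG):
    print(row)
```

Because the tags already name each element, the conversion is a simple mapping from markers to columns. No interpretation of meaning is needed, which is what separates this case from free-form text.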
However, unstructured data (for example, free-form text, video, audio, and image data where context needs to be derived from the data) is hard to analyze. The most sought-after unstructured data right now is text, because natural language processing (NLP) can be used to derive context that goes beyond typical sentiment analysis. Some text data (particularly Twitter tweets) is in fact closer to semi-structured: hashtags give some sense of context, while mentions, retweets, and @’s provide references to people. Facebook posts, blog posts, and other free-form text are more difficult to analyze, as noted above. However, tags and other metadata can help narrow down the context of a comment.
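The point about tweets can be sketched in a few lines of Python. This is a simplified illustration, not a production parser: the regular expressions, the `extract_markers` helper, and the sample tweet are our own assumptions, and the patterns would also match things like email addresses.

```python
import re

# Simplified patterns: a marker character followed by word characters.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def extract_markers(text):
    """Pull out the markers that make a tweet semi-structured."""
    return {
        "hashtags": HASHTAG.findall(text),       # hints at topic or context
        "mentions": MENTION.findall(text),       # references to people
        "is_retweet": text.startswith("RT @"),   # conventional retweet prefix
    }

# Hypothetical tweet, invented for illustration.
sample = "RT @acme_support: Loving the new #AcmePhone camera! #happy"
print(extract_markers(sample))
```

The markers come out with simple pattern matching, while the remaining free text ("Loving the new ... camera!") would still need NLP to interpret.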
In the interviews that our research team conducted, many executives said their companies’ usage of unstructured data is not only increasing but is also becoming essential. “Studies have been done on electronic records that show, on average, 80%-90% or more of data in records is unstructured data,” one health care executive said. “That requires natural language processing to extract information.” He said much of the health care industry is trying to improve the capture and analysis of unstructured data such as images, emails, and physicians’ and nurses’ notes.
Companies are increasingly looking to external data to get a fuller picture of activities that might affect them – particularly customer behavior. The soaring use of mobile devices now provides companies with data that, at least in theory, can help them track customer movements. This kind of external data is fully on the radar of global companies.
The head of Ford Motor Company’s analytics group, John Ginder, put it this way to one trade magazine: “We recognize that the volumes of data we generate internally … as well as the universe of data that our customers live in and that exists on the Internet … are huge opportunities for us that will likely require some new specialized techniques or platforms to manage.” Internet data that consumers provide appears to be of big interest. “The fundamental assumption of Big Data is the amount of that data is only going to grow and there’s an opportunity for us to combine that external data with our own internal data in new ways. For better forecasting or better insights into product design, there are many, many opportunities.”
Who is Selling Their Big (Digital) Data?
Companies are capturing far more digital data today to understand their operational performance moment by moment, the behavior of customers and suppliers, and other vital signs of the business. This has raised both opportunity and concern. Executives are looking for data their organization holds that might be of value to another organization, and from which the firm might be able to profit. That’s the opportunity side.
In 2012, about one-quarter of the companies we surveyed (27%) were capitalizing on this opportunity by selling their digital data. U.S. companies were the least likely to do so, with only 22% selling such data. In contrast, half the Asia-Pacific companies we polled said they sell their digital data. About one-quarter of European and Latin American companies sold their digital data in 2012. (See Exhibit II-9)
Exhibit II-9: Who’s Selling Their Digital Data?
Q10: Percentage of Companies that Sell their Digital Data?
For the approximately one-quarter of companies that sell their digital data, how lucrative is it? Our survey found that the annual revenue from selling such data was not trivial. In 2012, selling digital data contributed an average of $21.6 million to the revenue of these companies. (See Exhibit II-10)
Exhibit II-10: How Much Money are Companies Generating From the Data They Sell?
Q13-a: Mean Annual Revenue Per Company in 2012 from Selling Digital Data
So clearly, some companies are profiting from their data, albeit a distinct minority today. However, of the 73% of companies that did not sell such data, 22% said they plan to do so by 2015; 55% said they do not; and 23% did not know. If those plans hold, about 43% of companies will be selling their digital data by 2015: the 27% that already do, plus 22% of the 73% that do not, which is roughly 16% of all companies.
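The 43% projection combines two different bases, so it is worth spelling out. The sketch below uses only the survey percentages quoted above; the variable names are ours, and the result assumes every company that said it plans to sell actually does so.

```python
# Shares of all surveyed companies in 2012.
sell_today = 0.27
do_not_sell_today = 0.73

# Share of the NON-sellers (not of all companies) that plan to sell by 2015.
plan_to_sell_by_2015 = 0.22

# Convert the planners to a share of all companies before adding.
new_sellers = do_not_sell_today * plan_to_sell_by_2015
projected = sell_today + new_sellers

print(f"New sellers as a share of all companies: {new_sellers:.1%}")
print(f"Projected share selling data by 2015: {projected:.1%}")
```

Adding 27% and 22% directly would overstate the result at 49%, because the 22% applies only to the 73% of companies that do not yet sell their data.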