A few recent articles have expressed concerns that big data may be at the peak of inflated expectations in the so-called hype cycle for emerging technologies, and will soon start falling into the trough of disillusionment. This is not uncommon in the early stages of a disruptive technology. The key question is whether the technology will keep falling through the trough and be soon forgotten, or whether it will eventually move on toward the slope of enlightenment on its way to a long life in the plateau of productivity. How can you tell which it is going to be?
In my experience, a disruptive technology will succeed if it can keep attracting serious researchers and analysts, who will, over time, cut through the hype and bring discipline to its development and marketing, coming up with solutions to the many technical obstacles any new innovation encounters, sorting through its unrealistic promises and reframing the scope and timelines of its objectives. The Internet recovered from the hype that led to the dot-com bubble and has gone on to a highly successful future. Cloud computing is now going through a similar period of serious evaluation and development. So is big data.
In The Rise of Big Data: How It's Changing the Way We Think About the World, an article just published in Foreign Affairs, Economist editor Kenneth Cukier and Oxford professor Viktor Mayer-Schönberger do a very nice job of articulating why “big data marks the moment when the information society finally fulfills the promise implied by its name.” The article is adapted from their book Big Data: A Revolution That Will Transform How We Live, Work, and Think, published in March 2013.
Cukier and Mayer-Schönberger explain that big data has risen rapidly to the center stage position it now occupies for the simple reason that there is so much more digital information now floating around than ever before. In 2000, only one-quarter of the world’s stored information was digital and therefore subject to search and analysis. Since then, the amount of digital data has been doubling roughly every three years, so by now only two percent of all stored information is not digital.
Big data could not possibly have come into being without the digital revolution, which, thanks to Moore’s Law, has made it possible to drastically lower the costs of storing and analyzing those oceans of information. The Web has also made it much easier to collect data, as has the explosive growth of mobile devices and smart sensors. “But, at its heart,” the authors write, “big data is only the latest step in humanity’s quest to understand and quantify the world.” Datafication is the term they use to describe the ability to now capture as data many aspects of the world that have never been quantified before.
I totally agree with their view that big data should not only be framed as part of the digital and Internet revolution of the past few decades, but also as part of the scientific revolution of the past few centuries. At the 2013 MIT Sloan CIO Symposium this past May, MIT professor Erik Brynjolfsson made a similar point at the panel he moderated on The Reality of Big Data when he observed that throughout history, new tools beget revolutions.
Scientific revolutions are launched when new tools make possible all kinds of new measurements and observations. Early in the 17th century, Galileo made major improvements to the recently invented telescope, which enabled him to make discoveries that radically changed our whole view of the universe. Over the centuries we’ve seen that new tools, measurements and discoveries precede major scientific breakthroughs in physics, chemistry, biology and other disciplines.
Our new big data tools have the potential to usher in an information-based scientific revolution. And just as the telescope, the microscope, spectrometers and DNA sequencers have led to the creation of new scientific disciplines, data science is now rapidly emerging as the academic companion to big data. One of the most exciting parts of data science is that it can be applied to just about any domain of knowledge, given our newfound ability to gather valuable data on almost any topic, including healthcare, finance, management and the social sciences. But, like all scientific revolutions, this will take time.
According to Cukier and Mayer-Schönberger, datafication requires three profound changes in how we deal with data. The first is what they call n=all, that is, collecting and using lots of data rather than settling for small samples, as statisticians have done until now. “The way people handled the problem of capturing information in the past was through sampling. When collecting data was costly and processing it was difficult and time consuming, the sample was a savior. Modern sampling is based on the idea that, within a certain margin of error, one can infer something about the total population from a small subset, as long as the sample is chosen at random.”
Sampling requires anticipating how the data will be used so you can design the proper sample. It works well when asking questions about the overall sample, but not so well when you want to drill down into smaller subsets, since you are likely not to have enough data to do so effectively. Also, if you change your mind about the insights you want from the data, you generally have to get a new sample. All these issues pretty much go away when you collect and store all the data rather than a sample, that is, when the sample size n=all.
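To make the drill-down problem concrete, here is a minimal sketch using the standard textbook approximation for the margin of error of a proportion from a simple random sample. The survey size of 1,000 and the 2 percent subgroup share are hypothetical numbers chosen only for illustration; they do not come from the article.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated
    from a simple random sample of size n (normal approximation).
    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical numbers, for illustration only.
sample_size = 1000        # size of the whole random sample
subgroup_share = 0.02     # subgroup makes up 2% of the population

# Expected number of sampled people who fall in the subgroup.
subgroup_n = round(sample_size * subgroup_share)

overall = margin_of_error(sample_size)
subgroup = margin_of_error(subgroup_n)

print(f"whole sample   (n={sample_size}): +/- {overall:.1%}")
print(f"small subgroup (n={subgroup_n}): +/- {subgroup:.1%}")
```

Under these assumptions, a sample that is precise to within a few percentage points overall is far less precise for the small subgroup, because only a handful of its members were sampled. This is the sense in which a sample designed for one question may not support a finer-grained one.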
The next change requires accepting messiness instead of insisting on clean, carefully curated data. “[In] an increasing number of situations, a bit of inaccuracy can be tolerated, because the benefits of using vastly more data of variable quality outweigh the costs of using smaller amounts of very exact data. . . When there was not that much data around, researchers had to make sure that the figures they bothered to collect were as exact as possible. Tapping vastly more data means that we can now allow some inaccuracies to slip in (provided the data set is not completely incorrect), in return for benefiting from the insights that a massive body of data provides.”
I find the last major change, the shift from causation to correlation, particularly intriguing. As the authors put it: “Big data helps answer what, not why, and often that’s good enough.” Or, at least it’s good enough in the early stages of an empirical science, when we are looking for patterns that will help us predict future events and behaviors without necessarily having a good model or theory of why they happen. The models and theories usually come later, but sometimes they do not come at all.
For example, MIT professor Dimitris Bertsimas was part of the Reality of Big Data panel moderated by professor Brynjolfsson at the MIT CIO Symposium. He talked about his recent research analyzing decades of cancer treatment data in the hope of improving the life expectancy and quality of life of cancer patients at reasonable costs. Along with three of his students, he developed models for predicting survival and toxicity using patients’ demographic data as well as data on the chemotherapy drugs and dosages they were given. Their paper, An Analytics Approach to Designing Clinical Trials for Cancer, shows that it’s possible to predict future clinical trial outcomes based on past data, even if the exact combination of drugs being predicted has never been tested in a clinical trial before, and even if the reasons why this particular combination of drugs works are not understood.
“Using big data will sometimes mean forgoing the quest for why in return for knowing what. . . This represents a move away from always trying to understand the deeper reasons behind how the world works to simply learning about an association among phenomena and using that to get things done,” write Cukier and Mayer-Schönberger. “Of course, knowing the causes behind things is desirable. The problem is that causes are often extremely hard to figure out, and many times, when we think we have identified them, it is nothing more than a self-congratulatory illusion. Behavioral economics has shown that humans are conditioned to see causes even where none exist. So we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just have to let the data speak.”
“In a world where data shape decisions more and more, what purpose will remain for people, or for intuition, or for going against the facts?” ask the authors in their concluding paragraphs. “If everyone appeals to the data and harnesses big-data tools, perhaps what will become the central point of differentiation is unpredictability: the human element of instinct, risk taking, accidents, and even error. If so, then there will be a special need to carve out a place for the human: to reserve space for intuition, common sense, and serendipity to ensure that they are not crowded out by data and machine-made answers. . . [H]owever dazzling the power of big data appears, its seductive glimmer must never blind us to its inherent imperfections. Rather, we must adopt this technology with an appreciation not just of its power but also of its limitations.”