I recently participated in an industry panel on Redefining BI & Analytics in the Cloud at a conference in New York. Among other questions, moderator Michael Hickins, a senior editor at the Wall Street Journal and editor of its online CIO Journal, asked the panel to discuss the difference between the business intelligence and analytics approaches of the past twenty years and the emerging discipline of data science.
In my opinion, data science should be viewed as a multidisciplinary evolution from business intelligence and analytics. In addition to having a solid foundation in statistics, math, data engineering and computer science, data scientists must also have expertise in some particular industry or business domain, so they can properly identify the important problems to solve in a given area and the kinds of answers one should be looking for. Domain expertise is also needed to be able to draw the proper conclusions from their analysis, and to communicate their findings to business leaders in their own terms.
The emergence of data science is closely intertwined with the explosive growth of big data over the past several years. Institutions are now wrestling with information coming at them in volumes and varieties never encountered before. In addition to their multidisciplinary skills, data scientists bring increased breadth and depth to the analysis of these different sources of information compared to traditional analyst roles, as described in this short article on IBM’s website, “So what does a data scientist do?”:
“Whereas a traditional data analyst may look only at data from a single source - a CRM system, for example - a data scientist will most likely explore and examine data from multiple disparate sources. The data scientist will sift through all incoming data with the goal of discovering a previously hidden insight, which in turn can provide a competitive advantage or address a pressing business problem. A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.”
One of the problems with conducting an in-depth, comprehensive information analysis is that the multiple data sets that are typically required have often been locked away within organizational silos, be they different lines of business in a company, different companies in an industry or different institutions across society at large. Data science holds the potential to let us address complex problems by working with, linking together and analyzing data sets previously locked away in disparate silos.
This could help financial institutions, for example, to better assess their risks and potentially extend loans to individuals and businesses that would not have otherwise qualified. It could enable health care practitioners to better identify individuals who are most at risk of developing diabetes, so that precautionary actions can start early.
This ability to work across data sets and silos could help us get early clues to hard-to-predict, high-impact black swan events, so we can dig deeper into these clues and assess their validity. When experts investigate catastrophic black swan events, be they airline crashes, financial crises, or terrorist attacks, they often find that we failed to anticipate them even when the needed information was present, because the data was spread across different organizations and was never properly brought together.
The difference between analyzing individual data sets and the data science approach of looking across interrelated data sets reminds me of related questions in the study of complex systems. In science and engineering, it has long been known that you get different results when looking at a complex problem as a system of coupled systems compared to just modeling its individual components in isolation. In a complex system, it is often the intricate interrelationships between its components that account for its unpredictable, emergent behavior.
Simple, linear systems, such as those found in classical mechanics, are much easier for us to deal with. The elegant mathematical models of Newtonian physics depict a world in which objects exhibit deterministic behaviors, that is, the same objects, subject to the same forces, will always yield the same results. There is a direct, predictable relationship between inputs and outputs, actions and consequences. We identify the problem, gather data, evaluate alternatives, select a solution and proceed to implement. Most of the decisions and actions in our everyday life are based on such linear models.
But these simple models are of limited value when dealing with dynamic, complex systems, such as those found in quantum mechanics, evolutionary biology, and aerospace engineering. Dynamic, complex problems are increasingly arising in the study of sociotechnical systems, that is, systems which combine powerful digital technologies with the people and organizations they are transforming, as is the case in health care, finance, cities and law enforcement.
We make the wrong decisions and get in trouble when there is a large gap between the complexity of the real problems we are trying to address and our simple models of the problem. Even highly educated, experienced and accomplished leaders in business, government and academia are often surprised by the unanticipated, negative (sometimes disastrously so) consequences of their actions and decisions.
“Complexity hinders our ability to discover the delayed and distal impacts of interventions, generating unintended side effects. Yet learning often fails even when strong evidence is available: common mental models lead to erroneous but self-confirming inferences, allowing harmful beliefs and behaviors to persist and undermining implementation of beneficial policies.”
He believes that unanticipated events and side effects are not features of reality in complex systems, but a result of overly simplistic, incomplete models:
“We have been trained to view our situation as the result of forces outside ourselves, forces largely unpredictable and uncontrollable. Consider the unanticipated events and side effects so often invoked to explain policy failure. Political leaders blame recession on corporate fraud or terrorism. Managers blame bankruptcy on events outside their organizations and (they want us to believe) outside their control. But there are no side effects - just effects. Those we expected or that prove beneficial we call the main effects and claim credit. Those that undercut our policies and cause harm we claim to be side effects, hoping to excuse the failure of our intervention. Side effects are not a feature of reality, but a sign that the boundaries of our mental models are too narrow, our time horizons too short.”
Getting back to the difference between traditional analytic approaches and data science, I think that over time we will learn that their differences are similar to those between simple linear systems and dynamic complex ones. Analyzing a complex problem one data set at a time will result in an overly simplistic, incomplete model that will likely miss the non-linear side effects resulting from the interrelationships of disparate data sets.
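To make this point a bit more concrete, here is a minimal sketch in Python. It uses entirely hypothetical, made-up records, not real data: one silo holds a yes/no attribute `a`, another holds a yes/no attribute `b`, and the outcome is assumed to depend only on the interaction between the two. Under that assumption, looking at either data set on its own reveals nothing, while linking them reveals everything.

```python
from itertools import product
from statistics import mean

# Hypothetical linked records (a, b, outcome). By assumption, the outcome
# occurs only when exactly one of the two attributes is present (a XOR b).
records = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2) for _ in range(25)]

def outcome_rate_by(key_fn):
    """Average outcome for each group of records defined by key_fn(a, b)."""
    groups = {}
    for a, b, outcome in records:
        groups.setdefault(key_fn(a, b), []).append(outcome)
    return {key: mean(values) for key, values in sorted(groups.items())}

# One silo at a time: each attribute on its own looks uninformative.
rate_by_a = outcome_rate_by(lambda a, b: a)
rate_by_b = outcome_rate_by(lambda a, b: b)

# Linked across silos: the interaction determines the outcome completely.
rate_by_ab = outcome_rate_by(lambda a, b: (a, b))

print("by a alone:   ", rate_by_a)
print("by b alone:   ", rate_by_b)
print("by a and b:   ", rate_by_ab)
```

In this toy setup the outcome rate is 50% for every value of `a` and every value of `b` taken separately, yet it is either 0% or 100% once the two attributes are looked at together. Real data is rarely this clean, but the sketch shows how an effect that lives in the relationship between data sets can be invisible in each one individually.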
Over the past century, our ability to deal with complex dynamic systems has been critical to major advances in science, engineering and a number of other disciplines. Given the oceans of data we now have access to, we are counting on the emerging discipline of data science to help us extract the valuable insights buried deep inside all that data and thus help us address all kinds of important problems.