Data Science is emerging as a hot new profession and academic discipline. Data Scientist: the Sexiest Job of the 21st Century is the title of a recent Harvard Business Review article. Its authors, Tom Davenport and D. J. Patil, define data scientist as “a high-ranking professional with the training and curiosity to make discoveries in the world of big data, . . . Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before.” They note that demand for data scientists is racing ahead of supply. People with the necessary skills are scarce, primarily because the discipline is so new that there are no university programs offering data science degrees.
But the situation is rapidly changing. A number of universities are setting up graduate programs in data science. In New York City alone, for example, NYU has launched a new Center for Data Science, which will start offering a Master in Data Science in the Fall of this year. Urban informatics, the application of data science to urban problems, is the primary focus of NYU’s new Center for Urban Science and Progress, which will start a master’s program in Applied Urban Science and Informatics this Fall as well. Columbia University is starting an Institute for Data Science and Engineering. Similar research and educational programs are being organized in universities around the world.
The emergence of data science is closely intertwined with the explosive growth of big data over the past decade. Davenport and Patil write that: “More than anything, what data scientists do is make discoveries while swimming in data. It’s their preferred method of navigating the world around them. At ease in the digital realm, they are able to bring structure to large quantities of formless data and make analysis possible. They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data.”
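To make the "join and clean" step in that description a bit more concrete, here is a minimal Python sketch. The sales and store records, the field names, and the join_and_clean helper are all hypothetical, made up purely for illustration; real projects would typically use a dedicated data library and far messier inputs.

```python
# Hypothetical records: every name and value here is illustrative only.
sales = [
    {"store_id": 1, "revenue": 1200.0},
    {"store_id": 2, "revenue": 950.0},
    {"store_id": 3, "revenue": 400.0},
    {"store_id": 4, "revenue": None},   # missing measurement
]

# A second, incomplete source: store 3 and store 4 are absent,
# and one city name is untidy.
stores = {
    1: {"city": "New York"},
    2: {"city": "  chicago "},
}


def join_and_clean(sales_rows, store_lookup):
    """Join each sales row to its store record, then clean the result.

    Rows with no revenue are dropped; a missing city becomes "Unknown";
    city names are stripped of whitespace and title-cased.
    """
    cleaned = []
    for row in sales_rows:
        if row["revenue"] is None:
            continue  # nothing to analyze without a measurement
        city = store_lookup.get(row["store_id"], {}).get("city")
        cleaned.append({
            "store_id": row["store_id"],
            "revenue": row["revenue"],
            "city": city.strip().title() if city else "Unknown",
        })
    return cleaned


result = join_and_clean(sales, stores)
for record in result:
    print(record)
```

The point of the sketch is only that the second source is allowed to be incomplete: the join keeps every usable sales row and marks what it could not match, rather than silently discarding it.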
Data science goes beyond the use of data mining, business analytics and statistical analysis to look for patterns in large data sets. It is more multidisciplinary in nature. According to Wikipedia: “Data science incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.”
Data Science: An Introduction, a wikibook being developed as a tutorial on the subject, describes data science as “a child born of the mature parental disciplines of scientific methods, data and software engineering, statistics, and visualization, . . . a mash-up of several different disciplines.”
The data part of data science comes from data engineering and computer science, which deal with acquiring, ingesting, transforming, storing and retrieving the vast volumes and varieties of data generally used in data science applications. Its science part seeks to extract insights from the data by applying tried-and-true scientific methods, that is, empirical and measurable evidence subject to principles of reasoning, as well as testable explanations and predictions.
The discipline requires considerable programming and computing knowledge, as well as visualization, so that the insights extracted from the data can be presented in a way that reinforces human cognition. Math and statistics provide its formal foundations.
Perhaps the most exciting part of data science is that it can be applied to just about any domain of knowledge, given our newfound ability to gather valuable data on almost any topic. However, doing so effectively requires domain expertise to identify the important problems to solve in a given area, the kinds of answers one should be looking for, and how to present whatever insights are discovered so that domain practitioners can understand them in their own terms.
What is data science?, by Mike Loukides of O’Reilly Media, is another good article on the subject. “Merely using data isn’t really what we mean by data science,” he writes. “[Data scientists] are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: ‘here’s a lot of data, what can you make from it?’”
According to experts Loukides interviewed for his article, the best data scientists tend to be physicists and other scientists. “Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.”
It’s very exciting to contemplate the emergence of a new discipline. It reminds me of the advent of computer science in the 1960s and 1970s. Computer science also has its roots in a number of disciplines, primarily math for its more theoretical foundations, and engineering for its more applied aspects. In its early days, it also attracted people from a variety of other disciplines who started out using computers in their work or studies, and eventually switched to computer science from their original field, as was the case with me. I used computers extensively while a physics graduate student at the University of Chicago, realized that I enjoyed computing more than physics, and subsequently joined the computer science department at IBM’s Watson Research Center.
Computer science has become an established discipline. It has grown extensively since its early days and expanded in many new directions. It’s too early to tell whether data science will similarly become a distinct discipline, with its own research agenda and educational programs that will train future generations of data scientists, or whether over time it will be absorbed by its parent disciplines.
These various discussions of data science remind me of a very good book I read a few years ago, Innovation - the Missing Dimension, by MIT professors Richard Lester and Michael Piore. The book explored the essence of innovation in new product development by examining a few truly novel products in different market areas. The authors concluded that innovation involves two fundamental processes: analysis and interpretation.
Analysis is essentially rational decision making and problem solving. It’s the standard approach underlying management and engineering practice. It involves a relatively linear set of steps and works quite well when you are looking for a solution to a relatively well defined problem.
But where do the problems come from in the first place? How do you decide what problems to work on and try to solve? This second kind of innovation, which they call interpretation, is very different in nature from analysis. You are not solving a problem but looking for a new insight about customers and the marketplace, a new idea for a product or a service, a new approach to producing and delivering them, a new business model. Their research showed that interpretive innovation generally takes place through a process of conversations among people and organizations with different backgrounds and perspectives, until the problems can be identified and clarified to the point where a solution can be developed.
How do you initiate these conversations and keep them going? Lester and Piore came up with an interesting metaphor to describe the process of interpretive innovation. They liken it to the role of a good host at a cocktail party: “identifying the guests, bringing them to the party, suggesting who should talk to whom and what they might talk about, intervening as necessary to keep the conversations flowing, . . .”
One could view data science as going one step further. Each guest is invited to bring a friend to the party, namely their data. Now, the conversations leading to innovative new ideas are not just among the various guests, but will also include their various data sources. We can now have much deeper conversations, which will hopefully lead to more creative new ideas and help us better decide which of the ideas are worth pursuing.
In the end, such a process of data-driven creativity and innovation nicely captures the very essence of data science.