About a year ago, an article in the Harvard Business Review called data scientist the sexiest job of the 21st century. Its authors, Tom Davenport and D. J. Patil, succinctly defined a data scientist as “a high-ranking professional with the training and curiosity to make discoveries in the world of big data.”
For over ten years, we’ve been using the term big data to refer to the fast-growing volumes and varieties of digital data being collected, much of it in real time. Over the years, we’ve made considerable progress in the storage and management of these big data sets, helped along by major advances in computer science, math, statistics and related disciplines.
But all that progress is of limited value without people with the skill sets to extract important insights out of all that data. The promise of data science is that big data will lead to significantly better decisions and predictions, to the smart management of social organizations like cities, companies and economies, and to research breakthroughs in the social sciences, medicine and a number of other disciplines. And, on that front, most everyone agrees that people with the necessary skills are scarce. Demand for data scientists has raced ahead of supply, primarily because there have been few university programs training students in this emerging discipline.
But the situation is rapidly changing. A recent NY Times article observed that “Universities can hardly turn out data scientists fast enough. . . Because data science is so new, universities are scrambling to define it and develop curriculums. As an academic field, it cuts across disciplines, with courses in statistics, analytics, computer science and math, coupled with the specialty a student wants to analyze, from patterns in marine life to historical texts.”
In general, universities are developing two main kinds of programs. Some have a stronger theoretical focus, and are more oriented toward teaching methods, tools, algorithms and other such foundational skills. Others, while still requiring a technical background, are more applied in nature, focusing on the application of data science in industry, government and various academic disciplines.
This Fall, NYU is launching two new data science master’s programs, one along each of these two lines. The new NYU Center for Data Science is offering a two-year MS in Data Science requiring a strong background in math, computer science and statistics. The second program is the MS in Applied Urban Science and Informatics being offered by NYU’s Center for Urban Science and Progress (CUSP). I am closely involved with this program, since for the past year I’ve been associated with CUSP as executive-in-residence.
CUSP is one of three new research and educational institutions created by Applied Science NYC, an initiative launched by Mayor Michael Bloomberg in December of 2010 to significantly enhance New York City’s competitiveness and job creation in the applied sciences and engineering. On August 26, Mayor Bloomberg personally welcomed CUSP’s inaugural class.
The one-year, three-semester program includes core courses in urban science and in informatics-related technical disciplines. In addition, it offers specializations in three key areas: urban domains like transportation, public health and sustainability; big-data-oriented skills like data management and curation, decision models and optimization, and simulation and computational methods; and entrepreneurship and innovation leadership. Given my industry background and past teaching experience, I will be personally involved in this last area of specialization.
The emergence of a new discipline is very exciting indeed, especially when there are now not only research programs to organize but students to teach. While research is more open-ended, classes and degrees have concrete timelines. Developing the new courses raises lots of questions whose answers we will learn over time. What knowledge and skills should someone with an MS in applied urban science and informatics have? What does it mean to be an applied data scientist?
Let me offer some personal thoughts on this last question. I think the key here is to put the emphasis on the science aspects of data science. Scientific disciplines seek to develop testable explanations and predictions based on the application of well established scientific methods to their particular areas of research: “To be termed scientific, a method of inquiry must be based on empirical and measurable evidence subject to specific principles of reasoning.”
For centuries, we’ve seen that every time a powerful new tool is developed, it makes possible all kinds of new measurements and observations. These often enable scientific discoveries of all kinds, a few of which lead to scientific revolutions, transforming the way we think about the world around us. For example, about a year ago, CERN’s Large Hadron Collider led to the discovery of the Higgs boson. NASA’s Kepler telescope has led to the discovery of a fairly large number of Earth-like planets orbiting other stars. Closer to home, advances in DNA sequencing promise to revolutionize biology and medicine.
With big data, we have essentially turned our measuring instruments on ourselves. We’ve been using the ubiquitous digital technologies and devices all around us to both create and collect massive amounts of information on who we are, what we do and how we interact as individuals, communities and institutions. And, as has long been the case with physics, biology and other mature disciplines, we are now aiming to leverage all these sources of data to develop new scientific methods of inquiry in emerging data-science-oriented applications like urban informatics, social science research, information-based medicine and business processes of all sorts.
Will data science actually become a widespread new discipline, or will its research and educational programs become part of existing, established departments? We don’t know enough yet, as we are just starting to develop the needed research and education programs. Perhaps a reasonable guidepost is to look at the advent of computer science. While much work had gone on with digital computers in the 1940s and 1950s, computer science came into its own in the 1960s. Like data science, computer science also has its roots in a number of disciplines, primarily math for its more theoretical foundations, and engineering for its more applied aspects. Arguably, computer science became an established, widely accepted discipline around the mid-1970s, and expanded in multiple directions with the advent of personal computers in the 1980s and the Internet in the 1990s.
It’s too early to tell whether data science will similarly become a distinct discipline, with its own research agenda and educational programs that will train future generations of data scientists. But, regardless of which way it evolves, we need to carefully nurture it along and give it time to develop.
In particular, for data science to become a serious discipline we must carefully manage expectations and avoid unrealistic promises and hype. This is good advice in general, but especially in a field that has been called the sexiest job of the 21st century. It’s OK to have aspirational visions, but don’t underestimate how much hard work there is to do and how long it will take to realize the visions. Let results speak for themselves and learn from experience. Study not only those cases that succeed, but also those that do not but may yield valuable lessons learned. Serious disciplines take time to come into their own.
One of the data scientists I most admire is Nate Silver. In the 2012 presidential election, Silver correctly predicted the winner in all 50 states, including all nine highly contested swing states. He also correctly predicted the winner in 31 of the 33 Senate races. As an article evaluating his predictions pointed out: “Forty-eight out of 50 states actually fell within his margin of error, giving him a success rate of 96 percent. And assuming that his projected margin of error figures represent 95 percent confidence intervals, which it is likely they did, Silver performed just about exactly as well as he would expect to over 50 trials.”
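The arithmetic behind that evaluation is easy to check. Below is a minimal sketch in Python, assuming (as the quoted article does) that the margins of error are 95 percent confidence intervals, and further assuming that each state is an independent trial. The independence assumption is mine and is a simplification, since polling errors across states tend to be correlated; the function and variable names are hypothetical and chosen only for illustration.

```python
from math import comb

STATES = 50            # number of state-level forecasts
WITHIN_MARGIN = 48     # states whose result fell inside the stated margin of error
NOMINAL = 0.95         # assumed coverage of each margin of error (95 percent interval)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Observed share of states inside the margin of error (the 96 percent in the quote).
observed_rate = WITHIN_MARGIN / STATES

# Number of states we would expect inside the margin if the intervals are well calibrated.
expected_hits = NOMINAL * STATES

# How likely exactly 48 hits, or 48 or fewer hits, would be under the assumptions above.
p_exactly_48 = binomial_pmf(WITHIN_MARGIN, STATES, NOMINAL)
p_at_most_48 = prob_at_most(WITHIN_MARGIN, STATES, NOMINAL)

print(f"observed coverage: {observed_rate:.2%}")
print(f"expected hits at 95% coverage: {expected_hits:.1f} of {STATES}")
print(f"P(exactly 48 hits): {p_exactly_48:.3f}")
print(f"P(48 or fewer hits): {p_at_most_48:.3f}")
```

Under these assumptions, 48 hits sits right next to the expected 47.5, which is the sense in which the forecasts were about as well calibrated as their stated margins implied.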
He’s been making successful predictions for over ten years, first in sabermetrics, the use of statistics in baseball to project a player’s performance and career, and since 2007 in political forecasting. Given his track record, even the gods on Mount Olympus would forgive him some hubris now and then. But Silver is truly a data scientist. This is evident throughout his September 2012 book The Signal and the Noise: Why Most Predictions Fail but Some Don't. In the book, Silver not only explains his own particular approach to information-based predictions, but also examines the growing field of predictions and why so many fail in spite of, or perhaps because of, the vast quantities of information we now have available. He writes in the introductory chapter:
“The exponential growth in information is sometimes seen as a cure-all, as computers were in the 1970s. Chris Anderson, the editor of Wired magazine, wrote in 2008 that the sheer volume of data would obviate the need for theory, and even the scientific method. This is an emphatically pro-science and pro-technology book, and I think of it as a very optimistic one. But it argues that these views are badly mistaken. The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning. . . Data-driven predictions can succeed - but they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves.”
He describes a number of areas where information-based predictions have been successful, including baseball, political elections and the forecasting of hurricanes. “But,” he warns, “these cases of progress in forecasting must be weighed against a series of failures.” These include our inability to see the September 11 attacks coming as well as our inability to predict the recent global financial crisis. “There are entire disciplines in which predictions have been failing, often at great cost to society.”
We have a long way to go. But the arrival of our inaugural class marks a big step, not only for NYU’s Center for Urban Science and Progress (CUSP) but for data science in general. The real learning is now starting, not only for the students but for this emerging, exciting discipline.
Is America, or the world, ready for “big data”? If the recent events playing out over the NSA disclosures are an example of readiness, we have failed the test.
I realize that the science of big data collection and analysis is a complex one in and of itself. But the profession, the use of enabling technology to collect, peruse and analyze the data will be inhibited if not dealt a death blow unless and until the ethics, the legality, the privacy issues are on the table, debated and resolved. I have little doubt that we can build the technologies, that we can learn and teach the skills required of the science, and the data scientist, but will we be able to satisfy the legitimate privacy concerns of a skeptical public? The jury is out on these thorny issues.
Posted by: Bud Byrd | September 11, 2013 at 09:39 PM