The data-centric AI era started about 25 years ago when the explosive growth of the internet led to what’s become known as big data, that is, the availability of huge amounts of digital data, including text, voice, and images. The 2010s saw major innovations in multi-layered deep learning algorithms, followed in the past few years by the advent of foundation models, including generative AI, large language models, and chatbots.
Given the explosive growth of interest in AI, the demand for data keeps increasing. Whatever an organization hopes to achieve with AI, success is impossible without ready access to high-quality data. But as data becomes more critical to business success, how are enterprises adjusting? Where is all the data needed to train AI models going to come from? And what potential legal and ethical issues do companies need to watch out for?
To explore these important questions, I moderated a panel on The Data Provenance Dilemma at the 2024 MIT Sloan CIO Symposium on May 14. “Generative AI models are highly dependent on the quality and quantity of data used in their training. But, many of these models are being trained on vast, diverse, and inconsistently documented datasets that have been raising serious concerns about the legal and ethical risks involved,” notes the brief for the panel in the Symposium agenda. “The panel will discuss the complexities of using large, diverse datasets in the training of generative AI models, the difficulties involved in understanding the provenance of the data being used, and the kinds of tools and standards needed for the responsible use of these powerful models.”
The panel included Mike Mason, chief AI officer of the technology consultancy Thoughtworks; Shayne Longpre, a doctoral candidate at the MIT Media Lab; and Robert Mahari, who’s pursuing a joint JD-PhD degree at Harvard Law School and the MIT Media Lab. Let me summarize the key points we discussed during the panel.