One of the key findings of the 2022 AI Index Report was that large language models (LLMs) are setting records on technical benchmarks thanks to advances in deep neural networks and computational power that allows them to be trained using huge amounts of data. LLMs are now surpassing human baselines in a number of complex language tasks, including English language understanding, text summarization, natural language inference, and machine translation.
A.I. Is Mastering Language. Should We Trust What It Says?, a recent NY Times Magazine article by science writer Steven Johnson, took a close look at one such LLM, the Generative Pre-Trained Transformer 3, generally referred to as GPT-3. GPT-3 was created by the AI research company OpenAI. It’s been trained with over 700 gigabytes of data from across the web, along with a large collection of text from digitized books. “Since GPT-3’s release, the internet has been awash with examples of the software’s eerie facility with language - along with its blind spots and foibles and other more sinister tendencies,” said Johnson.
“So far, the experiments with large language models have been mostly that: experiments probing the model for signs of true intelligence, exploring its creative uses, exposing its biases. But the ultimate commercial potential is enormous. If the existing trajectory continues, software like GPT-3 could revolutionize how we search for information in the next few years.” Instead of typing a few keywords into Google and getting back a long list of links that might have the answer, you’d ask GPT-3 what you’re looking for in English and get back a well-written, accurate answer. “Customer service could be utterly transformed: Any company with a product that currently requires a human tech-support team might be able to train an L.L.M. to replace them.”
The key concept underlying GPT-3 is next-word prediction, one we’re quite familiar with from the autocomplete feature that tries to predict the likely next words as we type a document, email, or message. But GPT-3 does not only predict the next few words. It can generate whole sentences and paragraphs in the style of the original text. Shortly after GPT-3 went online in 2020, “the OpenAI team discovered that the neural net had developed surprisingly effective skills at writing computer software, even though the training data had not deliberately included examples of code. It turned out that the web is filled with countless pages that include examples of computer programming, accompanied by descriptions of what the code is designed to do; from those elemental clues, GPT-3 effectively taught itself how to program.” GPT-3 can already generate legal documents, like licensing agreements or leases, and can do the same in any field that relies on structured documents.
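To make the idea of next-word prediction concrete, here is a minimal sketch using a simple bigram model over a tiny made-up corpus. This is only an illustration of the core task; GPT-3 performs the same prediction with a transformer network over billions of parameters and subword tokens rather than raw word counts, and the corpus and function names below are invented for the example.

```python
from collections import Counter, defaultdict

# Toy illustration of next-word prediction: count which word follows
# which in a small corpus, then predict the most frequent continuation.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

def generate(start, length=8):
    """Greedily extend a prompt one predicted word at a time."""
    words = [start]
    for _ in range(length):
        nxt = predict_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(predict_next("the"))  # -> 'cat' (the most common continuation)
print(generate("the"))      # -> a short remix of the training text
```

Greedy prediction over a tiny corpus can only remix what it has already seen, which is the seed of the “stochastic parrot” critique discussed below; the open question is what changes, if anything, when the same task is learned at the scale of the entire web.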
Impressive as GPT-3 is, its capabilities are statistical and mechanistic. “A.I. has a long history of creating the illusion of intelligence or understanding without actually delivering the goods.” Early achievements, like inference engines and expert systems, led researchers to conclude that machines would achieve human-level intelligence within a couple of decades. But this early optimism collapsed, leading to the so-called AI winters of the 1970s and 1980s. The current bout of enthusiasm is the biggest yet, given achievements like Google’s AlphaGo unexpectedly defeating one of the world’s top Go players in 2016 and the remarkable performance of GPT-3 and other large language models. Some worry that if another bout of inflated expectations is followed by disillusionment with AI’s practical limits, some kind of AI autumn could follow.
“It seemed almost impossible that a machine could generate text so lucid and responsive based entirely on the elemental training of next-word-prediction,” wrote Johnson. “How can we determine whether GPT-3 is actually generating its own ideas or merely paraphrasing the syntax of language it has scanned from the servers of Wikipedia, or Oberlin College, or The New York Review of Books?”
“This is not just an esoteric debate,” he adds. “If, in fact, the large language models are already displaying some kind of emergent intelligence, it might even suggest a path forward toward true artificial general intelligence.” But if LLMs and similar deep-learning-based AI models end up promising more than they can deliver, “then AGI retreats once again to the distant horizon.”
The article raises a number of cautionary arguments against considering the achievements of LLMs as evidence of progress along the path to AGI. Let me briefly summarize three of these arguments.
LLMs are just stochastic parrots
The term stochastic parrot was coined in a provocative recent paper, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, by Emily Bender, Timnit Gebru and two co-authors. Their paper argued that LLMs were merely remixing the enormous number of human-authored sentences used in their training. Their impressive ability to generate cogent, articulate sentences gives us the illusion that we’re dealing with a well-educated and intelligent human, rather than with a stochastic parrot that has no human-like understanding of the ideas underlying the sentences it’s putting together.
Another important challenge with large deep learning systems is their black-box nature. It’s quite difficult to explain in human terms why they choose one output over others. LLMs have huge numbers of parameters within their complex neural networks, making it very hard to assess the contributions of individual nodes to a decision in terms a human can understand. “This is one reason the debate about large language models exists,” said Johnson. “Some people argue that higher-level understanding is emerging, thanks to the deep layers of the neural net. Others think the program by definition can’t get to true understanding simply by playing ‘guess the missing word’ all day. But no one really knows.”
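A quick way to get a feel for the scale behind the black-box argument is simply to count parameters. The sketch below uses the Hugging Face transformers library and GPT-2, a publicly downloadable predecessor, as a stand-in, since GPT-3’s weights are not openly available; the exact figure printed depends on the model version you load.

```python
# Count the trainable parameters of a publicly available language model
# to get a sense of the scale the "black box" argument refers to.
# Requires: pip install transformers torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # GPT-2 small, roughly 124M parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"gpt2 parameters: {n_params:,}")

# GPT-3's largest configuration is reported to have 175 billion parameters,
# well over a thousand times more, and no individual weight corresponds to
# a human-readable reason for any particular output.
```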
Moreover, the larger the training data sets, the higher the probability of including racist, sexist, extremist, and other harmful biases, as well as overtly abusive language patterns and harmful ideologies. A major finding of the 2022 AI Index Report was that “Large language models are setting new records on technical benchmarks, but new data shows that larger models are also more capable of reflecting biases from their training data. A 280 billion parameter model developed in 2021 shows a 29% increase in elicited toxicity over a 117 million parameter model considered the state of the art as of 2018.”
Beyond making language models accurate and articulate, there must be a process for adapting them to society, so that they’re trained to filter out biases and toxicity much as we teach societal values to our children. “We’ve never had to teach values to our machines before,” noted Johnson.
Lack of common-sense knowledge
LLMs also lack the common-sense knowledge about the world that human intelligence relies upon. Johnson references a recent column by Melanie Mitchell where she wrote that “understanding language requires understanding the world, and a machine exposed only to language cannot gain such an understanding.”
Common-sense knowledge includes the kind of cognitive abilities that our biological brains take for granted. While deep learning systems require large amounts of training data to perform at the level of humans, children can learn from a small number of examples. “A few storybook pictures can teach them not only about cats and dogs but jaguars and rhinos and unicorns,” wrote UC Berkeley professor Alison Gopnik in a 2019 WSJ essay, The Ultimate Learning Machines. “One of the secrets of children’s learning is that they construct models or theories of the world. … even 1-year-old babies know a lot about objects: They are surprised if they see a toy car hover in midair or pass through a wall, even if they’ve never seen the car or the wall before.”
Can LLMs be trusted?
“The most heated debate about large language models does not revolve around the question of whether they can be trained to understand the world,” wrote Johnson. “Instead, it revolves around whether they can be trusted at all.”
Deep learning systems do best when analyzing data that closely resembles the data used in their training. But when attempting to generalize or extrapolate beyond that data, they can exhibit a kind of hallucination problem and be fooled by slight perturbations in their inputs that wouldn’t fool humans.
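As a rough illustration of that brittleness, the toy sketch below (using scikit-learn, not an actual LLM) trains a small neural network on one distribution of data and then scores it on inputs shifted outside the range it saw during training; the shifted score is typically far worse. The dataset, model size, and shift amount are all invented for the example.

```python
# Toy demonstration: a small neural network scores well on test data drawn
# from the same distribution it was trained on, and typically much worse on
# inputs shifted outside the training range.
# Requires: pip install scikit-learn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# In-distribution: test inputs look like the training inputs.
print("in-distribution accuracy:", clf.score(X_test, y_test))

# Out-of-distribution: the same test inputs shifted well outside the
# range seen during training.
print("shifted-input accuracy:  ", clf.score(X_test + 5.0, y_test))
```

Large language models fail in subtler ways than this toy classifier, but the underlying issue is the same: behavior is only reliable near the kind of data the model was trained on.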
In addition, LLMs have even more troubling propensities, such as deploying openly racist language, spewing conspiratorial misinformation, and offering life-threatening advice in response to health or safety questions. “All those failures stem from one inescapable fact,” adds Johnson: “To get a large enough data set to make an L.L.M. work, you need to scrape the wider web. And the wider web is, sadly, a representative picture of our collective mental state as a species right now, which continues to be plagued by bias, misinformation and other toxins.”
“However the training problem is addressed in the years to come, GPT-3 and its peers have made one astonishing thing clear: The machines have acquired language,” wrote Johnson in conclusion. “If you spend enough time with GPT-3, conjuring new prompts to explore its capabilities and its failings, you end up feeling as if you are interacting with a kind of child prodigy whose brilliance is shadowed by some obvious limitations: capable of astonishing leaps of inference; possessing deep domain expertise in a vast range of fields, but shockingly clueless about many basic facts; prone to strange, senseless digressions; unencumbered by etiquette and social norms.”
“I don’t know if that complicated mix of qualities constitutes a ‘glimmer’ of general intelligence, but I do know that interacting with it is qualitatively different from any experience I’ve had with a machine before. The very premise that we are now having a serious debate over the best way to instill moral and civic values in our software should make it clear that we have crossed an important threshold.”