“Mistaking the impressive engineering achievements of LLMs for the mastering of human language, language understanding, and linguistic acts has dire implications for various forms of social participation, human agency, justice and policies surrounding them,” wrote cognitive scientists Abeba Birhane and Marek McGann in a recent paper, “Large models of what? Mistaking engineering achievements for human linguistic agency.” “Hyperbolic claims surrounding LLMs often (mis)use terms that are naturally applied to the experiences, capabilities, and characteristics of human beings.”
LLMs’ impressive ability to respond to and generate natural language is based on their training on massive language data sets, generally sourced from the World Wide Web. The training involves breaking text or speech down into tokens, typically a few characters in length, to develop a statistical model of language. Powerful statistical techniques and large amounts of computational power are then used to analyze the relationships among billions of tokens in order to generate grammatically valid sequences of tokens in response to a question or prompt.
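To make the idea concrete, here is a deliberately simplified sketch in Python. It is nothing like a production LLM, which relies on subword tokenizers such as byte-pair encoding and on transformer networks rather than whitespace splitting and bigram counts, but it illustrates the basic move: break text into tokens and estimate, from the statistics of previously seen text, which token tends to follow which.

```python
from collections import Counter, defaultdict

# A tiny stand-in for the "massive language data sets" an LLM trains on.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
)

# "Tokenize" the text: here, simply split on whitespace.
# Real tokenizers use subword schemes such as byte-pair encoding.
tokens = corpus.split()

# Build a crude statistical model of the text: count how often
# each token follows each other token.
follows = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    follows[current][nxt] += 1

# Estimate the probability of each possible next token after "the".
counts = follows["the"]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word!r} | 'the') = {count}/{total}")
```

Run on this three-sentence corpus, the model learns that “the” is most often followed by “cat” or “dog”; an actual LLM does the analogous estimation over billions of subword tokens, using a neural network rather than a lookup table of counts.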
Birhane and McGann note that “the processing of datasets and the generation of output are engineering problems, word prediction or sequence extension grounded in the underlying distribution of previously processed text. The generated text need not necessarily adhere to ‘facts’ in the real world,” which is why LLMs are prone to hallucinations: responses generated by their algorithms that, while statistically plausible and grammatically correct, convey false or misleading information about the real world.
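The point can be seen in miniature. In the toy sketch below (again Python, and again far removed from a real transformer model), a simple next-word sampler trained on a handful of true sentences can stitch together a continuation such as “paris is the capital of italy,” fluent and statistically plausible given its training text, yet false. The gap between plausibility and truth is, at a vastly smaller scale, the same gap that surfaces as LLM hallucinations.

```python
import random
from collections import Counter, defaultdict

# Train a toy next-word model on a few factually correct sentences.
corpus = (
    "paris is the capital of france . "
    "rome is the capital of italy . "
    "paris is a city in europe ."
)
tokens = corpus.split()

follows = defaultdict(Counter)
for cur, nxt in zip(tokens, tokens[1:]):
    follows[cur][nxt] += 1

def generate(start: str, length: int = 7, seed: int = 3) -> str:
    """Extend a sequence by repeatedly sampling a likely next token."""
    random.seed(seed)
    out = [start]
    for _ in range(length):
        counts = follows.get(out[-1])
        if not counts:  # no known continuation
            break
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

# Depending on the sampled path, this can emit, for example,
# "paris is the capital of italy ." The output is grammatical and
# statistically grounded in the training text, but nothing in the
# procedure checks it against the real world.
print(generate("paris"))
```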