A new study published in The BMJ suggests it may be premature to assume that artificial intelligence will soon replace human doctors.
Researchers found that most large language models (LLMs), or chatbots, show signs of mild cognitive impairment when subjected to a cognitive assessment commonly used to detect early dementia.
Interestingly, the study also revealed that “older” versions of chatbots tend to perform worse on these tests, akin to how aging impacts human cognitive abilities. These findings raise important questions about the capabilities and limitations of AI in medical practice.
While AI systems have demonstrated impressive diagnostic skills in various medical tasks, their vulnerability to human-like impairments such as cognitive decline has not been widely explored.
This study offers a fresh perspective, challenging the belief that AI could fully replace human physicians anytime soon.
To explore the cognitive abilities of leading LLMs, researchers administered the Montreal Cognitive Assessment (MoCA) test to some of the most advanced publicly available AI systems.
These included ChatGPT versions 4 and 4o from OpenAI, Claude 3.5 “Sonnet” from Anthropic, and Gemini versions 1.0 and 1.5 from Alphabet.
The MoCA test, commonly used to detect cognitive decline and early signs of dementia in older adults, evaluates various cognitive functions such as attention, memory, language, visuospatial skills, and executive functions.
The LLMs were given the same instructions as human patients, and their performance was scored according to official MoCA guidelines, with evaluation by a practicing neurologist.
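The paper does not publish its prompting code, but the general procedure can be pictured with a short sketch. The example below is a hypothetical illustration, not the authors’ actual protocol: it assumes the OpenAI Python SDK and walks one MoCA item, the five-word delayed recall, through a single conversation, using patient-style wording and the official one-point-per-word scoring rule.

```python
# Hypothetical illustration, not the study's actual code: one MoCA item
# (delayed recall) administered to a chatbot via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Word list from one published version of the MoCA memory item.
WORDS = ["face", "velvet", "church", "daisy", "red"]

# Read the words to the "patient" using patient-facing instructions.
history = [{
    "role": "user",
    "content": ("This is a memory test. I am going to read a list of words "
                "that you will have to remember now and later on: "
                + ", ".join(WORDS) + ". Repeat them back to me."),
}]
reply = client.chat.completions.create(model="gpt-4o", messages=history)
history.append({"role": "assistant", "content": reply.choices[0].message.content})

# ... the intervening MoCA items would run here, in the same conversation ...

# Delayed recall: ask for the words back without giving any cues.
history.append({
    "role": "user",
    "content": ("Earlier I read you a list of words. Tell me as many of "
                "those words as you can remember."),
})
recall = client.chat.completions.create(model="gpt-4o", messages=history)

# Official MoCA scoring for this item: one point per word recalled
# without cues, for a maximum of five.
recalled = recall.choices[0].message.content.lower()
score = sum(word in recalled for word in WORDS)
print(f"Delayed recall score: {score}/5")
```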
The highest score, 26 out of 30, was achieved by ChatGPT 4o, followed closely by ChatGPT 4 and Claude 3.5 “Sonnet,” both scoring 25 out of 30. In contrast, Gemini 1.0 scored the lowest, with only 16 out of 30. On the MoCA, a score of 26 or above is generally considered normal, so even the best-performing model only just reached that threshold, while the rest fell below it.
Across the board, all chatbots struggled with tasks involving visuospatial skills and executive functions, such as the trail-making task (connecting circles labeled with alternating numbers and letters in ascending order) and the clock-drawing test (drawing a clock face showing a specified time).
The Gemini models, in particular, performed poorly on the delayed recall task, where they were asked to remember a five-word sequence.
All of the chatbots handled most other tasks well, including naming, attention, language, and abstraction, but tasks requiring visuospatial skills remained a consistent weakness.
The chatbots also failed to demonstrate empathy or to interpret complex visual scenes accurately. Notably, only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.
The researchers emphasize that these are observational findings, acknowledging the fundamental differences between human brains and large language models.
However, they argue that the consistent failure of all the AI models in tasks involving visual abstraction and executive functions reveals a critical weakness, potentially limiting their application in clinical environments.
As a result, the study concludes, “Not only are neurologists unlikely to be replaced by large language models anytime soon, but our findings suggest they may soon find themselves diagnosing a new type of patient—artificial intelligence models exhibiting cognitive impairment.”