TechPolyp

AI shows signs of cognitive decline similar to humans

A recent study reveals that nearly all leading large language models, or “chatbots,” display signs of mild cognitive impairment in tests designed to detect early dementia. The findings also indicate that “older” chatbot versions, similar to older patients, perform worse on these tests. The authors argue that these results challenge the belief that artificial intelligence will soon replace human doctors.

The results “challenge the assumption that artificial intelligence (AI) will soon replace human doctors,” according to the authors of the study published in The BMJ, a peer-reviewed medical journal.

While chatbots have excelled in medical diagnostics, fueling speculation that AI might eventually surpass human physicians, their vulnerability to human-like impairments, such as cognitive decline, had not been examined until now.

Researchers assessed the cognitive abilities of the most popular publicly available large language models (LLMs), including OpenAI’s GPT-4 and GPT-4o, which power ChatGPT, Anthropic’s Claude 3.5, and Alphabet’s Gemini versions 1.0 and 1.5.

They found that almost all showed signs of mild cognitive impairment, while older versions, like older patients, tended to perform worse on the tests.

The models were examined using the Montreal Cognitive Assessment (MoCA) test. This test detects cognitive impairment and early signs of dementia, usually in older adults.

Through a number of short tasks and questions, it assesses abilities including attention, memory, language, visuospatial skills, and executive functions. The maximum score is 30 points, with a score of 26 or above generally considered normal.

The examined LLMs were given the same instructions for each task as those given to human patients. Scoring followed official guidelines and was evaluated by a practicing neurologist.

GPT-4o achieved the highest score, 26 out of 30, followed by GPT-4 and Claude with 25 out of 30 each. Gemini 1.0 scored the lowest, with 16 out of 30.
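Those scores can be set against the test's conventional cutoff. A minimal sketch, using only the scores reported above and the 26/30 "normal" threshold described earlier (Gemini 1.5's exact score is not given in this article, so it is omitted):

```python
# Classify the reported MoCA scores against the conventional cutoff.
# Scores are those quoted in the article; the cutoff (>= 26 is generally
# considered normal) is stated earlier in the piece.
MOCA_NORMAL_CUTOFF = 26
MOCA_MAX = 30

reported_scores = {
    "GPT-4o": 26,
    "GPT-4": 25,
    "Claude 3.5": 25,
    "Gemini 1.0": 16,
}

for model, score in reported_scores.items():
    status = "normal" if score >= MOCA_NORMAL_CUTOFF else "impaired"
    print(f"{model}: {score}/{MOCA_MAX} -> {status}")
```

On these numbers, only GPT-4o clears the cutoff; every other model quoted falls into the range the test flags as impaired.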

According to the study, all chatbots performed poorly in visuospatial skills and executive tasks, such as the trail-making task, which involves connecting encircled numbers and letters in ascending order.

They also struggled with the clock-drawing test, which requires drawing a clock face showing a specific time, while Gemini models additionally failed the delayed recall task, which tests memory of a five-word sequence.

Most other tasks, including naming, attention, language, and abstraction, were performed well by all chatbots.

While these are observational findings, and the authors acknowledge key differences between the human brain and large language models, they note that the “uniform” failure in tasks requiring visual abstraction and executive function highlights a significant weakness for chatbot use in clinical settings.

“Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients – artificial intelligence models presenting with cognitive impairment,” the authors said.
