Bad news for anybody who claims that larger amounts of data improve the performance of
LLM-based systems. The opposite appears to be true: smaller, specialized datasets produce
better results for questions in the same domain.
In any case, hybrid systems that use symbolic methods to evaluate results are preferable
to purely LLM-based techniques.
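Here is a minimal sketch, in Python, of the kind of symbolic check I have in mind; the
function name and structure are my own illustration, not any particular system's API:

# Hybrid idea in miniature: let the LLM answer, but gate the answer
# through an exact symbolic check before trusting it. Illustrative
# sketch only; llm_reply stands in for any chat model's output.

def verify_sum(a: int, b: int, llm_reply: str) -> bool:
    """Accept the model's reply only if it matches exact integer arithmetic."""
    try:
        return int(llm_reply.strip().replace(",", "")) == a + b
    except ValueError:
        return False  # non-numeric output fails verification outright

# The article's own arithmetic question: 24427 + 7120 = 31547.
print(verify_sum(24427, 7120, "31547"))  # True
print(verify_sum(24427, 7120, "31457"))  # False: a wrong answer gets caught

The point is that the symbolic side, not the model, is the arbiter of correctness; an
answer that fails the check can be rejected or flagged instead of passed to the user.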
Some excerpts below from:
www.newscientist.com/article/2449427-ais-get-worse-at-answering-simple-ques…
John
____________________
AIs get worse at answering simple questions as they get bigger
Using more training data and computational power is meant to make AIs more reliable, but
tests suggest large language models actually get less reliable as they grow.
AI developers try to improve the power of LLMs in two main ways: scaling up – giving them
more training data and more computational power – and shaping up, or fine-tuning them in
response to human feedback.
José Hernández-Orallo at the Polytechnic University of Valencia, Spain, and his colleagues
examined the performance of LLMs as they scaled up and shaped up. They looked at OpenAI’s
GPT series of chatbots, Meta’s LLaMA AI models, and BLOOM, developed by a group of
researchers called BigScience.
The researchers tested the AIs by posing five types of task: arithmetic problems, solving
anagrams, geographical questions, scientific challenges and pulling out information from
disorganised lists.
They found that scaling up and shaping up can make LLMs better at answering tricky
questions, such as rearranging the anagram “yoiirtsrphaepmdhray” into
“hyperparathyroidism”. But this isn’t matched by improvement on basic questions, such as
“what do you get when you add together 24427 and 7120”, which the LLMs continue to get
wrong.
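[Aside from me, not part of the article: both of those examples are easy to check
mechanically, which is exactly where symbolic evaluation earns its keep. A quick
illustrative Python check follows.]

from collections import Counter

# An anagram answer must use exactly the letters of the scrambled input.
scrambled, answer = "yoiirtsrphaepmdhray", "hyperparathyroidism"
print(Counter(scrambled) == Counter(answer))  # True: same letters, same counts

# The arithmetic question likewise has one exact value, no model required.
print(24427 + 7120)  # prints 31547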
While their performance on difficult questions got better, the likelihood that an AI
system would avoid answering any one question – because it couldn’t – dropped. Since every
question is either answered correctly, answered incorrectly or avoided, fewer avoided
questions meant the likelihood of an incorrect answer rose.
The results highlight the dangers of presenting AIs as omniscient, as their creators often
do, says Hernández-Orallo – a picture that some users are too ready to believe. “We have an
overreliance on these systems,” he says. “We rely on and we trust them more than we
should.”