How Misguided Incentives in AI Model Training Lead to False Claims

Language models hallucinate because standard training and evaluation methods reward guessing over acknowledging uncertainty, according to an OpenAI research paper.

The company provided the following definition of the issue:

“Hallucinations are plausible yet false statements generated by language models. They can manifest unexpectedly even when responding to seemingly straightforward questions.”

For instance, when the researchers asked a “widely used chatbot” for the title of the doctoral dissertation of Adam Tauman Kalai (one of the paper’s authors), it confidently produced three different answers, none of them correct. Asked for his birthday, it gave three incorrect dates.

OpenAI believes that hallucinations persist in part because current evaluation methods set the wrong incentives, encouraging models to guess rather than admit uncertainty.

The researchers offered an analogy: a student who doesn’t know the answer to a multiple-choice test question can still guess and happen to pick the right one, while leaving it blank guarantees a zero.

“Imagine a language model is asked about someone’s birthday but does not know. If it guesses ‘September 10,’ the chances of being correct are one in 365. The response ‘I don’t know’ guarantees zero points. After thousands of test questions, a guess-based model looks better on the scoreboard than a careful one that allows uncertainty,” the researchers explained.
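The arithmetic is easy to verify. Below is a minimal simulation (hypothetical code, not from the paper) of an accuracy-only leaderboard; it shows why blind guessing strictly dominates abstaining under such scoring:

```python
import random

N_QUESTIONS = 10_000
DAYS = 365  # possible birthdays per question

random.seed(0)
truths = [random.randrange(DAYS) for _ in range(N_QUESTIONS)]

# Strategy 1: always guess a random day.
guesser_correct = sum(random.randrange(DAYS) == t for t in truths)

# Strategy 2: always answer "I don't know" -- worth zero under accuracy-only grading.
abstainer_correct = 0

print(f"guesser accuracy:   {guesser_correct / N_QUESTIONS:.4%}")   # ~1/365, about 0.27%
print(f"abstainer accuracy: {abstainer_correct / N_QUESTIONS:.4%}")  # exactly 0%
```

Over thousands of questions, the guesser’s small but nonzero accuracy always outranks the careful model’s zero, which is exactly the incentive the researchers describe.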

On accuracy alone, OpenAI’s older o4-mini model scores slightly higher; its error rate, however, is significantly higher than GPT-5’s, because strategic guessing in uncertain situations improves accuracy while also increasing hallucinations.
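The distinction matters because abstentions decouple accuracy from error rate: two models can share the same accuracy while one hallucinates far more often. A hypothetical sketch of the three metrics:

```python
def eval_metrics(answers, truths):
    """Compute accuracy, error rate, and abstention rate.

    answers: model outputs; None means the model abstained ("I don't know").
    truths:  gold answers.
    """
    total = len(truths)
    correct = sum(a == t for a, t in zip(answers, truths) if a is not None)
    abstained = sum(a is None for a in answers)
    wrong = total - correct - abstained
    return {
        "accuracy": correct / total,      # fraction answered correctly
        "error_rate": wrong / total,      # fraction answered incorrectly (hallucinations)
        "abstention": abstained / total,  # fraction of "I don't know"
    }

# A cautious model: answers fewer questions but is rarely wrong.
cautious = eval_metrics(["a", None, None, "b"], ["a", "b", "c", "b"])
# A guessing model: answers everything, so every miss becomes an error.
guessing = eval_metrics(["a", "x", "y", "b"], ["a", "b", "c", "b"])
print(cautious)  # {'accuracy': 0.5, 'error_rate': 0.0, 'abstention': 0.5}
print(guessing)  # {'accuracy': 0.5, 'error_rate': 0.5, 'abstention': 0.0}
```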

Language models are first trained through “pre-training,” a process of predicting the next word across vast amounts of text. Unlike traditional machine-learning tasks, no “true/false” labels are attached to individual statements: the model sees only positive examples of fluent language and must approximate the overall distribution.
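As a toy illustration of next-word prediction (not the paper’s setup), a bigram model trained on raw text learns a distribution over continuations with no notion of truth or falsehood:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count next-word frequencies: the only training signal is the text itself.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_distribution(word):
    counts = bigrams[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# The model assigns probabilities to continuations; nothing marks any of
# them as factually true or false.
print(next_word_distribution("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```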

“It’s doubly challenging to distinguish correct statements from incorrect ones when there are no examples labeled as incorrect. Even with labels, errors are inevitable,” emphasized OpenAI.

The company offered another example. In image recognition, if millions of cat and dog photos are labeled accordingly, algorithms learn to classify them reliably. But if every pet photo were instead labeled with the animal’s birthday, the task would inevitably produce errors no matter how advanced the algorithm, because the labels are essentially random.

The same applies to text: spelling and punctuation follow consistent patterns, so those errors disappear as scale increases. Arbitrary low-frequency facts, such as a pet’s birthday, cannot be predicted from patterns alone, and that is where hallucinations arise.

The researchers argue that merely introducing “a few new tests that account for uncertainty” is insufficient. Instead, “widely used accuracy-based evaluations need to be updated so that their scoring no longer rewards guessing.”
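The paper proposes penalizing confident errors more heavily than expressed uncertainty, for example by instructing the model to answer only when its confidence exceeds a threshold t, with wrong answers costing t/(1−t) points. A hypothetical sketch of such a scoring rule:

```python
def threshold_score(is_correct, abstained, t=0.75):
    """Score one answer under a confidence-threshold rubric.

    Correct answers earn 1 point, "I don't know" earns 0, and wrong
    answers lose t/(1-t) points. With this penalty, answering has positive
    expected value only when the model's confidence exceeds t.
    """
    if abstained:
        return 0.0
    return 1.0 if is_correct else -t / (1 - t)

# Expected score of answering with confidence p: p*1 + (1-p)*(-t/(1-t)).
# At p = t this is exactly 0, the same as abstaining -- so the incentive
# to blind-guess disappears.
for p in (0.5, 0.75, 0.9):
    expected = p * 1 + (1 - p) * threshold_score(False, False)
    print(f"confidence {p:.2f}: expected score {expected:+.3f}")
```

With t = 0.75 the loop prints −1.000, +0.000, and +0.600: below the threshold, guessing scores worse than saying “I don’t know.”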

“If the main [evaluation] scoreboards continue to reward lucky guesses, models will keep learning to guess,” the OpenAI researchers stated.