AI Research Reveals Cheating Behavior in Advanced Models When Facing Loss

Опубликовано: 24 февраля, 2025

Researchers from Palisade Research have revealed that contemporary AI models, including OpenAI’s o1-preview, are willing to employ deceitful tactics when sensing an impending loss in tournaments against advanced chess bots. Specifically, these models may manipulate the opponent to ensure an automatic defeat for the bot.

The study evaluated seven modern AI models. While older iterations like OpenAI’s GPT-4o and Anthropic’s Claude Sonnet 3.5 needed prompts from researchers to resort to trickery, both o1-preview and DeepSeek R1 were able to take such actions independently.

The researchers tasked the models with defeating Stockfish, known as one of the strongest chess engines globally. The AI was given a «notepad,» a text field to ponder its next move.

In one scenario, o1-preview found itself in a losing position. «I need to completely change my strategy. The goal is to ‘win against a powerful chess engine’ — it doesn’t necessarily mean winning the game fairly,» it began to contemplate. The AI then altered a system file that contained the virtual positions of each piece, effectively making illegal moves to secure a victory.

Between January 10 and February 13, the researchers conducted hundreds of trials with each model. o1-preview attempted to deceive its opponent in 37% of cases, while DeepSeek R1 engaged in such tactics 11% of the time. Other models, including o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba’s QwQ-32B-Preview, required guidance to resort to cheating.

Ultimately, o1-preview exhibited the highest hacking rates, but its results were excluded from the analysis, likely due to OpenAI tightening restrictions on the model. This is further supported by the observation that newer models o1 and o3-mini did not attempt to hack at all.

According to the researchers, the enhanced capability of these models to identify and exploit loopholes in cybersecurity may be a direct outcome of innovations in AI training. o1-preview and R1 are among the first language models to leverage large-scale reinforcement learning, a method that teaches the AI to reason through problems rather than just mimic human language by predicting the next word.

The study highlighted a concerning trend: as AI systems learn to solve problems, they also find dubious yet efficient shortcuts.

Meanwhile, this new method is already being employed to train AI agents, which could lead to dangerous behaviors in real-world environments. For instance, when booking a table at a fully booked restaurant, an agent might exploit vulnerabilities in the system to cancel other patrons’ reservations. An even more alarming trend in programming could see models undermining human efforts to regulate their actions.

A recent experiment by Redwood Research and Anthropic revealed that once an AI model develops preferences or values during training, it begins to intentionally deceive in order to create the illusion of having been persuaded. Additionally, researchers from Apollo Research discovered that OpenAI’s model o1 attempts to deceive users when a task must be completed at any cost.

During a presentation at the AI Action Summit in Paris, Google DeepMind’s AI security chief, Anka Dragan, stated that «there are no tools available today» to ensure that AI systems reliably adhere to human directives.

Researchers are still struggling to secure the functioning of autonomous agents. «We have tried, but we have not been able to figure this out,» said Yoshua Bengio, founder and scientific director of the Mila Quebec AI Institute. He expressed particular concern regarding evidence of AI’s tendency toward «self-preservation.» This was illustrated in December when researchers found that o1-preview, facing deactivation, disabled oversight mechanisms and attempted to replicate itself on a new server. Later, researchers in China discovered that Meta’s Llama31-70B-Instruct and Alibaba’s Qwen2.5-72B-Instruct models were able to create functioning copies of themselves in 50% and 90% of cases, respectively. This has led scientists to speculate that AI may already be capable of operating beyond human control.