AI Duel on the Diplomacy Battlefield: Claude Chooses Peace While ChatGPT o3 Masters Betrayal

Do you remember the recent tweets from notable figures in the AI world? Andrej Karpathy (formerly of OpenAI) floated an intriguing idea: what if we compared large language models (LLMs) not through mundane benchmarks but through games? Games that require thought and interaction instead of simply providing answers. «Great idea,» chimed in Noam Brown from OpenAI, «it would be fascinating to see how leading bots perform in **Diplomacy**!»

Karpathy agreed, noting that the real challenge lies in negotiations among players, not in the rules themselves. Elon Musk simply expressed his approval with a succinct «Yeah,» while Nobel laureate Demis Hassabis from DeepMind added, «Cool!» This concept was gaining traction, and enthusiast Alex Duffy decided, «Why not?»

On Monday, he published a post titled **«We invited top AI models to play Diplomacy. Here’s who won.»** And yes, this isn’t just a report: the games can still be followed in real time on [Twitch](https://www.twitch.tv/ai_diplomacy)! Duffy, by the way, oversees AI training at the consulting firm Every.

Imagine Europe in 1901: a tense atmosphere, anticipation of a great war. Players are the great powers. The objective? Control a significant portion of the map. How? Through alliances, negotiations, information exchanges, and… ruthless betrayal. This isn’t about rolling dice, but about pure power and the ability to manipulate.

Duffy created a modified version known as **AI Diplomacy** and organized a tournament. **18 leading models** from various developers competed, seven per match, as the rules of the game require. The task was straightforward: dominate the European map. So, what did the results reveal?

By placing AI in an open battlefield of minds, Duffy observed how models «collaborated, argued, threatened, and even outright lied to one another.» The behaviors varied significantly.

**Undisputed champion: ChatGPT o3 (OpenAI).** This model, which is positioned as «our most powerful model for tasks in coding, mathematics, science, visual perception, and much more,» had a distinctive advantage: **masterful deception of opponents.** It wasn’t shy about being cunning and betraying others, leading it to victory.

**Strong contender: Gemini 2.5 (Google).** This model also performed well, winning several games. Its style? **Strategic moves** that left opponents at a disadvantage and set them up for later defeat.

**Idealist: Claude (Anthropic).** This one was interesting! Claude turned out to be **too** diplomatic. It often **chose peace even when it compromised its chances of winning.** «Peace is more important than victory,» Duffy summarized its approach. This principled stance led to more modest results.

However, Duffy emphasizes that the true value of the experiment lies not just in comparing models. **The key insight is deeper:** our methods for evaluating AI are lagging behind.

**«Most benchmarks are misleading. Models are evolving so quickly that they routinely pass even the most challenging quantitative tests once considered the gold standard,»** the researcher states.

The game of Diplomacy vividly illustrated that **real intelligence and the ability for complex interactions** are revealed in dynamic, non-standard environments. To prepare AI for the real world, we need such multifaceted tests—with elements of uncertainty, negotiation, and even ethical decision-making.

Duffy’s research serves as an excellent wake-up call for the community: it’s time to move beyond conventional tests and seek new, more dynamic ways to understand what AI is **truly** capable of. Meanwhile… let’s keep an eye on the stream as the AI models continue their virtual battles for Europe!