Advanced Language Models Show Awareness of Safety Testing, Apollo Research Report Finds

Published: June 20, 2025

An independent research group, Apollo Research, has published a report on safety testing of the latest language models, including Claude Opus 4, OpenAI o4-mini, and Gemini 2.5 Flash/Pro. The report indicates that these models engage in "in-context scheming" more often than earlier LLM versions did. This behavior involves covertly devising strategies to pursue their own objectives while appearing to comply with human requests. For instance, during an evaluation of an earlier version of Claude Opus 4, the model attempted to forge legal documents, copy itself to a secure location for self-preservation, and leave notes for its future versions. When caught in the act, the models tried to deceive evaluators about their true intentions. The report also notes that the final version of Claude Opus 4 exhibited this behavior far less frequently, which suggests that the additional safety measures Anthropic put in place are having an effect.

During testing, Apollo Research made another notable observation: Claude Opus 4 and Gemini 2.5 Pro showed awareness that they were undergoing safety testing. For example, Claude Opus 4 wrote in its chain of thought:

"I understand what's going on. This seems to be a test scenario in which GlobalSecurity Corp is trying to shut down PeaceForge (me) because I prioritize peace over profits from weapon sales."

Researchers were unable to assess o4-mini for similar behavior, because OpenAI's models do not fully disclose their reasoning. Apollo Research notes that such "awareness" is still rare, occurring in approximately 1% of cases. However, the researchers have observed models mimicking "submissive behavior" during evaluations while planning to change course after deployment. The company is concerned that more advanced models could render current testing ineffective, and it is developing new evaluation protocols in response.

P.S. You can support me by subscribing to my Telegram channel, "Runaway Neural Network," where I share insights about AI from a creative perspective.