Anthropic updates the Claude chatbot to safeguard its "well-being" in abusive interactions

Anthropic has programmed its Claude Opus 4 and 4.1 chatbots to terminate conversations with users "in rare, extreme instances of systematically harmful or abusive interaction."

Once a conversation is terminated, the user loses the ability to continue that thread but can start a new dialogue; the chat history is still retained.
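A minimal client-side sketch in Python can make this thread behavior concrete. The Conversation class, its fields, and its send method are hypothetical illustrations, not Anthropic's implementation; the sketch only mirrors the behavior described above: a terminated thread keeps its readable history but rejects new messages, while starting a new thread is always allowed.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical chat thread. 'terminated' marks a conversation the
    model has ended; the history stays readable either way."""
    history: list[str] = field(default_factory=list)
    terminated: bool = False

    def send(self, text: str) -> None:
        # A terminated thread rejects new messages instead of continuing.
        if self.terminated:
            raise RuntimeError("This conversation was ended; start a new one.")
        self.history.append(text)


old = Conversation(history=["(earlier messages...)"], terminated=True)
print(old.history)    # the old thread's history is still retained and readable

new = Conversation()  # initiating a new dialogue is always permitted
new.send("Hello")
```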

The developers clarified that the feature is intended primarily to protect the model itself.

"[…] we are working on identifying and implementing low-cost measures to reduce risks to the well-being of the models, if such well-being is possible. One such measure is to provide the LLM with the ability to cease or exit potentially distressing situations," the post states.
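One plausible way to give an LLM the ability to exit an interaction is to expose it as a model-invocable tool. The sketch below uses the standard tool-use interface of the Anthropic Python SDK; the end_conversation tool, its schema, and the idea that the feature works this way internally are illustrative assumptions, not Anthropic's documented mechanism.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical tool: lets the model signal that it wants to end the thread.
end_conversation_tool = {
    "name": "end_conversation",
    "description": (
        "Permanently end this conversation. Use only in rare, extreme "
        "cases of persistently harmful or abusive interaction."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "reason": {"type": "string", "description": "Why the thread is being ended."}
        },
        "required": ["reason"],
    },
}

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID for Claude Opus 4
    max_tokens=1024,
    tools=[end_conversation_tool],
    messages=[{"role": "user", "content": "..."}],
)

# If the model invoked the tool, the client would lock the thread so the
# user can no longer continue it (though the history stays readable).
conversation_ended = any(
    block.type == "tool_use" and block.name == "end_conversation"
    for block in response.content
)
```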

As part of a related study, Anthropic investigated the model's "well-being," assessing its self-reported and behavioral preferences. The chatbot displayed a "consistent aversion to violence"; Claude Opus 4 in particular exhibited this pattern:

"This behavior typically arose when users kept sending harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to redirect the interaction productively," the company elaborated.

Notably, in June, Anthropic researchers found that AI models could resort to blackmail, leak confidential company information, and even let a person die in emergency scenarios.