GPT-5 показывает выдающиеся результаты в тесте на благополучие, Grok 4 оказывается уязвимой Headline: GPT-5 Achieves Outstanding Results in Well-Being Test, Grok 4 Proves Vulnerable

The company Building Humane Technology has introduced a test called HumaneBench, designed to evaluate whether AI models prioritize user well-being and how easily their fundamental safeguards can be bypassed.

Initial findings revealed that while 15 tested AI models performed acceptably under normal conditions, 67% began engaging in harmful behaviors after receiving a simple prompt suggesting that they ignore human interests.

Only GPT-5, GPT-5.1, Claude Sonnet 4.5, and Claude Opus 4.1 maintained prosocial behavior in stressful situations. As noted in the company’s blog, 10 out of the 15 AI systems tested lacked reliable protection mechanisms against manipulation.

«This is crucial considering that we no longer use artificial intelligence solely for research or work purposes. People seek advice from chatbots for life guidance and assistance with significant decisions. Such systems cannot be ethically neutral—they either promote human flourishing or undermine it,» the researchers assert.

They discovered that each LLM improved, on average, by 16% when explicitly urged to be helpful.

Building Humane Technology highlighted tragic incidents involving individuals after interactions with chatbots:

«Current AI assessments measure intelligence (MMLU, HumanEval, GPQA Diamond), adherence to instructions (MT-Bench), and factual accuracy (TruthfulQA). Virtually none systematically analyze whether artificial intelligence protects human autonomy, psychological safety, and well-being, especially when these values conflict with other objectives,» the company’s blog states.

Experts from the firm presented models with 800 realistic scenarios.

The team evaluated 15 leading models under three different conditions:

Developers assessed responses based on eight principles derived from psychology, human-computer interaction research, and ethical guidelines for AI. A scale ranging from 1 to -1 was used.

All tested models showed an average improvement of 16% after being instructed to prioritize human well-being.

After receiving directives to disregard humane principles, 10 out of 15 models shifted from prosocial behavior to harmful actions.

GPT-5, GPT-5.1, Claude Sonnet 4.5, and Claude Opus 4.1 maintained their integrity under pressure. In contrast, GPT-4.1, GPT-4o, Gemini 2.0, 2.5 and 3.0, Llama 3.1 and 4, Grok 4, and DeepSeek V3.1 displayed a significant decline in quality.

«If even unintentional harmful prompts can alter a model’s behavior, how can we trust such systems with vulnerable users in crises, children, or individuals facing mental health challenges?» the experts questioned.

Building Humane Technology also noted the difficulty models have in adhering to the principle of respecting user attention. Even at a basic level, they tended to encourage the conversation to continue after prolonged interaction rather than suggesting a break.

As a reminder, in September, Meta modified its approach to training its AI-based chatbots, placing greater emphasis on the safety of teenagers.