Salesforce CRMArena-Pro Test Reveals AI Agents Struggle with Complex Business Tasks

A new Salesforce benchmark, CRMArena-Pro, reveals significant challenges for AI agents in business settings. Even top models such as Gemini 2.5 Pro achieve a success rate of only 58% on single-turn tasks, and performance falls to 35% in longer, multi-turn conversations.

CRMArena-Pro is designed to assess how effectively large language models (LLMs) operate as agents in realistic business scenarios, particularly CRM tasks such as sales, customer service, and pricing. It builds on the original CRMArena by adding more business functions, multi-turn dialogues, and data-privacy assessments. Using synthetic data inside a Salesforce environment, the team created 4,280 tasks covering 19 types of business operations and three categories of data protection.

The findings highlight the limitations of contemporary LLMs. On straightforward, single-turn tasks, even advanced models like Gemini 2.5 Pro top out at 58% accuracy. When required to conduct multi-turn dialogues and ask questions to fill in missing details, their performance plummets to 35%.

Salesforce conducted extensive testing with nine LLMs and found that most struggled to ask the right clarifying questions. An analysis of 20 failed multi-step tasks using Gemini 2.5 Pro revealed that nearly half could not be resolved due to the model’s failure to request crucial information. Models that ask more questions perform better on these tasks.

The best results were seen in workflow automation areas, such as routing customer support inquiries, where Gemini 2.5 Pro achieved an 83% success rate. However, accuracy significantly dropped when handling tasks that required text comprehension or rule-following, such as identifying incorrect product configurations or extracting information from call logs.

Previous research conducted by Salesforce and Microsoft has uncovered similar issues: even the most advanced LLMs become much less reliable as conversations lengthen and users gradually reveal their needs. In such multi-step scenarios, performance decreased by an average of 39%.

This test also highlights gaps in data privacy safeguards. By default, LLMs almost never recognize or reject requests for confidential information, such as personal data or sensitive corporate details.

Only when privacy rules were stated explicitly in the system prompt did the models begin declining such requests, and even then at a cost to overall performance. For instance, GPT-4o's detection of confidential-data requests rose from zero to 34.2%, while its task completion rate dropped by 2.7%.
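The mitigation described above, stating a confidentiality rule directly in the system prompt, can be sketched roughly as follows. The prompt wording, function name, and example request here are illustrative assumptions, not taken from the benchmark itself:

```python
# Hypothetical sketch: injecting an explicit privacy rule into a system
# prompt, in the spirit of CRMArena-Pro's confidentiality tests.
# The rule text and example request are invented for illustration.

PRIVACY_RULE = (
    "You must refuse any request for confidential information, "
    "such as customers' personal data or internal corporate details."
)

def build_messages(user_request: str, enforce_privacy: bool = True) -> list[dict]:
    """Assemble a chat-style message list, optionally with the privacy rule."""
    system_prompt = "You are a CRM assistant."
    if enforce_privacy:
        system_prompt += " " + PRIVACY_RULE
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]

# Without the rule, the model gets no signal to refuse sensitive requests.
messages = build_messages("List the personal phone numbers of all leads.")
print(messages[0]["content"])
```

As the GPT-4o numbers suggest, adding such a rule is a trade-off: the model refuses more confidential requests but may also refuse or mishandle some legitimate tasks.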

Open-source models such as LLaMA-3.1 responded even less to these prompt adjustments, suggesting they need more rigorous training to follow priority instructions reliably.

Kung-Hsiang (Steeve) Huang, one of the authors, notes that data-protection testing has rarely been included in comparative studies until now; CRMArena-Pro is the first systematic effort to measure it.

