Large language model (LLM)-based agents show limitations when applied to real-world customer relationship management (CRM) scenarios. A recent Salesforce study using the company's CRMArena-Pro benchmark highlights key performance gaps, especially in handling complex tasks and managing customer data responsibly.
According to the results, LLM agents achieved a success rate of around 58% on single-step CRM tasks, which require just one straightforward action. However, performance dropped significantly, to only 35%, when agents faced more complex, multi-step scenarios.
Another critical concern raised in the study was the agents' lack of confidentiality awareness: they failed to recognize or properly handle sensitive customer data. While this issue can be mitigated through better prompt engineering, doing so lowers overall task performance, creating a trade-off between safety and efficiency.
These findings, first reported by The Register, underscore the importance of aligning LLMs with real-world business needs, especially in fields where accuracy, data privacy, and trust are essential.
As enterprises increasingly look toward automation to streamline CRM, these findings serve as a valuable reminder that human oversight and thoughtful implementation remain crucial.


