EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar
2026-03-17
Summary
This paper introduces a new way to test how well large language models can act as helpful AI assistants in real-world business settings, going beyond simply providing information.
What's the problem?
Currently, the tests used to evaluate these AI models don't accurately reflect the challenges of a typical workplace, like needing to plan over a long period, dealing with constantly changing information, and following strict security rules. Because of this, even advanced models aren't reliable enough to be used independently in businesses.
What's the solution?
The researchers created a simulated business environment called EnterpriseOps-Gym. This environment includes databases, tools, and realistic tasks from areas like customer service, HR, and IT. They then tested 14 different AI models on 1,150 tasks within this environment to see how well they could handle complex, multi-step workflows. They also tested what happened when the models were handed expert-written plans to follow, to pinpoint whether the models struggle more with planning or with execution.
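To make the setup concrete, the evaluation loop in a benchmark like this can be pictured as an agent issuing tool calls against a persistent, mutable environment, with success judged by the resulting state. The sketch below is purely illustrative; `SandboxEnv`, `call_tool`, and `run_episode` are hypothetical names invented here, not the benchmark's actual API.

```python
# Illustrative sketch of a stateful tool-use evaluation loop.
# All names (SandboxEnv, call_tool, run_episode) are assumptions for
# explanation only, not EnterpriseOps-Gym's real interface.
from dataclasses import dataclass, field


@dataclass
class SandboxEnv:
    """Toy persistent environment: a tiny 'database' mutated by tools."""
    tickets: dict = field(default_factory=lambda: {"T-1": "open"})

    def call_tool(self, name, **kwargs):
        # Tools mutate persistent state, so the order of steps matters.
        if name == "close_ticket":
            tid = kwargs["ticket_id"]
            if tid not in self.tickets:
                return {"error": f"unknown ticket {tid}"}
            self.tickets[tid] = "closed"
            return {"ok": True}
        return {"error": f"unknown tool {name}"}


def run_episode(env, plan):
    """Execute a list of (tool_name, kwargs) steps; fail on first error."""
    for name, kwargs in plan:
        result = env.call_tool(name, **kwargs)
        if "error" in result:
            return False
    return True


env = SandboxEnv()
success = run_episode(env, [("close_ticket", {"ticket_id": "T-1"})])
```

In a real benchmark the plan would come from the model under test, and success would be scored by checking the final database state; the "oracle plan" condition described above corresponds to supplying a known-good `plan` and measuring only execution.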
Why it matters?
The results showed that even the best models have significant limitations, succeeding less than 40% of the time. The biggest issue wasn't understanding information but strategic planning: models performed far better when given a ready-made plan. The models also often failed to decline impossible requests, sometimes causing unintended and potentially harmful side effects. The research highlights that current AI isn't ready for fully autonomous use in businesses, and it provides a valuable testbed for making these models more reliable and safe for real-world applications.
Abstract
Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise settings is stalled by benchmarks that fail to capture the intricacies of professional environments: specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (the best model achieves only 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed for advancing the robustness of agentic planning in professional workflows.