
What Limits Agentic Systems Efficiency?

Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman

2025-10-21


Summary

This paper investigates how quickly and efficiently systems built on large language models (LLMs) can complete tasks that require interacting with the internet, such as searching for information. It finds that while LLMs are getting smarter, the time they spend retrieving information from the web is often a major slowdown.

What's the problem?

Current systems that combine LLMs with web access, often called 'agentic systems', are good at reasoning but can be surprisingly slow. The research shows that much of the delay comes not from the LLM thinking, but from the time it takes to load web pages and retrieve information from the internet. This delay is unpredictable, varying greatly depending on the LLM and the website being accessed, and can account for as much as 53.7% of the total time it takes to complete a task.

What's the solution?

To address this speed issue, the researchers developed a system called SpecCache. SpecCache is a smart caching system that predicts what web pages the LLM will need and stores them ahead of time. It also uses 'speculative execution,' meaning it starts loading pages it *thinks* will be needed, even before the LLM specifically asks for them. This significantly reduces the time spent waiting for web pages to load, improving performance without sacrificing accuracy.

Why it matters?

This research is important because as LLMs become more powerful and are used for more complex tasks, their speed becomes crucial. If these systems are too slow, they won't be practical for real-world applications. By identifying and addressing the bottleneck of web interaction, this work helps pave the way for faster, more efficient AI systems that can reliably access and use information from the internet.

Abstract

Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated strong reasoning capabilities. To further enhance LLM capabilities, recent agentic systems, such as Deep Research, incorporate web interactions into LLM reasoning to mitigate uncertainties and reduce potential errors. However, existing research predominantly focuses on reasoning performance, often neglecting the efficiency of agentic systems. In this work, we present a comprehensive empirical study that identifies efficiency bottlenecks in web-interactive agentic systems. We decompose end-to-end latency into two primary components: LLM API latency and web environment latency. We conduct a comprehensive empirical study across 15 models and 5 providers to demonstrate high variability in API-based agentic systems. We observe that web environment latency can contribute as much as 53.7% to the overall latency in a web-based agentic system. To improve latency, we propose SpecCache, a caching framework augmented with speculative execution that can reduce web environment overhead. Extensive evaluations on two standard benchmarks show that our approach improves the cache hit rate by up to 58x compared to a random caching strategy, while reducing web environment overhead by up to 3.2x, without degrading agentic system performance.