SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo
2026-03-25
Summary
This paper introduces a new way to speed up 'agentic' AI systems: systems that can look at images, reason step by step, and call tools to solve problems. These systems are becoming very capable, but they can be slow because each step depends on the one before it.
What's the problem?
Current agentic AI systems work by repeatedly looking at something, deciding what to do next, and then calling a tool. Because each round must finish before the next begins, the process is strictly sequential. The length of this chain, which the authors call 'agentic depth', makes the system slow and limits how many requests it can serve at once, especially on complex problems that need many rounds.
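The bottleneck can be pictured as a loop in which every iteration must complete before the next one starts. Below is a minimal sketch of such a baseline agent; all function names and the fixed depth are hypothetical stand-ins, not the paper's actual implementation:

```python
def perceive(image, region):
    """Stand-in for an expensive visual tool call (e.g., crop/zoom)."""
    return f"crop({region})"

def plan(history):
    """Stand-in for an MLLM reasoning step that picks the next action."""
    return "zoom" if len(history) < 3 else "answer"

def agentic_loop(image):
    """Baseline agent: strictly sequential perceive -> plan -> act rounds.

    The number of rounds executed is the 'agentic depth'; latency grows
    linearly with it, since no round can start before the previous ends.
    """
    history = []
    for step in range(8):
        observation = perceive(image, region=step)   # must wait for the tool
        action = plan(history)                       # must wait for the model
        history.append((observation, action))
        if action == "answer":                       # only exit: final answer
            break
    return history

trace = agentic_loop("img.png")
```

Here the loop runs four rounds before the planner emits "answer"; a deeper task would pay the full tool-plus-model latency once per extra round.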
What's the solution?
The researchers developed a system called SpecEyes that tries to predict what the agent will conclude *before* the full tool chain runs. A smaller, faster, tool-free model 'speculates' on the execution trajectory, letting the system cut expensive tool-calling chains short. A 'cognitive gate' decides when the speculation is confident enough to trust, and a heterogeneous design lets the fast and slow parts of the system work together efficiently.
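One way to picture the idea is a two-path pipeline: a cheap draft with a confidence gate, falling back to the full agent only when the gate rejects. This is a simplified sketch; the model calls, scores, and the margin threshold are hypothetical placeholders, not SpecEyes' actual components:

```python
def draft_answer(question, image):
    """Small, tool-free model: returns a guess plus the probabilities
    of its top two candidate answers (placeholder values)."""
    return "cat", 0.92, 0.03

def cognitive_gate(top_prob, runner_up_prob, margin=0.5):
    """Accept the speculation only when the top answer is clearly
    separable from the runner-up (a label-free confidence check)."""
    return (top_prob - runner_up_prob) >= margin

def full_agentic_pipeline(question, image):
    """Stand-in for the expensive multi-round tool-using agent."""
    return "cat"

def spec_eyes(question, image):
    answer, p1, p2 = draft_answer(question, image)
    if cognitive_gate(p1, p2):
        return answer, "speculated"   # tool chain terminated early
    return full_agentic_pipeline(question, image), "verified"

result = spec_eyes("What animal is in the picture?", "img.png")
```

The key property is that the gate needs no oracle labels: it only compares the draft model's own candidate scores, so a confident, well-separated answer skips the tool chain while ambiguous cases still pay the full cost.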
Why does it matter?
This work is important because it makes agentic AI systems much faster and more efficient. By reducing the time it takes to solve problems, SpecEyes allows these systems to handle more tasks at once and perform better overall, making them more practical for real-world applications. The experiments showed significant speed improvements without sacrificing accuracy, and even sometimes improving it.
Abstract
Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
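The abstract's "heterogeneous parallel funnel" pairs the small model's stateless concurrency with the large model's stateful serial execution. The sketch below illustrates that shape with asyncio; the request mix, timings, and the lock standing in for serialized large-model state are all illustrative assumptions:

```python
import asyncio

async def small_draft(req):
    """Stateless small-model pass; many of these can run concurrently."""
    await asyncio.sleep(0.01)
    confident = req % 2 == 0          # pretend half the requests pass the gate
    return req, confident

async def large_agent(req, lock):
    """Stateful large-model agent; a lock mimics its serial execution."""
    async with lock:
        await asyncio.sleep(0.02)
    return req

async def funnel(requests):
    lock = asyncio.Lock()
    # Stage 1: fan out all cheap drafts at once (the wide mouth of the funnel).
    drafts = await asyncio.gather(*(small_draft(r) for r in requests))
    # Stage 2: only gate-rejected requests fall through to the serial agent.
    results = []
    for req, confident in drafts:
        if confident:
            results.append((req, "speculated"))
        else:
            results.append((await large_agent(req, lock), "full"))
    return results

out = asyncio.run(funnel(range(6)))
```

Because the drafts run concurrently and most requests exit at stage 1, the serial large model processes a much shorter queue, which is what raises throughput under concurrent workloads.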