AgentOCR: Reimagining Agent History via Optical Self-Compression
Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, Bo An
2026-01-12
Summary
This paper introduces a new way to help AI agents remember past interactions without using up too much computer memory. It focuses on making these agents, powered by large language models, more practical for real-world use.
What's the problem?
AI agents that learn through trial and error, like playing a game or answering questions, need to remember what happened in previous steps to make good decisions. However, storing all that information as text takes up a lot of space and processing power, especially as the interaction gets longer. This limits how complex these agents can be and how long they can operate effectively.
What's the solution?
The researchers developed a system called AgentOCR that represents the agent's history as a picture instead of text. Think of it like creating a visual summary of everything that's happened. They also created a clever caching system to avoid redrawing the same parts of the picture repeatedly, speeding things up. Finally, the agent itself learns to decide how much detail to include in the picture, balancing the need for information with the desire to keep the picture small and efficient.
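The caching idea can be sketched in a few lines. This is a minimal illustration of the general pattern (hash each history segment, reuse its rendered image across turns); the class and function names are ours, not the paper's, and the actual image rendering is replaced by a placeholder.

```python
import hashlib
from typing import Dict, List

def render_segment(text: str):
    """Hypothetical stand-in for rendering one history segment to an image.
    A real system would rasterize the text; here we return a placeholder
    tuple so the caching logic itself is runnable."""
    return ("rendered", text)

class SegmentOpticalCache:
    """Sketch of segment-level caching: each segment of the agent's
    history is hashed, and its rendered image is fetched from the cache
    instead of being re-rendered on every turn."""

    def __init__(self):
        self._cache: Dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    def render_history(self, segments: List[str]):
        images = []
        for seg in segments:
            key = hashlib.sha256(seg.encode()).hexdigest()
            if key in self._cache:
                self.hits += 1          # segment seen before: reuse image
            else:
                self._cache[key] = render_segment(seg)
                self.misses += 1        # new segment: render once, store
            images.append(self._cache[key])
        return images

cache = SegmentOpticalCache()
cache.render_history(["obs1 act1", "obs2 act2"])          # turn 1: both rendered
cache.render_history(["obs1 act1", "obs2 act2", "obs3"])  # turn 2: only the new segment rendered
print(cache.hits, cache.misses)  # 2 3
```

Because a multi-turn history mostly grows by appending, nearly every segment after the first turn is a cache hit, which is where the reported rendering speedup comes from.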
Why it matters?
This work is important because it makes it possible to build more capable and efficient AI agents. By significantly reducing the memory and processing requirements, AgentOCR opens the door to using these agents in more complex tasks and real-world applications where resources are limited. It shows a promising path towards making AI agents more practical and scalable.
Abstract
Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching: by decomposing the history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, in which the agent actively emits a compression rate and is trained with a compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks: ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while reducing token consumption by more than 50%, yielding consistent token and memory efficiency. Further analysis validates a 20x rendering speedup from segment optical caching and shows that self-compression learns an effective balance between task success and token cost.
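A compression-aware reward of the kind the abstract describes can be illustrated as follows. The specific functional form below (task reward minus a penalty proportional to the fraction of the token budget consumed, weighted by `lam`) is our assumption for illustration, not the paper's actual formula.

```python
def compression_aware_reward(task_success: bool, tokens_used: int,
                             token_budget: int, lam: float = 0.5) -> float:
    """Hypothetical compression-aware reward: reward task completion,
    penalize token consumption. The linear penalty and the weight `lam`
    are illustrative assumptions, not the paper's definition."""
    task_reward = 1.0 if task_success else 0.0
    efficiency_penalty = lam * min(tokens_used / token_budget, 1.0)
    return task_reward - efficiency_penalty

# A higher compression rate shrinks the rendered history, lowering
# tokens_used and hence the penalty -- but over-compressing risks
# losing information the agent needs, turning task_success False.
print(compression_aware_reward(True, 2000, 8000))   # 0.875
print(compression_aware_reward(True, 8000, 8000))   # 0.5
print(compression_aware_reward(False, 2000, 8000))  # -0.125
```

Under any reward of this shape, the RL objective pushes the agent to emit the highest compression rate that still leaves enough visual detail to solve the task, which is the adaptive trade-off the paper trains for.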