MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
Shijian Wang, Jiarui Jin, Runhao Fu, Zexuan Yan, Xingjian Wang, Mengkang Hu, Eric Wang, Xiaoxi Li, Kangning Zhang, Li Yao, Wenxiang Jiao, Xuelian Cheng, Yuan Lu, Zongyuan Ge
2026-03-31
Summary
This paper introduces MuSEAgent, a new type of AI agent designed to be better at researching and making decisions by learning from past experiences using both text and images.
What's the problem?
Current research agents often struggle to make effective use of information from previous attempts, especially when a task mixes images and text. They typically retrieve entire past 'trajectories' of actions, which is inefficient and doesn't isolate what *specifically* worked, or failed, at each step.
What's the solution?
MuSEAgent addresses this by building a 'memory bank' of individual successful steps, called 'experiences', taken during research. It works out what made each step good using a technique called 'hindsight reasoning'. When facing a new problem, it searches this bank for relevant experiences to guide its actions, combining a broad ('wide') search that gathers diverse perspectives with a focused ('deep') search that drills into the closest matches. It doesn't just remember whole sequences, but the key decisions within them.
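The idea of a quality-filtered experience bank with two retrieval modes can be sketched roughly as follows. Everything here is hypothetical: the paper does not include code, so the `Experience` fields, the quality threshold, and the bag-of-words similarity are stand-ins for whatever learned multimodal representations MuSEAgent actually uses.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt

def _tokens(text: str) -> Counter:
    """Crude bag-of-words vector; a real system would use learned embeddings."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Experience:
    """One atomic decision step, not a whole trajectory."""
    state: str      # the situation the agent was in
    action: str     # what it did there
    lesson: str     # why it worked, distilled after the fact ("hindsight")
    quality: float  # score used to filter the bank

class ExperienceBank:
    def __init__(self, min_quality: float = 0.5):
        self.min_quality = min_quality
        self.items: list[Experience] = []

    def add(self, exp: Experience) -> None:
        # Quality-filtered: only keep steps judged good in hindsight.
        if exp.quality >= self.min_quality:
            self.items.append(exp)

    def deep_search(self, query: str, k: int = 3) -> list[Experience]:
        """Focused retrieval: the k experiences most similar to the query."""
        q = _tokens(query)
        ranked = sorted(self.items,
                        key=lambda e: _cosine(q, _tokens(e.state)),
                        reverse=True)
        return ranked[:k]

    def wide_search(self, query: str, k: int = 3) -> list[Experience]:
        """Broad retrieval: relevant but mutually dissimilar experiences,
        picked greedily so different 'viewpoints' are covered."""
        q = _tokens(query)
        pool = sorted(self.items,
                      key=lambda e: _cosine(q, _tokens(e.state)),
                      reverse=True)
        picked: list[Experience] = []
        for e in pool:
            if all(_cosine(_tokens(e.state), _tokens(p.state)) < 0.5
                   for p in picked):
                picked.append(e)
            if len(picked) == k:
                break
        return picked
```

The split mirrors the paper's description at a high level: `deep_search` exploits the closest past decisions, while `wide_search` trades some similarity for diversity so the agent sees guidance from several compositional viewpoints rather than near-duplicates of one.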
Why it matters?
This research is important because it makes AI agents more capable of complex tasks that require understanding both visual and textual information. By learning from specific experiences, MuSEAgent can make better decisions and perform more effectively than previous methods, paving the way for more intelligent and helpful AI assistants.
Abstract
Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.