Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li
2025-11-28
Summary
This paper introduces ViLoMem, a memory framework that helps multimodal large language models (MLLMs) learn from their past experiences and avoid making the same mistakes repeatedly when solving problems involving both images and text.
What's the problem?
Current MLLMs are good at answering questions, but they treat each question as brand new and often repeat their errors. Existing systems try to help by remembering past attempts, but these memories are limited: they compress away essential detail and usually record only what *happened*, not *why* it happened. Crucially, they don't combine information about what the model was looking at in an image with the logical steps it took to reach an answer, which is how humans learn and remember things.
What's the solution?
ViLoMem tackles this by creating a memory system with two separate tracks. One track remembers distracting elements in images that led to errors, and the other track remembers logical reasoning mistakes. This allows the model to specifically learn from both visual and logical failures. The system builds up this knowledge gradually, updating it over time to avoid forgetting previously learned strategies, and it organizes information into reusable patterns.
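The two-track, grow-and-refine idea above can be sketched as a small data structure. This is a minimal illustrative sketch, not the paper's implementation: the class and field names (`DualStreamMemory`, `MemoryEntry`, `grow`, `retrieve`) are hypothetical, and the "refine" step is simplified to reinforcing a recurring pattern rather than duplicating it.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    # Schema-based entry: a reusable pattern, not a raw trajectory.
    pattern: str   # e.g. "cluttered axis labels" or "unit mismatch"
    guidance: str  # strategy to apply next time this pattern appears
    hits: int = 1  # how often this pattern has recurred

class DualStreamMemory:
    """Hypothetical sketch of a dual-stream memory: one track for visual
    distraction patterns, one for logical reasoning errors."""

    def __init__(self) -> None:
        self.visual: dict[str, MemoryEntry] = {}
        self.logical: dict[str, MemoryEntry] = {}

    def grow(self, stream: str, pattern: str, guidance: str) -> None:
        """Grow-and-refine: new patterns are added; recurring ones are
        reinforced and their guidance refreshed in place, so previously
        learned strategies are updated rather than overwritten wholesale."""
        store = self.visual if stream == "visual" else self.logical
        if pattern in store:
            entry = store[pattern]
            entry.hits += 1
            entry.guidance = guidance  # refine with the latest strategy
        else:
            store[pattern] = MemoryEntry(pattern, guidance)

    def retrieve(self, stream: str, top_k: int = 3) -> list[MemoryEntry]:
        """Return the most frequently reinforced patterns, e.g. to prepend
        to the model's prompt before a new attempt."""
        store = self.visual if stream == "visual" else self.logical
        return sorted(store.values(), key=lambda e: -e.hits)[:top_k]

# Example: a visual distraction recurs and is reinforced, while a
# logical error is stored on the separate track.
mem = DualStreamMemory()
mem.grow("visual", "cluttered axis labels", "zoom into the axis region first")
mem.grow("visual", "cluttered axis labels", "crop and re-read axis labels")
mem.grow("logical", "unit mismatch", "normalize units before comparing")
```

The key design point mirrored here is the separation of the two streams: a visual failure never pollutes the logical store, so retrieval can surface the right kind of guidance for each failure mode.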
Why it matters?
This research is important because it moves beyond simply remembering past actions and towards building MLLMs that truly *learn* from their experiences. By separating and remembering visual distractions and reasoning errors, ViLoMem helps models become more reliable and capable of solving complex multimodal problems over time, and even applying what they learn to new situations.
Abstract
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo, solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge, preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction–hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.