CoMemo: LVLMs Need Image Context with Image Memory
Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
2025-06-19
Summary
This paper introduces CoMemo, a new architecture for large vision-language models (LVLMs) that improves image understanding by routing visual information through two separate paths: one for image context and one for image memory.
What's the problem?
Many existing LVLMs tend to neglect important parts of images, especially as the visual input grows longer and more detailed. In addition, standard positional encodings flatten image patches into a one-dimensional sequence, losing the two-dimensional structure of the image and making spatial relationships harder for the model to capture.
What's the solution?
The researchers designed a dual-path architecture: one path processes image tokens together with text to provide context, while the other stores image information as a memory that is independent of the text context. They also introduced a new positional encoding method, RoPE-DHR, which better preserves 2D spatial relationships, even for high-resolution images. Together, these changes help the model retain important visual details and improve performance on long-context understanding, multi-image reasoning, and visual question answering.
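To make the 2D-aware positional encoding concrete, here is a minimal sketch of applying rotary position embeddings (RoPE) separately along the row and column axes of an image grid. This is an illustrative simplification, not the paper's exact RoPE-DHR formulation, which additionally handles dynamic high-resolution tiling; the function names are hypothetical.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply standard 1-D rotary position embedding to a vector
    of even length: rotate each (even, odd) pair by a frequency-
    dependent angle scaled by the position index."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def rope_2d(vec, row, col):
    """Encode a 2-D (row, col) position by rotating the first half
    of the feature vector with the row index and the second half
    with the column index, so spatial structure is kept per axis."""
    half = len(vec) // 2
    return rope_rotate(vec[:half], row) + rope_rotate(vec[half:], col)
```

Because each image token keeps its own (row, col) coordinate rather than a flattened 1D index, two tokens in the same column stay "close" along one axis even if they are far apart in raster order.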
Why it matters?
This matters because it makes AI models better at jointly analyzing complex images and text, improving applications such as visual question answering, image captioning, and any field that relies on advanced image and language understanding.
Abstract
CoMemo addresses visual information neglect and weak spatial awareness in multimodal processing through a dual-path architecture and a novel positional encoding mechanism.