Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
2025-10-23
Summary
This paper presents a new way to pinpoint objects within videos using existing AI models without needing to retrain them. It focuses on leveraging how these models already 'look' at different parts of a video when answering questions about it.
What's the problem?
Current AI models that understand videos can identify *what* is happening, but it's hard to get them to specifically *show* you *where* something is in the video without a lot of extra training. The 'attention' within these models, which shows where they're focusing, is often messy and doesn't clearly outline the objects you're looking for.
What's the solution?
The researchers developed a technique called Decomposed Attention Fusion, or DecAF. This method cleans up the attention maps by first separating what's important (the object) from what's not (the background). Then, it combines information from different frames of the video to create a more complete picture. Finally, they use this refined attention to guide another AI tool, SAM2 (the Segment Anything Model 2), to create precise outlines around the objects. Importantly, none of this requires changing how the original video understanding model works – it's all done 'on top' of it.
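The two fusion steps can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' implementation: it assumes we already have per-frame attention maps for an object-focused query (`obj_attn`) and a background/contrastive query (`bg_attn`), and the function name, shapes, and threshold are all hypothetical.

```python
import numpy as np

def decomposed_attention_fusion(obj_attn, bg_attn, frame_weight=0.5):
    """Illustrative sketch of DecAF-style fusion (hypothetical API).

    obj_attn, bg_attn: arrays of shape (T, H, W) holding per-frame
    attention maps for an object query and a background query.
    Returns a coarse binary mask of shape (T, H, W).
    """
    # (1) Contrastive object-background fusion: suppress regions that
    # the background query also attends to.
    contrast = np.clip(obj_attn - bg_attn, 0.0, None)

    # (2) Complementary video-frame fusion: blend each frame's map with
    # the video-level (temporally averaged) map to recover activations
    # a single frame may have missed.
    video_level = contrast.mean(axis=0, keepdims=True)
    fused = frame_weight * contrast + (1.0 - frame_weight) * video_level

    # Normalize per frame and threshold into a coarse mask.
    denom = fused.reshape(fused.shape[0], -1).max(axis=1).reshape(-1, 1, 1)
    fused = fused / np.maximum(denom, 1e-8)
    return fused > 0.5

# Toy example: 2 frames of 4x4 attention maps.
obj = np.zeros((2, 4, 4)); obj[:, 1:3, 1:3] = 1.0  # object region
bg = np.full((2, 4, 4), 0.2)                       # diffuse background noise
mask = decomposed_attention_fusion(obj, bg)
```

In the real pipeline, such a coarse mask would then be converted into point or box prompts for SAM2 rather than used directly.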
Why it matters?
This work is significant because it allows for accurate object localization in videos without the time and resources needed for retraining large AI models. This makes it easier to apply these powerful models to new tasks and datasets, and it brings us closer to AI systems that can truly 'see' and understand the visual world like humans do.
Abstract
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via a rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
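The rollout mechanism mentioned in the abstract is typically the attention-rollout procedure of Abnar and Zuidema (2020): head-averaged attention matrices are combined layer by layer, with the residual connection folded in, to estimate how much each output token ultimately attends to each input token. A minimal sketch, assuming standard per-layer attention tensors; the paper's exact variant may differ:

```python
import numpy as np

def attention_rollout(attn_layers):
    """Attention rollout sketch.

    attn_layers: list of per-layer attention tensors, each of shape
    (heads, N, N) with rows summing to 1. Returns an (N, N) matrix
    whose row i estimates how strongly output token i attends to each
    input token after propagating through all layers.
    """
    n = attn_layers[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attn_layers:
        a = layer_attn.mean(axis=0)              # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)            # fold in residual connection
        a = a / a.sum(axis=-1, keepdims=True)    # re-normalize rows
        rollout = a @ rollout                    # accumulate across layers
    return rollout

# Toy example: 3 layers, 2 heads, 4 tokens of random row-stochastic attention.
rng = np.random.default_rng(0)
layers = []
for _ in range(3):
    a = rng.random((2, 4, 4))
    layers.append(a / a.sum(axis=-1, keepdims=True))
rollout = attention_rollout(layers)
```

In the video-QA setting, the rollout rows for the answer tokens are what get reshaped into the spatial attention maps that DecAF then refines.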