Knowledge-based Visual Question Answering with Multimodal Processing, Retrieval and Filtering
Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye
2025-10-21
Summary
This paper focuses on improving how computers answer questions about images when they need to use outside knowledge, a task called Knowledge-based Visual Question Answering (KB-VQA). It builds on a technique called Retrieval-Augmented Generation (RAG), which means the system first finds relevant information and then uses it to generate an answer.
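The find-then-generate loop of RAG can be sketched in a few lines. This is a toy illustration only, not the paper's system: the word-overlap `retrieve` function and the `generate` stand-in are hypothetical placeholders for a real retriever and language model.

```python
# Toy sketch of the retrieval-augmented generation (RAG) pattern:
# first retrieve relevant knowledge, then generate an answer from it.

def retrieve(query, knowledge_base, top_k=3):
    """Rank knowledge snippets by a toy relevance score (word overlap)."""
    q_terms = set(query.lower().split())
    def score(doc):
        return len(q_terms & set(doc.lower().split()))
    return sorted(knowledge_base, key=score, reverse=True)[:top_k]

def generate(question, context):
    """Stand-in for a language-model call: answer grounded in retrieved text."""
    return f"Answer to {question!r}, grounded in: {context[0]}"

kb = [
    "The Eiffel Tower is 330 metres tall.",
    "Paris is the capital of France.",
    "The Louvre houses the Mona Lisa.",
]
docs = retrieve("How tall is the Eiffel Tower?", kb)
print(generate("How tall is the Eiffel Tower?", docs))
```

In a real KB-VQA system the query would also carry visual information from the image, which is exactly the part the paper improves.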
What's the problem?
Current KB-VQA systems using RAG struggle because they aren't very good at creating effective search queries that combine information about the image and the question. Also, even when they *do* search, they often retrieve information that isn't actually helpful for answering the question. Essentially, the system gets distracted by irrelevant details or asks the wrong things in the first place.
What's the solution?
The researchers developed a three-step process called Wiki-PRF. First, it 'processes' the image and question to figure out exactly what information is needed, even using 'visual tools' to help. Second, it 'retrieves' relevant knowledge using both the image and the question. Finally, it 'filters' the retrieved information to focus on the most important parts and remove anything irrelevant. They also trained the system using reinforcement learning, rewarding it for accurate answers and consistent formatting, which helps it learn to ask better questions and filter results more effectively.
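The three stages and the reward signal described above can be sketched roughly as follows. Everything here is an illustrative placeholder under stated assumptions: the paper uses a vision-language model with visual tools and a learned filter, whereas this sketch substitutes simple word-overlap scoring and an exact-match reward.

```python
# Hedged sketch of a Process -> Retrieve -> Filter pipeline in the spirit of
# Wiki-PRF. All functions are toy stand-ins, not the paper's implementation.

def process(image, question):
    """Stage 1: build a multimodal query; visual facts are hard-coded here
    as a stand-in for the paper's dynamically invoked visual tools."""
    visual_facts = ["landmark: clock tower"]  # hypothetical tool output
    return {"text": question, "visual": visual_facts}

def retrieve(query, knowledge_base, top_k=2):
    """Stage 2: rank snippets against both text and visual cues
    (toy word-overlap score in place of multimodal feature matching)."""
    terms = set(query["text"].lower().split())
    for fact in query["visual"]:
        terms |= set(fact.lower().split())
    def score(doc):
        return len(terms & set(doc.lower().split()))
    return sorted(knowledge_base, key=score, reverse=True)[:top_k]

def filter_results(question, docs, min_score=3):
    """Stage 3: drop retrieved snippets with too little overlap with the
    question (a learned relevance filter in the actual method)."""
    q_terms = set(question.lower().split())
    return [d for d in docs if len(q_terms & set(d.lower().split())) >= min_score]

def reward(predicted, gold, well_formatted):
    """Toy RL reward: accuracy term plus a small format-consistency bonus."""
    accuracy = 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0
    return accuracy + (0.1 if well_formatted else 0.0)

kb = [
    "Big Ben is the nickname of the clock tower at Westminster.",
    "The Shard is a skyscraper in London.",
    "Westminster Abbey is a Gothic church.",
]
question = "what is the name of this clock tower"
query = process(image=None, question=question)
docs = retrieve(query, kb)
kept = filter_results(question, docs)
print(kept)
```

The design point is the division of labour: the processing stage decides *what* to ask, retrieval casts a wide multimodal net, and filtering narrows the results before the answer is generated; the reward then trains the model to do the first and last steps well.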
Why it matters?
This work is important because it significantly improves the accuracy of KB-VQA systems, achieving the best results so far on standard tests. This means computers are getting better at understanding images and answering complex questions that require real-world knowledge, which has implications for applications like image search, virtual assistants, and automated reasoning.
Abstract
Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by incorporating knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, comprising Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration of the retrieved results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via reinforcement learning. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF