See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang

2025-12-29

Summary

This paper introduces a new method called Bi-directional Perceptual Shaping, or BiPS, to help large vision-language models (VLMs) better understand images when answering questions.

What's the problem?

Current VLMs sometimes struggle to really *look* at the important parts of an image. They might rely too much on the text of the question, miss small but crucial details like the lines in a chart, or perform poorly when shown images very different from what they were trained on. Also, existing fixes that help them focus, such as external tools or extra visual tokens generated during reasoning, often slow down the process of getting an answer.

What's the solution?

BiPS works by showing the model altered versions of the original image during training. First, it pairs the original image with a view that keeps only the question-relevant regions, and trains the model to answer the same way on both, which encourages it to cover all of the pixels that actually support the answer. Then, it pairs the original with a view where those key regions are masked out so the image no longer supports the answer, and trains the model's answer to change in that case, which stops it from just guessing from the question text. This 'shaping' process trains the model to pay attention to the right things in the image and avoid shortcuts.
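For a more concrete picture, here is a minimal sketch of how the two altered views could be built from a question-relevance mask. The masking recipe, the gray fill value, and the function name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_views(image, relevant_mask):
    """Build the two altered views described above (illustrative only).

    image: H x W x 3 uint8 array.
    relevant_mask: H x W boolean array marking pixels judged relevant
    to the question.
    """
    gray = np.full_like(image, 127)  # neutral fill for hidden regions

    # Evidence-preserving view: keep only the question-relevant pixels.
    preserve = np.where(relevant_mask[..., None], image, gray)

    # Evidence-ablated view: hide the question-relevant pixels so the
    # image no longer supports the original answer.
    ablate = np.where(relevant_mask[..., None], gray, image)

    return preserve, ablate
```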

Why it matters?

This research is important because it improves the accuracy of VLMs by a meaningful margin, an average gain of 8.2% across eight benchmarks for the Qwen2.5-VL-7B model, and makes them more reliable when dealing with new types of images or data they were not trained on. This means these models can be more useful in real-world applications where they need to understand visual information accurately and consistently.

Abstract

Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
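As a rough sketch of how the two constraints in the abstract might be written down, the PyTorch-style code below computes a KL-consistency term between the original and evidence-preserving views and a margin-based KL-separation term between the original and evidence-ablated views, applied to the model's answer distributions. The hinge form, the weighting coefficients, and all names (bips_shaping_loss, lambda_cons, lambda_sep, margin) are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def bips_shaping_loss(logits_orig, logits_preserve, logits_ablate,
                      lambda_cons=1.0, lambda_sep=1.0, margin=1.0):
    """Hypothetical BiPS-style shaping terms on answer-token logits.

    Each logits_* tensor has shape [batch, vocab] and holds the logits for
    the same answer position under the original image, the
    evidence-preserving view, and the evidence-ablated view.
    """
    log_p_orig = F.log_softmax(logits_orig, dim=-1)
    log_p_pres = F.log_softmax(logits_preserve, dim=-1)
    log_p_abl = F.log_softmax(logits_ablate, dim=-1)

    # KL-consistency: the evidence-preserving view should produce the same
    # answer distribution as the full image (minimize the divergence).
    kl_cons = F.kl_div(log_p_pres, log_p_orig,
                       reduction="batchmean", log_target=True)

    # KL-separation: the evidence-ablated view should NOT produce the same
    # distribution; push the divergence up to a margin (hinge is assumed).
    kl_sep = F.kl_div(log_p_abl, log_p_orig,
                      reduction="batchmean", log_target=True)
    sep_loss = F.relu(margin - kl_sep)

    return lambda_cons * kl_cons + lambda_sep * sep_loss
```

In this reading, the consistency term pulls the preserved-view prediction toward the full-image prediction, while the separation term penalizes the model whenever masking out the critical pixels fails to change its answer distribution, which is one way to discourage text-only shortcuts.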