Watch Before You Answer: Learning from Visually Grounded Post-Training
Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen
2026-04-08
Summary
This paper investigates a problem with how we currently test and improve vision-language models — AI systems that jointly process images (or videos) and text — with a focus on video understanding.
What's the problem?
Many of the benchmarks used to measure how well these models understand videos are flawed: the paper finds that 40-60% of the questions can be answered from the text alone, without ever 'seeing' the video content. This means we're overestimating how well these models truly grasp visual information. Worse, the datasets used to improve models after initial training (post-training) suffer from the same issue, which makes it hard to actually improve their video understanding abilities.
What's the solution?
The researchers created a new, cleaner dataset called VidGround. It keeps only the questions that *require* watching the video to answer, discarding those answerable from text alone. When they used this dataset with standard RL-based post-training methods, models improved significantly compared to training on the original, flawed datasets — even though VidGround contains less data overall.
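The filtering idea can be sketched in a few lines: run a "blind" text-only model on each question (no video), and keep only the questions it gets wrong. The code below is an illustrative sketch, not the paper's actual pipeline — the stub `answer_without_video` (here a longest-option heuristic standing in for a text-only LLM) and all field names are hypothetical.

```python
# Hedged sketch of blind filtering: keep a question only if a text-only
# model, shown the question and options but NOT the video, fails on it.

def answer_without_video(question, options):
    """Stand-in for a text-only LLM. Here: a naive heuristic that picks
    the longest option, a common linguistic bias in flawed benchmarks."""
    return max(options, key=len)

def filter_visually_grounded(dataset):
    """Keep only items the blind model answers incorrectly, i.e. the
    questions that presumably require actually watching the video."""
    grounded = []
    for item in dataset:
        blind_guess = answer_without_video(item["question"], item["options"])
        if blind_guess != item["answer"]:
            grounded.append(item)  # blind model failed -> visually grounded
    return grounded

dataset = [
    # Requires the video: the blind heuristic guesses "blue" and fails.
    {"question": "What color is the car?",
     "options": ["red", "blue"],
     "answer": "red"},
    # Answerable by linguistic bias alone: the longest option is correct,
    # so the blind model succeeds and the item is dropped.
    {"question": "What happens at the end?",
     "options": ["it ends", "the protagonist wins the race"],
     "answer": "the protagonist wins the race"},
]

filtered = filter_visually_grounded(dataset)
print(len(filtered))  # -> 1 (only the visually grounded question survives)
```

A real curation pipeline would likely ensemble several text-only models or sample multiple answers per question to reduce noise, but the keep-if-blind-model-fails logic is the core of the idea.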
Why it matters?
This work shows that the quality of the data used to train and evaluate these models is crucial. Simply throwing more data at the problem isn't enough: the data must genuinely require the model to understand the video. Curating better datasets and benchmarks will be key to making real progress in AI video understanding.
Abstract
It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.