Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

2025-07-14

Summary

This paper introduces Open Vision Reasoner, an AI system that transfers reasoning behaviors learned by language models to improve how computers understand and reason about images.

What's the problem?

AI systems often struggle with complex visual problems because they treat images as simple pictures rather than as meaningful scenes whose parts relate to one another, which limits their reasoning abilities.

What's the solution?

The researchers created a two-stage process built on Qwen2.5-VL-7B. In the first stage, the model learns language-based reasoning. In the second, multimodal reinforcement learning combines that reasoning with visual data, helping the model connect language reasoning with what it 'sees' in images.

Why does it matter?

The work advances AI's ability to understand images in depth and solve complicated visual tasks, bringing models closer to human-like reasoning about pictures.

Abstract

A two-stage paradigm using Qwen2.5-VL-7B and multimodal reinforcement learning achieves state-of-the-art performance on visual reasoning benchmarks.
