OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue
2025-12-04
Summary
This paper introduces OneThinker, a new artificial intelligence model designed to handle a wide variety of visual reasoning tasks across both images and videos, like answering questions, writing captions, locating objects and moments in time, tracking objects, and segmenting them.
What's the problem?
Current AI models that can 'think' visually usually need to be trained separately for each task, so a model that is good at answering questions about images might not be able to caption videos. This is inefficient and prevents tasks from learning from each other; it's like having to learn a new language for every subject in school. It makes it hard to build a single, versatile AI that can handle many different visual challenges.
What's the solution?
The researchers created OneThinker, a single model trained on a corpus of 600,000 examples covering many different visual tasks. They used existing commercial AI models to annotate step-by-step reasoning (chain-of-thought) for 340,000 of these examples, which served as a supervised fine-tuning "cold start" before reinforcement learning. They then developed a new technique called EMA-GRPO, which tracks a moving average of each task's reward variability so that the reinforcement learning stage stays balanced across tasks and the model doesn't get too focused on just one. Essentially, they built a comprehensive training program and a smart learning system for the AI.
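The balancing idea behind EMA-GRPO can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's exact formulation: it assumes group-relative advantages (as in GRPO) are normalized by a per-task exponential moving average of reward standard deviations, and names like `EMAGRPONormalizer` and the decay value are made up for this sketch.

```python
class EMAGRPONormalizer:
    """Per-task EMA of reward standard deviations, used to scale
    group-relative advantages so no single task's reward scale
    dominates the multi-task RL updates (illustrative sketch)."""

    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.ema_std: dict[str, float] = {}  # task name -> EMA of reward std

    def advantages(self, task: str, rewards: list[float]) -> list[float]:
        # Group statistics over the sampled rollouts for one prompt.
        n = len(rewards)
        mean = sum(rewards) / n
        std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
        # Update this task's moving average of the reward std.
        prev = self.ema_std.get(task, std)
        ema = self.decay * prev + (1.0 - self.decay) * std
        self.ema_std[task] = ema
        # Normalize by the task-wise EMA std rather than the raw
        # per-group std; epsilon guards zero-variance groups.
        return [(r - mean) / (ema + 1e-8) for r in rewards]
```

Using the EMA std instead of each group's raw std means a task with noisy, high-variance rewards (e.g. segmentation IoU) is scaled comparably to a task with near-binary rewards (e.g. multiple-choice QA), which is the heterogeneity problem the paper describes.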
Why it matters?
OneThinker is a step towards creating a more general and useful AI. Instead of needing a separate AI for each visual task, we could potentially have one model that can do it all. This saves time and resources, and allows the AI to learn more effectively by sharing knowledge between different tasks. The fact that it can even perform some tasks it wasn’t specifically trained for suggests a path towards AI that can truly understand and reason about the visual world.
Abstract
Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments show that OneThinker delivers strong performance on 31 benchmarks across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, models, and data are released.