Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
2025-10-24
Summary
This paper introduces Conan, a new system designed to help AI understand and reason about videos by deducing step by step over multiple pieces of visual evidence, improving its accuracy in answering questions about what happens in a video.
What's the problem?
Current AI models struggle with truly *understanding* videos. They often make guesses that aren't based on what's actually shown (hallucinations) or have trouble pinpointing the exact parts of a video that support their reasoning. Some methods use reinforcement learning to improve reasoning, but their text-only reasoning chains can drift away from the visual content; others retrieve relevant video frames but still miss important details or localize the evidence incorrectly.
What's the solution?
The researchers created Conan, which works in three main steps: identifying important video frames that provide context and evidence, reasoning across those frames to connect clues, and deciding whether it has enough information to answer the question or needs to look for more evidence. To train Conan, they built Conan-91K, a large dataset of videos paired with step-by-step reasoning traces, and used a staged training recipe: a progressive cold-start followed by reinforcement learning with verifiable rewards.
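The identify-reason-act loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the actual implementation: the real Conan uses a multimodal LLM for each step, whereas here `identify_frames` and `reason_over` are hypothetical keyword-matching stand-ins, and frames are represented as caption dictionaries.

```python
# Toy sketch of an identify-reason-act loop over video frames.
# All function and field names here are illustrative, not Conan's API.

def identify_frames(frames, question):
    """Split frames into 'evidence' (caption shares a word with the
    question) and 'context' (everything else). Toy heuristic only."""
    keywords = set(question.lower().split())
    evidence = [f for f in frames if keywords & set(f["caption"].lower().split())]
    context = [f for f in frames if f not in evidence]
    return evidence, context

def reason_over(evidence):
    """Toy reasoning step: chain the clues from the evidence frames."""
    return " -> ".join(f["caption"] for f in evidence)

def answer_question(frames, question, min_evidence=2, max_rounds=3):
    """Start from a coarse sample of frames; each round, either conclude
    (enough evidence found) or explore (widen the frame window)."""
    window = frames[: max(1, len(frames) // 2)]
    for _ in range(max_rounds):
        evidence, _context = identify_frames(window, question)
        if len(evidence) >= min_evidence:        # action: conclude
            return reason_over(evidence)
        # action: explore — double the number of frames examined
        window = frames[: min(len(frames), 2 * len(window))]
    evidence, _ = identify_frames(frames, question)
    return reason_over(evidence) if evidence else "insufficient evidence"

frames = [
    {"t": 0, "caption": "a man enters the kitchen"},
    {"t": 5, "caption": "he picks up a red cup"},
    {"t": 9, "caption": "he leaves with the cup"},
]
print(answer_question(frames, "what cup does the man take"))
```

The key design point mirrored here is the adaptive action decision: the system does not process every frame up front, but expands its evidence set only when the current clues are insufficient to conclude.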
Why it matters?
This work is important because it significantly improves the ability of AI to reason about videos, achieving better results than previous methods. This is a step towards AI systems that can truly understand and interact with the visual world, which has applications in areas like robotics, self-driving cars, and video analysis.
Abstract
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.