Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou
2025-12-03
Summary
This paper introduces Skywork-R1V4, an AI model that solves complex problems by combining what it 'sees' in images with information it retrieves from the web.
What's the problem?
Current AI systems that handle both image-related tasks and web search often treat them as separate capabilities, which limits their effectiveness. They also typically require complex and expensive reinforcement-learning training, and their planning is rarely grounded in actually *doing* things and observing the results. In short, they struggle to connect thinking, acting, and learning from those actions.
What's the solution?
The researchers created Skywork-R1V4, a 30-billion-parameter model that unifies these capabilities. It can plan what to do, actively manipulate images to aid its reasoning, search the web in depth, and fluidly alternate between examining images and retrieving more information online. Importantly, it was trained with plain supervised fine-tuning on a relatively small amount of carefully prepared example data (fewer than 30,000 trajectories), filtered with a step-by-step consistency check to ensure the examples were reliable, avoiding the need for complicated reinforcement learning.
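The "consistency check" on training examples can be pictured as a simple filter over recorded trajectories. This is only an illustrative sketch of the idea (the function names, data layout, and exact matching criterion are our assumptions, not the paper's published implementation): keep a trajectory only if every tool call that was actually executed matches the corresponding planned step.

```python
# Hedged sketch of stepwise planning-execution consistency filtering.
# The dict keys ("plan", "execution") and the exact-match criterion are
# hypothetical; the paper's real filter may use a richer comparison.

def is_consistent(planned_steps, executed_steps):
    """A trajectory is consistent if execution followed the plan step by step."""
    if len(planned_steps) != len(executed_steps):
        return False
    return all(p == e for p, e in zip(planned_steps, executed_steps))

def filter_trajectories(trajectories):
    """Keep only planning-execution-consistent trajectories for SFT."""
    return [t for t in trajectories if is_consistent(t["plan"], t["execution"])]

data = [
    {"plan": ["crop", "search"], "execution": ["crop", "search"]},  # kept
    {"plan": ["crop", "search"], "execution": ["search"]},          # dropped
]
kept = filter_trajectories(data)
```

The design intuition is that supervised fine-tuning on trajectories where the plan and the actual tool executions agree teaches the model plans it can really follow, rather than plausible-sounding plans that diverge from what the tools return.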
Why it matters?
This work shows that a highly capable agentic AI can be built without relying on expensive and difficult reinforcement learning. Skywork-R1V4 actually *outperforms* Gemini 2.5 Flash on all 11 benchmark metrics, and at inference time it can chain together more than 10 tool calls to solve complex, multi-step problems, a level of emergent long-horizon reasoning that's exciting for the future of AI.
Abstract
Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
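The "interleaved reasoning" the abstract describes (alternating between visual operations and external retrieval until the task is solved) can be sketched as a plain agent loop. Everything below is a minimal illustration under our own assumptions: the tool names (`crop_image`, `web_search`), the planner interface, and the stopping rule are hypothetical stand-ins, not the model's actual API.

```python
# Minimal sketch of an interleaved multimodal agent loop. At each step a
# planner (here a scripted stand-in for the model) picks either an image
# operation or a web search; the loop executes it, records the observation,
# and stops when the planner decides it can answer (or a call budget is hit).

def crop_image(state, region):
    # Hypothetical "thinking with images" tool: zoom into a region of interest.
    return f"cropped({state['image']}, {region})"

def web_search(state, query):
    # Hypothetical retrieval tool: fetch external knowledge for a query.
    return f"results_for({query})"

TOOLS = {"crop_image": crop_image, "web_search": web_search}

def scripted_planner(state):
    """Stand-in for the model's planner: returns (tool, arg) or None to stop."""
    steps = [("crop_image", "logo region"), ("web_search", "logo brand name")]
    if state["step"] < len(steps):
        return steps[state["step"]]
    return None  # planner judges it has enough evidence to answer

def run_agent(image, question, planner=scripted_planner, max_calls=10):
    state = {"image": image, "question": question, "step": 0, "trace": []}
    while state["step"] < max_calls:
        decision = planner(state)
        if decision is None:
            break
        tool, arg = decision
        observation = TOOLS[tool](state, arg)  # execute the tool, observe result
        state["trace"].append((tool, observation))
        state["step"] += 1
    return state["trace"]

trace = run_agent("photo.jpg", "What brand is this logo?")
```

In the real system the planner is the model itself and the trace can grow past 10 tool calls on hard tasks; the point of the sketch is the alternation structure, where each observation feeds back into the next planning step.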