Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou
2025-04-09
Summary
This paper introduces Skywork R1V, an AI model that can understand both images and text, solving problems like math questions or science diagrams by breaking them down step by step, like a student showing their work.
What's the problem?
Existing AI models either struggle to combine visual and text information effectively or waste time overthinking simple problems when processing images and text together.
What's the solution?
Skywork R1V uses a lightweight adapter to connect vision and language parts without retraining them, a hybrid training method to align images with text, and smart step-by-step reasoning that adjusts how detailed its explanations need to be.
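To make the "lightweight adapter" idea concrete, here is a minimal sketch of a visual projector: a small MLP that maps frozen vision-encoder features into the language model's embedding space, so only the projector's weights need training. The dimensions and layer sizes below are assumptions for illustration, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical dimensions; the real Skywork R1V sizes are not stated here.
VISION_DIM = 1024   # assumed width of frozen vision-encoder features
LLM_DIM = 4096      # assumed hidden size of the frozen language model

rng = np.random.default_rng(0)

class VisualProjector:
    """Two-layer MLP mapping vision features to LLM embedding space.
    Only these weights would be trained; the vision encoder and the
    language model on either side stay frozen."""

    def __init__(self, vision_dim, llm_dim, hidden_dim=2048):
        self.w1 = rng.normal(0.0, 0.02, (vision_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, (hidden_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def forward(self, features):
        # features: (num_image_tokens, vision_dim)
        h = np.maximum(features @ self.w1 + self.b1, 0.0)  # ReLU
        return h @ self.w2 + self.b2  # (num_image_tokens, llm_dim)

projector = VisualProjector(VISION_DIM, LLM_DIM)
image_tokens = rng.normal(size=(256, VISION_DIM))  # mock encoder output
projected = projector.forward(image_tokens)
print(projected.shape)
```

The projected image tokens can then be concatenated with text token embeddings and fed to the language model, which is what lets the two pretrained components cooperate without retraining either one.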
Why it matters?
This helps create AI tools that can solve real-world problems involving both pictures and words, like homework helpers for math diagrams or medical AI analyzing X-rays with patient notes, while being efficient enough to run on standard computers.
Abstract
We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
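The GRPO component mentioned in the abstract replaces a learned value critic with a group-relative baseline: several responses are sampled per prompt, and each response's advantage is its reward normalized against the group. A minimal sketch of that advantage computation (the reward scheme below is an illustrative assumption, not the paper's exact setup):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Core GRPO idea: score each sampled response relative to the
    group of responses drawn from the same prompt, so no separate
    value/critic model is needed to estimate a baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled reasoning chains for one prompt, graded
# 1.0 (correct final answer) or 0.0 (wrong) by a rule-based checker.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # correct chains get positive advantage, wrong ones negative
```

These advantages weight the policy-gradient update, pushing the model toward the reasoning chains that scored above the group average.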