
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang

2025-07-02

Summary

This paper introduces GLM-4.1V-Thinking, a vision-language model that is very good at understanding and reasoning over images and text together. The model is trained with a method centered on reasoning, which helps it perform well across many difficult tasks like solving science problems, understanding videos, and reading long documents.

What's the problem?

The problem is that many existing models struggle to perform well across a wide variety of tasks that require looking at images and reading text at the same time, especially when those tasks demand deep, multi-step reasoning across different subject areas.

What's the solution?

The researchers built GLM-4.1V-Thinking by first pretraining a strong vision foundation model and then applying a training method called Reinforcement Learning with Curriculum Sampling (RLCS) to improve its reasoning skills. This approach allowed the model to get better at many tasks at once and even outperform much larger models on several benchmarks.
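
The summary above doesn't spell out how the curriculum sampling works, but the general idea behind curriculum sampling in RL training is to pick practice problems whose difficulty matches the model's current ability, so each rollout is informative rather than trivially easy or hopelessly hard. Here is a minimal hypothetical sketch of that idea; the function name, weighting formula, and difficulty scores are all illustrative assumptions, not details from the paper:

```python
import random

def curriculum_sample(tasks, model_pass_rate, batch_size, rng=None):
    """Sample a training batch, favoring tasks near the model's ability.

    tasks: list of (task_id, difficulty) pairs, difficulty in [0, 1].
    model_pass_rate: current overall success rate in [0, 1]. Tasks whose
        difficulty is close to (1 - model_pass_rate) are treated as most
        informative for learning (an illustrative heuristic, not the
        paper's actual rule).
    """
    rng = rng or random.Random(0)
    target = 1.0 - model_pass_rate
    # Give higher sampling weight to tasks near the target difficulty;
    # the 0.05 floor keeps weights finite and leaves every task some chance.
    weights = [1.0 / (0.05 + abs(d - target)) for _, d in tasks]
    return rng.choices([tid for tid, _ in tasks], weights=weights, k=batch_size)

# With a 50% pass rate, medium-difficulty tasks dominate the batch:
batch = curriculum_sample([("easy", 0.1), ("medium", 0.5), ("hard", 0.9)],
                          model_pass_rate=0.5, batch_size=100)
```

As the model's pass rate rises during training, the target difficulty shifts upward, so the same sampler automatically feeds the model harder problems over time.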

Why it matters?

This matters because a model that can understand and reason over both pictures and words at a high level opens up many new possibilities in AI, such as better educational tools, smarter robots, and improved ways for computers to help solve complex problems.

Abstract

A vision-language model (VLM) named GLM-4.1V-Thinking, developed with a reasoning-centric training framework, achieves state-of-the-art performance across various tasks, including STEM problem solving, video understanding, and long document understanding, outperforming larger models on many benchmarks.