GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji
2026-04-30
Summary
This paper introduces GLM-5V-Turbo, a new artificial intelligence model designed to be a strong foundation for creating AI agents that can perceive and act on many kinds of input (images, videos, documents, interfaces), not just text.
What's the problem?
Current AI models are really good at understanding and generating text, but they often struggle with other kinds of information, such as images, videos, webpages, or a computer's graphical interface. They treat these inputs as add-ons to a language model rather than as core parts of their understanding, which limits their ability to act intelligently in complex situations.
What's the solution?
The researchers built GLM-5V-Turbo from the ground up so that understanding different types of data (text, images, videos, and more) is a central part of its reasoning process rather than an afterthought. They improved the model's design, its training on multimodal data, and its refinement through reinforcement learning; they also expanded the set of tools it can use and connected it to existing agent frameworks. As a result, the model can write code from visual information, invoke tools based on what it sees, and generally act more like an intelligent agent.
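To make "visual tool use" concrete, here is a minimal, hypothetical sketch of the kind of agent loop described above: a model receives a task and a screenshot, reasons over both, and either calls a tool or returns a final answer. Everything here (the `call_model` stub, the `TOOLS` table, the message format) is an illustrative assumption, not GLM-5V-Turbo's actual interface.

```python
# Illustrative sketch of a multimodal agent loop: a vision-language model
# perceives an image, decides on a tool call, and acts until it can answer.
# All names (call_model, TOOLS, Action) are hypothetical, not the real API.
from dataclasses import dataclass

@dataclass
class Action:
    tool: str | None    # name of the tool to invoke, or None to finish
    args: dict          # arguments for the tool
    answer: str | None  # final answer when tool is None

def call_model(messages: list[dict]) -> Action:
    """Stand-in for a multimodal model call; a real system would send
    text *and* image content to the model and parse its tool request."""
    # Stub so the sketch runs end to end: finish immediately.
    return Action(tool=None, args={}, answer="done")

TOOLS = {
    # Hypothetical tools keyed by name; each takes kwargs, returns text.
    "click": lambda x=0, y=0: f"clicked at ({x}, {y})",
    "read_page": lambda url="": f"contents of {url}",
}

def agent_loop(task: str, screenshot_path: str, max_steps: int = 10) -> str:
    # The conversation mixes text and image content, so perception is part
    # of every reasoning step rather than a one-off preprocessing pass.
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": task},
            {"type": "image", "path": screenshot_path},
        ]},
    ]
    for _ in range(max_steps):
        action = call_model(messages)
        if action.tool is None:                     # model is done
            return action.answer
        result = TOOLS[action.tool](**action.args)  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(agent_loop("Open the settings page", "screen.png"))
```

The sketch only shows where perception sits in the loop; in the actual system, the model, the toolchain, and the end-to-end verification the paper emphasizes are far richer.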
Why it matters?
This work is important because it provides a practical blueprint for building AI agents that can genuinely understand and interact with the world around them. It shows that multimodal perception, the ability to process different types of information together, is key to creating more capable and reliable AI systems, and it offers insights into how to build and verify these complex agents.
Abstract
We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.