Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao

2025-12-19

Summary

This paper introduces Kling-Omni, a new system that can create realistic videos from different kinds of inputs like text, images, and even other videos.

What's the problem?

Currently, AI video creation is usually split across separate tools: one for generating video, another for editing it, and another for understanding what the user actually wants. These pieces don't always work well together, and it's hard to give such a pipeline complex instructions that mix several types of input at once, like a text description plus a reference image. Existing methods often chain together many specialized components and aren't very flexible.

What's the solution?

Kling-Omni solves this by combining all of these steps into one end-to-end system. It accepts text, images, or videos as input, processes them together into a single unified multimodal representation, and then generates a high-quality video based on that combined understanding. The team built a large collection of video data to train the system, and used efficient large-scale pre-training strategies along with inference optimizations to make it practical to run. The result is a single model that can handle a wide range of video creation tasks, from generating a clip from scratch to editing an existing one (see the sketch below).
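To make the "one system for all inputs" idea concrete, here is a minimal hypothetical sketch of what such a unified interface could look like. The paper does not describe a public API, so every name below (MultimodalPrompt, encode, generate_video) is an illustrative assumption, not Kling-Omni's actual implementation.

```python
# Hypothetical sketch of a unified multimodal video-generation interface.
# None of these names come from the Kling-Omni paper; they only illustrate
# routing text, images, and video context through ONE entry point instead
# of separate generation/editing/understanding tools.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MultimodalPrompt:
    """Bundle every kind of user input into a single request."""
    text: Optional[str] = None             # natural-language instruction
    reference_images: List[str] = field(default_factory=list)  # image paths
    video_context: Optional[str] = None    # path to a video to edit/extend


def encode(prompt: MultimodalPrompt) -> list:
    """Stand-in for the unified multimodal representation:
    all inputs are mapped into one shared token sequence."""
    tokens = []
    if prompt.text:
        tokens.append(("text", prompt.text))
    tokens.extend(("image", path) for path in prompt.reference_images)
    if prompt.video_context:
        tokens.append(("video", prompt.video_context))
    return tokens


def generate_video(prompt: MultimodalPrompt) -> str:
    """One call covers generation, editing, and instruction following,
    because every task is just a different mix of inputs."""
    unified = encode(prompt)
    # A real model would condition a video generator on `unified`;
    # this placeholder only reports what would be generated.
    return f"<video conditioned on {len(unified)} multimodal tokens>"


if __name__ == "__main__":
    # Text-to-video and video editing go through the same interface --
    # only the mix of inputs differs.
    print(generate_video(MultimodalPrompt(text="a red fox running in snow")))
    print(generate_video(MultimodalPrompt(
        text="make the fox wear a scarf",
        video_context="fox.mp4",
    )))
```

The point of the sketch is the design choice, not the stub logic: because every task is expressed as a different combination of inputs to the same call, there is no separate "editing mode" or "generation mode" to wire together.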

Why it matters?

This work is important because it's a step towards creating AI systems that can truly understand and interact with the real world. Instead of just making videos, Kling-Omni could eventually be part of a system that can simulate entire environments, reason about them, and respond to changes, almost like a virtual world you can interact with.

Abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual-language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality, highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, Kling-Omni is, we believe, a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.