
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, Limin Wang

2025-12-02


Summary

This paper focuses on improving how computers 'understand' videos by teaching them to predict what happens next and how things relate to each other within a video, without relying heavily on human-written descriptions.

What's the problem?

Currently, the best methods for video understanding rely on large amounts of text descriptions written by people, but these descriptions aren't always accurate or detailed enough to capture everything happening in a video – things like how objects move, their 3D shape, or basic physics. Another approach, called masked video modeling, tries to learn directly from the video itself, but it hasn't performed as well as text-supervised methods. The researchers found that a key issue with this direct learning is that it's hard for the model to capture both the big-picture meaning of the video and the fine details at the same time, and it often finds easy but unhelpful shortcuts for solving the prediction task.

What's the solution?

The researchers developed a new system called InternVideo-Next that splits the learning process into three parts: an encoder, a predictor, and a decoder. The predictor acts like a 'world model' – it tries to understand how the video world works. They also created a two-stage training process. First, they used a special decoder (a conditional diffusion decoder) that helps the model learn both detailed visual information and overall meaning. Then, they trained the predictor to anticipate the frozen outputs of the first-stage model, which prevents it from taking shortcuts and forces it to truly understand the video content. The whole system learns from videos without needing any human-written descriptions.
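To make the encoder-predictor-decoder idea concrete, here is a minimal toy sketch of the data flow. All names, dimensions, and the stand-in "networks" (random linear maps) are illustrative assumptions, not the paper's actual architecture: the encoder sees only the visible patches, the predictor fills in latents for the masked positions, and the decoder maps those latents back toward pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper): 16 patch tokens per clip,
# 64-d pixel patches, 32-d latents, roughly 75% of tokens masked.
T, D_PIX, D_LAT = 16, 64, 32
mask = rng.random(T) < 0.75            # True = masked (hidden from the encoder)
video_patches = rng.normal(size=(T, D_PIX))

# Stand-in "networks": random linear maps, used only to show shapes and flow.
W_enc  = rng.normal(size=(D_PIX, D_LAT)) / np.sqrt(D_PIX)   # encoder
W_pred = rng.normal(size=(D_LAT, D_LAT)) / np.sqrt(D_LAT)   # predictor (world model)
W_dec  = rng.normal(size=(D_LAT, D_PIX)) / np.sqrt(D_LAT)   # decoder

# Encoder: encode only the visible tokens.
visible_latents = video_patches[~mask] @ W_enc

# Predictor: infer latents at masked positions from the visible context
# (a crude mean-pool stand-in for attention over the context tokens).
context = visible_latents.mean(axis=0)
predicted_latents = np.tile(context, (int(mask.sum()), 1)) @ W_pred

# Decoder: map predicted latents back toward pixel space.
reconstruction = predicted_latents @ W_dec

print(visible_latents.shape, predicted_latents.shape, reconstruction.shape)
```

The point of separating the predictor from the decoder is that the predictor can work entirely in latent space, while the decoder alone carries the burden of producing pixel-level detail.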

Why it matters?

This research is important because it provides a way to teach computers to understand videos more effectively, without relying on potentially flawed or limited human-created text. This could lead to better video analysis for things like self-driving cars, robotics, and more advanced video search and understanding capabilities, and it offers a way to scale up video understanding to massive amounts of unlabeled video data.

Abstract

Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level objective often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor's output latents to be linearly projectable into, and thus separable in, pixel space, causing a conflict with semantic abstraction. Our Stage 1 introduces a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.
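The Stage 2 objective described above (predicting frozen Stage 1 targets at masked positions) can be sketched as follows. This is a toy illustration under assumed dimensions, not the paper's implementation: the frozen Stage 1 encoder supplies fixed targets, and the loss is computed only where content was masked, so the predictor must infer hidden content from context rather than copy its inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (dimensions are illustrative, not from the paper).
T, D = 16, 32
mask = rng.random(T) < 0.75
stage1_latents = rng.normal(size=(T, D))   # frozen Stage 1 encoder outputs (targets)
predicted      = rng.normal(size=(T, D))   # online predictor outputs

# Targets are frozen: no gradient flows into them. NumPy arrays are inert,
# but in a real framework this would be a stop-gradient / detach.
targets = stage1_latents.copy()

# Stage 2 loss: match the frozen targets only at masked positions.
loss = float(np.mean((predicted[mask] - targets[mask]) ** 2))
print(round(loss, 4))
```

Restricting the loss to masked positions is what makes this a prediction task rather than an identity mapping; combined with frozen targets, it closes off the shortcut solutions the abstract mentions.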