Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Efstathios Karypidis, Spyros Gidaris, Nikos Komodakis
2026-04-17
Summary
This paper introduces Re2Pix, a new method for predicting future video frames that aims to make those predictions both visually realistic and semantically coherent, for example in a self-driving car scenario.
What's the problem?
Predicting future video is hard because you need both sharp, clear images *and* a scene that makes logical sense over time; existing methods often struggle with one or both. Directly predicting the pixels themselves is difficult, and small errors can quickly compound into unrealistic or nonsensical future scenes. In addition, models trained only on ground-truth inputs struggle at inference time, when they must consume their own imperfect predictions, creating a mismatch between training and real-world use.
What's the solution?
Re2Pix tackles this by breaking the prediction process into two steps. First, it predicts the *structure* of the future scene, what objects are where, in the feature space of a pre-trained vision model. Then it uses this predicted structure to guide a latent diffusion model that generates the actual images. To handle imperfect predictions at inference time, the authors introduce two techniques, 'nested dropout' and 'mixed supervision', that make the model more robust to its own mistakes.
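As a rough illustration of the nested-dropout idea, the sketch below applies it to a conditioning feature map: a random truncation index is sampled and all channels beyond it are zeroed, so the generator learns to work from partially informative structure. This is a minimal NumPy sketch under assumed shapes; the function name, shapes, and sampling choices are illustrative, not the paper's exact implementation.

```python
import numpy as np

def nested_dropout(features, rng=None):
    """Illustrative nested dropout over the channel dimension.

    Samples a truncation index k and zeroes every channel at index k
    or beyond, so downstream models are trained to tolerate
    partially-informative conditioning. `features` has shape
    (tokens, channels); shapes here are an assumption for the sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_tokens, n_channels = features.shape
    k = int(rng.integers(1, n_channels + 1))  # keep at least one channel
    out = features.copy()
    out[:, k:] = 0.0  # drop the "tail" of the representation
    return out, k
```

During training one would apply this to the representations fed to the image generator, so that at inference time noisy autoregressive predictions look less out-of-distribution.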
Why it matters?
This work is important because it improves the quality and consistency of video predictions, which is crucial for applications like self-driving cars where understanding what might happen next is essential for safety and decision-making. By focusing on scene structure first, the model creates more believable and useful predictions, and it does so more efficiently than previous methods.
Abstract
Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix.
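The mixed-supervision strategy described above can be pictured as conditioning the generator, per training sample, on either the ground-truth representation or the model's own prediction. The sketch below is a hypothetical NumPy rendering of that idea; the function name, `p_pred` parameter, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def mixed_supervision_condition(gt_repr, pred_repr, p_pred=0.5, rng=None):
    """Per-sample mix of ground-truth and self-predicted conditioning.

    With probability `p_pred`, a sample is conditioned on the model's
    own (imperfect) predicted representation instead of the ground
    truth, narrowing the gap between training and autoregressive
    inference. Assumed shapes: (batch, tokens, channels).
    """
    rng = np.random.default_rng() if rng is None else rng
    use_pred = rng.random(gt_repr.shape[0]) < p_pred  # one draw per sample
    mask = use_pred[:, None, None]                    # broadcast over tokens/channels
    return np.where(mask, pred_repr, gt_repr)
```

Each training batch then contains a blend of "clean" and "self-generated" conditioning signals, which is one common way to mitigate exposure bias in autoregressive pipelines.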