RELIC: Interactive Video World Model with Long-Horizon Memory

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan

2025-12-04

Summary

This paper introduces RELIC, a new system for creating realistic, interactive virtual worlds. It lets users explore these worlds in real time while the system remembers what happened over long stretches of exploration and responds precisely to user commands.

What's the problem?

Building truly interactive virtual worlds is hard because current methods struggle to do three things at once: respond instantly to user input, remember events over long time spans, and maintain a consistent understanding of the 3D space. Improving one aspect often makes the others worse: remembering a lot of past information, for example, can slow the system down and make it feel laggy. Existing systems usually tackle just one of these challenges, leaving a gap for a more complete solution.

What's the solution?

The researchers developed RELIC, which combines several techniques. It compresses past frames into a small set of latent tokens and tags them with both the actions taken and the absolute camera pose, so the model can quickly recall earlier content and keep its 3D view of the scene consistent. They also extended how far ahead the system can generate: the teacher video model is fine-tuned to produce sequences beyond its original 5-second training horizon, then distilled into a fast causal student using a new memory-efficient "self-forcing" method that trains the student on its own long rollouts without requiring a lot of extra computing power. Essentially, RELIC stores memories efficiently and uses them to generate realistic, continuous experiences at interactive speed.
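To make the memory idea concrete, here is a minimal toy sketch of a camera-aware memory buffer. All names and the pooling-based compression are hypothetical stand-ins (the paper uses learned compression of latent tokens inside a KV cache); the sketch only illustrates the principle of storing compressed per-frame tokens alongside camera poses so past content can be retrieved by spatial proximity rather than recency alone.

```python
import numpy as np

class CameraAwareMemory:
    """Toy stand-in for RELIC-style compressed, camera-tagged memory."""

    def __init__(self, capacity=256, compress_to=16):
        self.capacity = capacity        # max number of stored frames
        self.compress_to = compress_to  # compressed tokens kept per frame
        self.tokens = []                # list of compressed token arrays
        self.poses = []                 # absolute camera positions (x, y, z)

    def write(self, frame_latents, camera_pose):
        # Compress by average-pooling groups of tokens -- a crude
        # substitute for the paper's learned token compression.
        groups = np.array_split(np.asarray(frame_latents, dtype=float),
                                self.compress_to)
        compressed = np.stack([g.mean(axis=0) for g in groups])
        self.tokens.append(compressed)
        self.poses.append(np.asarray(camera_pose, dtype=float))
        if len(self.tokens) > self.capacity:  # evict the oldest frame
            self.tokens.pop(0)
            self.poses.pop(0)

    def retrieve(self, camera_pose, k=4):
        # Return the k stored frames whose cameras were closest to the
        # current pose, approximating 3D-consistent content lookup.
        dists = [np.linalg.norm(p - camera_pose) for p in self.poses]
        order = np.argsort(dists)[:k]
        return [self.tokens[i] for i in order]
```

The key design point the sketch captures is that retrieval is keyed by *where the camera is*, not by *how recently* a frame was seen, which is what lets a model revisit a location after a long excursion and still render it consistently.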

Why it matters?

RELIC represents a significant step forward in creating the next generation of interactive virtual worlds. By solving the problems of real-time performance, long-term memory, and spatial consistency, it opens the door to more immersive and believable experiences in areas like gaming, simulation, and virtual reality. It provides a strong foundation for building worlds that feel truly alive and responsive to user actions.

Abstract

A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges jointly. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher rollouts as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
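The self-forcing idea in the abstract can be illustrated with a toy one-dimensional version. Everything here is a hypothetical simplification (linear "teacher" dynamics, scalar states, plain SGD): the point is only the training loop structure, in which the causal student rolls out on *its own* past outputs and each step is regressed toward the teacher's prediction from that same context, rather than the student only ever seeing teacher-provided states.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_step(x):
    # Stand-in "teacher": fixed linear dynamics the student must match.
    return 0.9 * x + 0.1

def student_step(x, w):
    # Causal student: a linear model with learnable parameters w = (slope, bias).
    return w[0] * x + w[1]

def self_forcing_distill(episodes=500, horizon=8, lr=0.05):
    w = np.array([0.0, 0.0])  # student starts knowing nothing
    for _ in range(episodes):
        x = rng.uniform(-1, 1)  # fresh starting state per episode
        for _ in range(horizon):
            target = teacher_step(x)          # teacher label from the student's context
            err = student_step(x, w) - target
            w -= lr * err * np.array([x, 1.0])  # SGD on 0.5 * err**2
            x = student_step(x, w)            # roll forward on the STUDENT's output
    return w
```

Because the student is trained on states it generated itself, errors that would compound over a long rollout are corrected during training, which is the motivation for distilling over long student self-rollouts rather than teacher-forced steps alone.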