TV2TV: A Unified Framework for Interleaved Language and Video Generation
Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
2025-12-05
Summary
This paper introduces TV2TV, a new approach to creating videos from text prompts, aiming to make the videos more complex, realistic, and controllable.
What's the problem?
Current video generation models often struggle when asked to create videos that require a lot of planning or involve multiple steps and decisions about what should happen next, like a character performing a series of actions or a story unfolding with twists and turns. They have trouble with 'semantic branching' (making choices that change the direction of the video) and with reasoning consistently across the video's full length.
What's the solution?
TV2TV works by interleaving text and video generation. Instead of trying to generate every frame directly from the prompt, it alternates between generating text that describes what *should* happen next and then generating the video frames that realize that description. A 'Mixture-of-Transformers' architecture handles both tasks: predicting the next words in a sequence and the next frames in a video. This lets the model essentially 'think' about the video in words before 'showing' it in pictures, leading to more coherent and logical results. Users can also intervene with text at any point to change the video's direction.
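The alternation described above can be sketched as a simple inference loop. This is a toy illustration, not the authors' implementation: `language_tower`, `video_tower`, and the `interventions` argument are all assumed names, and the towers are stubs that return strings rather than real tokens or frames.

```python
# Hypothetical sketch of TV2TV-style interleaved inference (not the paper's code).
# The two "towers" are toy stand-ins; in the real model both are transformers
# sharing a Mixture-of-Transformers backbone.

TEXT, VIDEO = "text", "video"

def language_tower(context):
    """Toy next-token stand-in: plan the next segment in words."""
    step = sum(1 for kind, _ in context if kind == TEXT)
    return f"plan step {step}: describe what happens next"

def video_tower(context, n_frames=2):
    """Toy next-frame stand-in: 'render' frames conditioned on the latest plan."""
    last_plan = next(item for kind, item in reversed(context) if kind == TEXT)
    return [f"frame conditioned on <{last_plan}>" for _ in range(n_frames)]

def generate_interleaved(prompt, n_segments=3, interventions=None):
    """Alternate between 'thinking in words' and 'acting in pixels'.
    `interventions` maps a segment index to user-supplied text that
    overrides the model's own plan, steering the video mid-generation."""
    interventions = interventions or {}
    context = [(TEXT, prompt)]
    for seg in range(n_segments):
        # Think in words: user text overrides the generated plan if provided.
        plan = interventions.get(seg, language_tower(context))
        context.append((TEXT, plan))
        # Act in pixels: frames are conditioned on the full interleaved context.
        for frame in video_tower(context):
            context.append((VIDEO, frame))
    return context

trace = generate_interleaved("a goalkeeper dives",
                             interventions={1: "the ball hits the crossbar"})
```

Because every frame is conditioned on the interleaved history, a text intervention at segment 1 changes all subsequent frames, which is the controllability property the paper emphasizes.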
Why it matters?
This research is important because it represents a step forward in creating video generation models that can handle more complex and creative tasks. By allowing the model to reason about the video using language, it improves both the quality of the generated videos and how well they follow the user's instructions, opening the door to more powerful and user-friendly video creation tools.
Abstract
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.