
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Junhao Cheng, Liang Hou, Xin Tao, Jing Liao

2025-11-21


Summary

This paper introduces a new way for AI to predict what happens next in a video: instead of just *telling* you with text, it *shows* you with a generated video clip.

What's the problem?

Current AI models are good at understanding both images and language, but they struggle to turn that understanding into realistic, helpful videos that answer questions about what will happen next in a process or situation. Existing systems usually give a text-only answer, which isn't always the best way to learn something visual, like how to perform a task. The hard part is generating a video that follows the instruction while staying visually and semantically consistent with the input video.

What's the solution?

The researchers created a model called VANS, which stands for Video-ANSwer. VANS has two main parts: a Vision-Language Model (VLM) that understands the question and the input video, and a Video Diffusion Model (VDM) that actually creates the new video. A reinforcement learning technique called Joint-GRPO links the two parts with a shared reward so they improve as a team: the VLM learns to write captions that are accurate and easy to visualize, and the VDM learns to generate videos that are faithful to those captions and to the input visual context. They also built a new dataset, VANS-Data-100K, specifically for training and evaluating this kind of video prediction.
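To make the shared-reward idea concrete, here is a minimal, hypothetical Python sketch of one Joint-GRPO-style step. The model stubs (`dummy_vlm`, `dummy_vdm`), the placeholder reward terms, and the group size are illustrative assumptions, not the paper's actual implementation; the real VANS uses learned reward signals and policy-gradient updates on both models.

```python
# Illustrative sketch of a Joint-GRPO-style shared-reward step (not the VANS code).
import numpy as np

rng = np.random.default_rng(0)

def sample_captions(vlm, video, question, group_size):
    """Hypothetical: draw a group of candidate next-event captions from the VLM."""
    return [vlm(video, question) for _ in range(group_size)]

def generate_video(vdm, caption, context_frames):
    """Hypothetical: condition the VDM on a caption and the input frames."""
    return vdm(caption, context_frames)

def shared_reward(caption, video, reference):
    """Shared reward combining caption accuracy and video fidelity.
    Both terms are random placeholders standing in for the paper's rewards."""
    text_score = float(rng.random())   # e.g., caption vs. ground-truth next event
    video_score = float(rng.random())  # e.g., faithfulness to caption + visual context
    return 0.5 * text_score + 0.5 * video_score

def joint_grpo_step(vlm, vdm, video, question, reference, group_size=4):
    """One step: score a group of (caption, video) pairs with the same reward."""
    captions = sample_captions(vlm, video, question, group_size)
    videos = [generate_video(vdm, c, video) for c in captions]
    rewards = np.array([shared_reward(c, v, reference)
                        for c, v in zip(captions, videos)])
    # Group-relative advantages (the "GR" in GRPO): normalize within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # In practice each advantage would weight a policy-gradient update of the VLM
    # (its caption tokens) and of the VDM (its denoising trajectory).
    return advantages

# Toy usage with stub models:
dummy_vlm = lambda video, q: "pours water into the cup"
dummy_vdm = lambda caption, frames: np.zeros((8, 64, 64, 3))  # fake clip
print(joint_grpo_step(dummy_vlm, dummy_vdm, video=None,
                      question="What happens next?", reference=None))
```

The design point the sketch tries to capture is that both models are scored by the same reward, so the VLM is nudged toward captions the VDM can actually visualize well, rather than captions that only read well as text.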

Why it matters?

This work is important because it moves beyond simply using AI to generate entertainment videos and explores a more practical application: using video to teach and explain things. Being able to *show* someone the next step in a process is often much clearer than just *telling* them, and this research brings us closer to AI systems that can do that effectively, potentially helping people learn new skills or explore creative ideas.

Abstract

While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.