Future Optical Flow Prediction Improves Robot Control & Video Generation
Kanchana Ranasinghe, Honglu Zhou, Yu Fang, Luyu Yang, Le Xue, Ran Xu, Caiming Xiong, Silvio Savarese, Michael S Ryoo, Juan Carlos Niebles
2026-01-19
Summary
This paper introduces a new model, FOFPred, that predicts how things will move in videos. It focuses on optical flow, a representation that captures the direction and speed of motion at every pixel, and it is designed to understand and generate realistic motion from text descriptions.
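To make "direction and speed at every pixel" concrete, here is a minimal sketch (not taken from the paper) of optical flow as a per-pixel two-channel displacement field, with speed and direction recovered from it. The array sizes and values are illustrative assumptions.

import numpy as np

# Optical flow between two frames: a per-pixel 2D displacement field of shape
# (H, W, 2), where flow[y, x] gives the horizontal and vertical offset that
# the pixel at (x, y) moves by between the frames.
H, W = 4, 4
flow = np.zeros((H, W, 2), dtype=np.float32)
flow[..., 0] = 1.0   # every pixel shifts 1 pixel to the right
flow[..., 1] = 0.5   # and 0.5 pixels downward

# Speed (magnitude) and direction (angle) of motion at each pixel
speed = np.linalg.norm(flow, axis=-1)               # shape (H, W)
direction = np.arctan2(flow[..., 1], flow[..., 0])  # radians, shape (H, W)

print(speed[0, 0], direction[0, 0])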
What's the problem?
Predicting future motion in videos is hard, especially in real-world footage that is often messy and ambiguous. Existing methods struggle to produce motion predictions that are both accurate and broadly useful, and very few have tried to learn this directly from the huge amount of video available online, which is largely unorganized and noisy.
What's the solution?
The researchers created FOFPred, which combines a Vision-Language Model (VLM), a model that understands both images and text, with a diffusion model, a type of generative model that excels at producing realistic images. They trained it on a massive dataset of internet videos and their captions. To handle the noisy data, they applied careful data preprocessing and relied on strong image pretraining of the model's visual components. The result is a model that can predict future motion from a text prompt.
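The paper's code is not reproduced here, so the following is only a hedged, toy-scale sketch of the kind of VLM-plus-diffusion setup described above: a placeholder vision-language encoder turns a frame and a caption into conditioning tokens, and a placeholder denoiser is trained to recover noise added to a future optical flow map. All class names (ToyVLMEncoder, ToyFlowDenoiser), sizes, and the simplified linear noise schedule are assumptions, not FOFPred's actual architecture.

import torch
import torch.nn as nn

class ToyVLMEncoder(nn.Module):
    # Stand-in for a pretrained Vision-Language Model backbone (assumption).
    def __init__(self, dim=128):
        super().__init__()
        self.image_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify the frame
        self.text_embed = nn.Embedding(1000, dim)                       # toy text vocabulary
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frame, text_ids):
        img_tokens = self.image_proj(frame).flatten(2).transpose(1, 2)  # (B, N, dim)
        txt_tokens = self.text_embed(text_ids)                          # (B, T, dim)
        return self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))

class ToyFlowDenoiser(nn.Module):
    # Stand-in diffusion denoiser over a low-resolution future flow map (assumption).
    def __init__(self, dim=128, flow_hw=16):
        super().__init__()
        self.flow_hw = flow_hw
        self.in_proj = nn.Linear(2, dim)
        self.time_embed = nn.Linear(1, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out_proj = nn.Linear(dim, 2)

    def forward(self, noisy_flow, t, cond_tokens):
        B = noisy_flow.shape[0]
        x = noisy_flow.flatten(2).transpose(1, 2)                 # (B, H*W, 2)
        x = self.in_proj(x) + self.time_embed(t.view(B, 1, 1))    # add timestep embedding
        x = self.decoder(x, cond_tokens)                          # cross-attend to VLM tokens
        noise_pred = self.out_proj(x)                             # predicted noise per pixel
        return noise_pred.transpose(1, 2).view(B, 2, self.flow_hw, self.flow_hw)

# One simplified training step: add noise to a ground-truth future flow map,
# then regress the noise given the frame and language conditioning.
encoder, denoiser = ToyVLMEncoder(), ToyFlowDenoiser()
frame = torch.randn(1, 3, 256, 256)          # current RGB frame
text_ids = torch.randint(0, 1000, (1, 8))    # tokenized caption / instruction
future_flow = torch.randn(1, 2, 16, 16)      # target future optical flow (dx, dy)

t = torch.rand(1)                            # random diffusion timestep in [0, 1)
noise = torch.randn_like(future_flow)
noisy_flow = (1 - t).view(1, 1, 1, 1) * future_flow + t.view(1, 1, 1, 1) * noise

cond = encoder(frame, text_ids)
loss = nn.functional.mse_loss(denoiser(noisy_flow, t, cond), noise)
loss.backward()

At inference time, a denoiser of this kind would be applied iteratively, starting from pure noise, to produce a future flow map conditioned on the current frame and the text prompt.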
Why it matters?
This work is important because it shows that we can learn to predict future motion effectively from large, readily available datasets. The model’s ability to work in different areas, like controlling robots and creating realistic video, demonstrates the power of combining vision, language, and advanced generative techniques. It opens the door to more intelligent systems that can understand and interact with the world around them.
Abstract
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.