Seeing Fast and Slow: Learning the Flow of Time in Videos
Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi, Matthew Wallingford, Yu-Chiang Frank Wang, Steve Marschner, Wei-Chiu Ma
2026-04-24
Summary
This research explores how computers can understand and change the speed of videos, essentially learning to 'see' and control time within video footage.
What's the problem?
Modern computer vision focuses heavily on *what* is in a video, but pays far less attention to *when* things happen or how quickly they unfold. It is difficult for computers to automatically tell whether a video has been sped up or slowed down, and harder still to generate video at a specified speed or to turn low-frame-rate footage into detailed slow motion.
What's the solution?
The researchers developed a system that learns to recognize speed changes in videos in a self-supervised way, using cues naturally present in the footage, such as how objects move and the video's temporal structure. This let them curate a large collection of high-quality slow-motion videos from noisy in-the-wild sources. Using this data, they then built models that can generate video at a specified playback speed, and that can turn blurry, low-frame-rate videos into smooth, high-frame-rate slow motion with finer temporal detail.
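The self-supervised part of this recipe is easy to illustrate. A standard trick, shown here only as a sketch and not as the authors' actual architecture or training setup, is to subsample clips from the same video at different frame strides and train a network to predict which stride (i.e., which playback speed) was used; the label comes for free from the sampling itself. Everything below, from the candidate speeds to the toy backbone, is an assumption for illustration.

```python
# Hedged sketch of self-supervised playback-speed prediction:
# clips are subsampled from one video at different frame strides,
# and the model must predict which stride (playback speed) was used.
import torch
import torch.nn as nn

SPEEDS = [1, 2, 4, 8]  # candidate frame strides (assumed, not from the paper)

def sample_clip(video: torch.Tensor, num_frames: int, stride: int) -> torch.Tensor:
    """video: (T, C, H, W). Return `num_frames` frames taken every
    `stride` frames, starting at a random temporal offset."""
    max_start = video.shape[0] - num_frames * stride
    start = torch.randint(0, max_start + 1, (1,)).item()
    return video[start : start + num_frames * stride : stride]

class SpeedClassifier(nn.Module):
    """Toy 3D-conv backbone plus a linear head over the candidate speeds."""
    def __init__(self, num_speeds: int = len(SPEEDS)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_speeds)

    def forward(self, clip):  # clip: (B, C, T, H, W)
        return self.head(self.backbone(clip))

# One training step: the label is free -- it is just the stride we sampled with.
model, loss_fn = SpeedClassifier(), nn.CrossEntropyLoss()
video = torch.rand(128, 3, 64, 64)           # fake 128-frame video
label = torch.randint(len(SPEEDS), (1,))     # pick a speed at random
clip = sample_clip(video, 16, SPEEDS[label.item()])
logits = model(clip.permute(1, 0, 2, 3).unsqueeze(0))  # (T,C,H,W) -> (1,C,T,H,W)
loss = loss_fn(logits, label)
loss.backward()
```

Because the supervision signal is generated by the data loader itself, a classifier like this can in principle be trained on unlabeled web video at scale, which is what makes the dataset-curation step described above possible.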
Why it matters?
This work is important because understanding and controlling time in videos opens up possibilities for things like detecting if a video has been tampered with (like speeding it up to hide something), creating more realistic and controllable video content, and building computer systems that have a better understanding of how events unfold in the real world.
Abstract
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world models that understand how events unfold over time.
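The temporal super-resolution task named in the abstract also admits a simple training signal once slow-motion data is available: low-FPS, blurry inputs can be synthesized from high-FPS footage by averaging consecutive frames, a common approximation of the motion blur of a longer exposure. The sketch below shows this pairing; the function, the 8x ratio, and the plain mean are illustrative assumptions, not the paper's pipeline.

```python
import torch

def make_training_pair(high_fps: torch.Tensor, ratio: int = 8):
    """high_fps: (T, C, H, W) slow-motion clip. Returns (low_fps_blurry, high_fps),
    where each low-FPS frame is the average of `ratio` consecutive frames,
    approximating the blur of a slower shutter / lower frame rate."""
    t = (high_fps.shape[0] // ratio) * ratio  # drop frames that don't fill a group
    high_fps = high_fps[:t]
    low_fps = high_fps.reshape(t // ratio, ratio, *high_fps.shape[1:]).mean(dim=1)
    return low_fps, high_fps

clip = torch.rand(64, 3, 128, 128)         # 64 sharp high-FPS frames
blurry, target = make_training_pair(clip)  # 8 blurry frames -> 64 sharp frames
```

A model trained on such pairs learns to invert the averaging, recovering many sharp frames from each blurry one, which is one plausible way to realize the low-FPS-to-high-FPS mapping the paper describes.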