Rethinking Visual Intelligence: Insights from Video Pretraining

Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro

2025-10-29

Summary

This research explores whether training AI on videos, instead of just text, can lead to more versatile and capable visual intelligence, similar to how large language models excel at language tasks.

What's the problem?

While large language models are really good at understanding and adapting to new language-based problems, the same isn't true for AI dealing with images and videos. Current visual AI struggles with understanding how different parts of a scene relate to each other, learning from limited examples, and solving a wide variety of visual tasks effectively. Basically, visual AI hasn't reached the same level of general intelligence as language AI.

What's the solution?

The researchers investigated Video Diffusion Models (VDMs), AI systems trained on large amounts of video data. They compared these models to large language models by equipping both with lightweight add-ons called adapters, which let each model tackle different visual challenges: abstract reasoning puzzles, understanding concepts, playing games, planning routes, and predicting how patterns evolve. The goal was to see whether the way video models are built gives them an advantage in learning new visual skills.
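The paper does not spell out the adapter architecture here, but a common lightweight choice is a low-rank (LoRA-style) update: the big pretrained weight matrix stays frozen, and only two small matrices are trained per task. A minimal NumPy sketch, assuming this LoRA-style setup (the names `lora_adapt`, `A`, `B` are illustrative, not from the paper):

```python
import numpy as np

def lora_adapt(W, x, A, B, alpha=1.0):
    """Apply a frozen weight matrix W plus a low-rank adapter update.

    Only A and B (the small adapter matrices) would be trained;
    W stays frozen, which is what makes per-task adaptation cheap.
    """
    return W @ x + alpha * (B @ (A @ x))

rng = np.random.default_rng(0)
d, r = 8, 2                     # hidden size, adapter rank (r << d)
W = rng.normal(size=(d, d))     # frozen pretrained weight
A = rng.normal(size=(r, d))     # trainable down-projection
B = np.zeros((d, r))            # trainable up-projection, zero-initialized
x = rng.normal(size=d)

# With B zero-initialized, the adapter starts as a no-op,
# so adaptation begins exactly at the pretrained model's behavior.
assert np.allclose(lora_adapt(W, x, A, B), W @ x)
```

Because the adapter adds only `2 * d * r` trainable parameters per layer, both the LLM and the VDM can be adapted to each new task with very little data, which is the sample-efficiency comparison the paper runs.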

Why it matters?

The findings suggest that training AI on videos provides a strong foundation for visual understanding. Video pretraining seems to give AI the right 'built-in' assumptions about how the world works, allowing it to learn new visual tasks more easily and efficiently than models trained only on images or text. This is a step towards creating 'foundation models' for vision – AI systems that can handle a broad range of visual problems with minimal additional training.

Abstract

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
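One of the benchmarks above, cellular automata, makes a concrete next-state prediction task: given the current row of cells, produce the next row under a fixed local rule. A minimal sketch using an elementary cellular automaton (the choice of Rule 110 and wrap-around edges here is illustrative, not taken from the paper):

```python
def step(cells, rule=110):
    """One update of an elementary cellular automaton with wrap-around edges.

    Each cell's next value is looked up from the rule's truth table,
    indexed by the 3-bit neighborhood (left, center, right).
    """
    n = len(cells)
    table = [(rule >> i) & 1 for i in range(8)]  # rule number as a truth table
    return [table[(cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]]
            for i in range(n)]

state = [0, 0, 0, 1, 0, 0, 0]   # single live cell in the middle
for _ in range(3):               # evolve a few steps
    state = step(state)
```

A model evaluated on this benchmark would be shown some rollout steps and asked to predict the next ones, so the task directly probes whether pretraining instilled useful biases about local, deterministic dynamics.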