Time Blindness: Why Video-Language Models Can't See What Humans Can?
Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny
2025-06-02

Summary
This paper introduces SpookyBench, a benchmark designed to test how well AI models that understand both video and language can recognize patterns that unfold over time, especially when the individual frames look like random noise and contain no clear shapes or objects.
What's the problem?
Current video-language models struggle to detect changes or patterns in videos when there is no obvious spatial information, such as recognizable objects or scenes. As a result, they miss temporal structure that humans can spot easily.
What's the solution?
The researchers created SpookyBench, a benchmark that challenges these models with videos made of noise-like frames containing no clear objects or backgrounds, so that the information is carried only by how the frames change over time. The test reveals where models fail to pick up on time-based patterns that humans can still recognize.
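To make the idea concrete, here is a minimal sketch (not the authors' actual generation code) of the kind of stimulus described: a video in which pixels belonging to a hidden shape flicker in lockstep across frames, while background pixels flicker independently at random. Any single frame looks like pure noise, but the coherent flicker makes the shape visible over time. The function name, frame count, and square mask are illustrative assumptions.

```python
import numpy as np

def make_temporal_pattern_video(mask, num_frames=30, seed=0):
    """Hypothetical sketch: encode a shape only in temporal dynamics.

    Pixels inside `mask` flip between black and white together on every
    frame; background pixels flip independently at random. A single frame
    is indistinguishable from noise, but the coherent flicker reveals the
    shape when the frames are viewed as a video.
    """
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    frames = np.empty((num_frames, h, w), dtype=np.uint8)
    for t in range(num_frames):
        bg = rng.integers(0, 2, size=(h, w))  # independent per-pixel flicker
        fg = np.full((h, w), t % 2)           # coherent flicker for the shape
        frames[t] = np.where(mask, fg, bg) * 255
    return frames

# An illustrative square "shape" hidden in the noise
mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 20:44] = True
video = make_temporal_pattern_video(mask)
```

A model that reasons only over individual frames sees nothing but noise here; recovering the shape requires correlating each pixel's flicker across time, which is exactly the capability the benchmark probes.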
Why it matters?
This is important because it exposes a fundamental weakness in how AI models understand video. Closing this gap matters for real-world applications such as security monitoring, sports analysis, and any task that depends on understanding how things change over time.
Abstract
SpookyBench is a benchmark for temporal pattern recognition in videos that highlights the limitations of vision-language models in processing noise-like frames without spatial information.