TRecViT: A Recurrent Video Transformer

Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu

2024-12-23

Summary

This paper introduces TRecViT, a video model that understands videos more effectively by using a dedicated mechanism for each way information can be mixed: over time, over space, and over channels.

What's the problem?

Understanding videos is challenging because they contain a large amount of information that changes over time. Traditional models often struggle with this: they are either too slow or fail to capture the important temporal details, and they typically require a lot of memory and computational power, which makes them inefficient at scale.

What's the solution?

TRecViT introduces a factorised approach that separates processing along three dimensions: time, space, and channels. It uses gated linear recurrent units (LRUs) to mix information over time, self-attention layers to mix information across spatial locations within each frame, and multi-layer perceptrons (MLPs) to mix information across channels (see the sketch below). This combination allows TRecViT to be more efficient, using fewer resources while still achieving high accuracy on tasks like video classification and point tracking.
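To make the factorisation concrete, here is a minimal NumPy sketch of one such block, assuming a token layout of (frames, spatial tokens per frame, channels). The simplified gated recurrence, the single-head attention, and the omission of normalisation layers are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a TRecViT-style factorised block (illustrative only).
import numpy as np

def gated_linear_recurrence(x, a, b):
    """Simplified gated linear recurrence over time (stand-in for an LRU).
    x: (T, N, D) token features; a, b: (D,) per-channel gates.
    h_t = a * h_{t-1} + b * x_t, applied independently to each spatial token."""
    T, N, D = x.shape
    h = np.zeros((N, D))
    out = np.empty_like(x)
    for t in range(T):          # causal scan over frames
        h = a * h + b * x[t]
        out[t] = h
    return out

def self_attention(frame):
    """Single-head self-attention over the spatial tokens of one frame.
    frame: (N, D) -> (N, D)."""
    N, D = frame.shape
    q, k, v = frame @ Wq, frame @ Wk, frame @ Wv
    scores = q @ k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def mlp(x):
    """Two-layer MLP mixing information across channels only."""
    return np.maximum(x @ W1, 0.0) @ W2

def trecvit_block(video_tokens):
    """video_tokens: (T, N, D) = (frames, spatial tokens per frame, channels)."""
    x = video_tokens
    x = x + gated_linear_recurrence(x, a_gate, 1.0 - a_gate)     # time mixing
    x = x + np.stack([self_attention(frame) for frame in x])     # space mixing
    x = x + mlp(x)                                               # channel mixing
    return x

# Toy parameters for a tiny example (hypothetical sizes).
rng = np.random.default_rng(0)
T, N, D = 4, 16, 32
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(D, 4 * D)) * 0.1, rng.normal(size=(4 * D, D)) * 0.1
a_gate = rng.uniform(0.8, 0.99, size=D)

out = trecvit_block(rng.normal(size=(T, N, D)))
print(out.shape)  # (4, 16, 32)
```

Because the temporal mixing is a causal recurrence rather than attention over all frames, its cost and memory grow linearly with the number of frames, which is where the efficiency gains come from.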

Why it matters?

This research is significant because it improves how AI models can analyze videos, making them faster and more efficient. By optimizing the way information is processed, TRecViT can enhance applications in various fields such as surveillance, robotics, and entertainment, where understanding video content accurately and quickly is crucial.

Abstract

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having 3× less parameters, 12× smaller memory footprint, and 5× lower FLOPs count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.