ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Mohsen Ghafoorian, Amirhossein Habibian

2026-01-09

Summary

This paper introduces a new way to build video generation models, focusing on making them more efficient and able to create longer videos without needing massive amounts of computing power.

What's the problem?

Current state-of-the-art video generation models use a technique called 'attention', which allows the model to relate different parts of the video when creating new frames. However, this attention process becomes extremely slow and memory-intensive as the video gets longer, because its computational cost grows quadratically with the number of frames. This limits how long a video these models can realistically create.
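To make the scaling problem concrete, here is a tiny back-of-the-envelope sketch (the token count per frame is a made-up illustrative number, not from the paper): full self-attention compares every token with every other token, so its cost grows with the square of the sequence length.

```python
# Illustrative only: how full softmax attention cost scales with video length.
# tokens_per_frame is a hypothetical value, not a figure from the paper.
def attention_cost(num_frames, tokens_per_frame=1536):
    """Number of pairwise comparisons in full self-attention over all tokens."""
    seq_len = num_frames * tokens_per_frame
    return seq_len ** 2  # quadratic in sequence length

# Doubling the number of frames quadruples the attention cost.
print(attention_cost(32) / attention_cost(16))  # -> 4.0
```

This quadratic growth is why doubling a video's length roughly quadruples the attention compute, rather than merely doubling it.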

What's the solution?

The researchers developed a new attention mechanism called ReHyAt. It's a hybrid approach, combining the best parts of two different attention methods: softmax attention, which is very accurate but slow, and linear attention, which is fast but less accurate. This allows ReHyAt to be both efficient and produce high-quality videos. Importantly, they also found a way to 'teach' ReHyAt from existing, powerful softmax-based models (a process called distillation), which cuts the training cost by roughly two orders of magnitude, to about 160 GPU hours.
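The hybrid idea can be sketched in a few lines. This is a minimal illustration under loose assumptions, not the paper's actual architecture: softmax attention handles the current chunk accurately, while a constant-size recurrent state summarizes all past chunks via linear attention, so per-chunk cost and memory stay flat no matter how long the video gets. The feature map and state update shown here are generic linear-attention choices, not ReHyAt's.

```python
import numpy as np

def hybrid_attention_chunk(q, k, v, state_kv, state_k):
    """One chunk of a hypothetical hybrid: softmax within the chunk,
    linear attention over a fixed-size summary of all past chunks."""
    # Softmax attention inside the chunk (accurate, quadratic in chunk size only).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    local_out = (weights / weights.sum(-1, keepdims=True)) @ v

    # Linear attention over past chunks via a constant-size recurrent state.
    phi_q = np.maximum(q, 0.0)                 # simple positive feature map
    past_out = (phi_q @ state_kv) / (phi_q @ state_k + 1e-6)[:, None]

    # Fold this chunk's keys/values into the recurrent state.
    phi_k = np.maximum(k, 0.0)
    state_kv = state_kv + phi_k.T @ v
    state_k = state_k + phi_k.sum(0)
    return local_out + past_out, state_kv, state_k

d = 8
state_kv, state_k = np.zeros((d, d)), np.zeros(d)
for _ in range(3):  # process three chunks; the state never grows
    q, k, v = (np.random.randn(4, d) for _ in range(3))
    out, state_kv, state_k = hybrid_attention_chunk(q, k, v, state_kv, state_k)
print(out.shape, state_kv.shape)  # -> (4, 8) (8, 8)
```

The key property is that `state_kv` and `state_k` have fixed shapes, so memory stays constant as chunks accumulate, which is what makes a chunk-wise recurrent formulation possible.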

Why it matters?

This work is important because it unlocks the possibility of creating much longer, higher-quality videos with existing hardware. It makes video generation more practical for applications like on-device video editing or creating long-form content, and provides a blueprint for improving future video generation models without requiring enormous computational resources.

Abstract

Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. The project page is available at https://qualcomm-ai-research.github.io/rehyat.
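The quadratic-to-linear reduction the abstract describes rests on a standard identity of linear (softmax-free) attention: by associativity, (QKᵀ)V equals Q(KᵀV), so the model never needs to materialize the N×N attention matrix, only a small d×d state that can be carried recurrently across chunks. A quick numeric check of that identity:

```python
import numpy as np

# Linear-attention identity: (Q K^T) V == Q (K^T V).
# The left side builds an N x N matrix (quadratic); the right side
# only ever builds a d x d state (linear in N), which is what enables
# a constant-memory recurrence over chunks.
rng = np.random.default_rng(0)
N, d = 1000, 16
Q, K, V = rng.standard_normal((3, N, d))

quadratic = (Q @ K.T) @ V   # materializes an N x N attention matrix
linear = Q @ (K.T @ V)      # only a d x d intermediate

print(np.allclose(quadratic, linear))  # -> True
```

Softmax attention breaks this identity (the row-wise normalization is nonlinear), which is why purely linear models trade away some fidelity and why a hybrid of the two is an appealing middle ground.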