SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez

2026-02-19

Summary

This paper focuses on improving a technique called Sparse-Linear Attention (SLA), which is used to make AI video generation much faster. SLA already performs well, but this research aims to make it even more efficient while preserving the quality of the videos it generates.

What's the problem?

The original SLA method wasn't perfect. It decided which attention computations to handle exactly and which to approximate using a simple rule based on the size of the attention weights, which wasn't always the best choice. Also, the way SLA combined the 'sparse' and 'linear' attention branches didn't match a direct mathematical decomposition into sparse and linear attention, leading to avoidable errors.
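The heuristic split can be pictured with a toy sketch (a simplification for illustration, not the paper's actual block-level implementation; `keep_ratio` is a made-up parameter): scores with large magnitude are routed to the exact sparse branch, and everything else is left to the cheap linear approximation.

```python
import numpy as np

def heuristic_split(scores, keep_ratio=0.1):
    """Toy version of a magnitude-based split: the largest attention
    scores are computed exactly (sparse branch), the rest are handled
    by the linear approximation. keep_ratio is illustrative only."""
    k = max(1, int(scores.size * keep_ratio))
    # k-th largest score acts as the magnitude threshold
    thresh = np.partition(scores.ravel(), -k)[-k]
    sparse_mask = scores >= thresh   # computed exactly
    linear_mask = ~sparse_mask       # approximated cheaply
    return sparse_mask, linear_mask
```

SLA2's point is that this threshold rule is fixed and can be suboptimal; the router that replaces it is learned during fine-tuning.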

What's the solution?

The researchers developed SLA2, which addresses these issues in three main ways. First, instead of a fixed rule, SLA2 *learns* a router that decides whether each attention computation should use the sparse or the linear branch. Second, they created a more faithful formulation that combines the sparse and linear attention branches with a ratio the system also learns. Finally, they added 'low-bit attention', which reduces the computational load by using lower-precision numbers, and fine-tuned the model with quantization awareness to keep any loss of quality minimal.
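The combination idea can be sketched in a few lines of NumPy. This is a minimal single-head toy with a hand-set ratio `alpha`; in SLA2 both the routing and the ratio are learned, and the real kernels are fused low-bit GPU code, so none of the names or defaults below come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(Q, K, V, k=4):
    # Exact attention restricted to the top-k scores per query.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    thresh = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    return softmax(masked) @ V

def linear_attention(Q, K, V):
    # Kernelized attention with phi(x) = elu(x) + 1, a common choice.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    num = Qp @ (Kp.T @ V)                      # O(n*d^2) instead of O(n^2*d)
    den = Qp @ Kp.sum(axis=0, keepdims=True).T
    return num / den

def sparse_linear_mix(Q, K, V, alpha=0.7, k=4):
    # alpha is hand-set here; SLA2 learns the combination ratio.
    return alpha * sparse_attention(Q, K, V, k) + (1 - alpha) * linear_attention(Q, K, V)
```

Both branches produce convex combinations of the value rows, so mixing them with any ratio in [0, 1] stays a valid attention output; the learnable ratio lets the model decide how much to trust each branch.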

Why it matters?

This research matters because it significantly speeds up AI video generation. SLA2 reaches 97% attention sparsity and an 18.6x speedup in the attention computation while still producing high-quality videos. This makes AI video generation more accessible and practical for a wider range of applications, like creating special effects or generating content for social media.

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.