
Higher-order Linear Attention

Yifan Zhang, Zhen Qin, Quanquan Gu

2025-11-03


Summary

This paper introduces a new way to handle attention in language models, called Higher-order Linear Attention (HLA), which aims to make processing long pieces of text more efficient without sacrificing accuracy.

What's the problem?

Traditional attention mechanisms, which language models use to relate every word to every other word, have a cost that grows quadratically with the length of the text, so long inputs become slow to process and memory-hungry. Faster alternatives like linear attention and State Space Models exist, but they are typically restricted to first-order or kernel-based approximations, which limits their ability to capture complex patterns in the data. Essentially, existing methods trade speed for expressive power.
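To make the trade-off concrete, here is a minimal sketch contrasting standard causal attention, which materializes an n × n score matrix, with first-order linear attention, which replaces it with a small running state. The feature map `phi` and the toy dimensions are illustrative assumptions, not the paper's choices:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard causal attention: builds an n x n score matrix (quadratic in n)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))        # causal mask: no future tokens
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    """First-order linear attention: a (d x d) running state replaces the n x n matrix."""
    d = Q.shape[1]
    phi = lambda x: np.maximum(x, 0) + 1e-6            # simple positive feature map (assumption)
    S = np.zeros((d, d))                               # prefix statistic: sum_j phi(k_j) v_j^T
    z = np.zeros(d)                                    # normalizer:       sum_j phi(k_j)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        k, v, q = phi(K[t]), V[t], phi(Q[t])
        S += np.outer(k, v)                            # constant-time state update
        z += k
        out[t] = (q @ S) / (q @ z)                     # per-token output in O(d^2)
    return out
```

Both functions are causal, so at the first position each simply returns the first value vector; the difference is that the linear version never touches an n × n object.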

What's the solution?

The researchers developed HLA, which represents higher-order interactions between words through compact summary statistics. It maintains a constant-size state and updates it as each new word arrives, allowing it to process long sequences in linear time without ever building the full attention matrix. They also derive a strictly causal variant that only looks at past information, and a chunk-parallel training scheme based on associative scans that processes chunks of text in parallel while exactly matching the serial computation. Finally, they outline how to extend the idea to third and higher orders.
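The second-order idea can be sketched as follows. This is an illustrative simplification under my own assumptions, not the paper's exact streaming identities: the state accumulates the outer product k ⊗ k ⊗ v, so each output mixes past value vectors with weights quadratic in the query-key similarity, while the state size depends only on the feature dimension, never on the sequence length:

```python
import numpy as np

def second_order_streaming(Q, K, V):
    """Illustrative second-order streaming attention (a sketch, not HLA's
    exact formulation): outputs weight past values by (q . k_j)^2."""
    n, d = Q.shape
    S2 = np.zeros((d, d, d))   # prefix statistic sum_j k_j x k_j x v_j; size d^3, independent of n
    z2 = np.zeros((d, d))      # normalizer sum_j k_j k_j^T
    out = np.zeros_like(V)
    for t in range(n):
        k, v, q = K[t], V[t], Q[t]
        S2 += np.einsum('a,b,c->abc', k, k, v)    # constant-time state update
        z2 += np.outer(k, k)
        num = np.einsum('a,b,abc->c', q, q, S2)   # sum_j (q . k_j)^2 v_j
        den = q @ z2 @ q + 1e-6                   # sum_j (q . k_j)^2  (eps for stability)
        out[t] = num / den
    return out
```

Because the state is updated before each output, the result at position t depends only on tokens up to t, i.e. the recurrence is strictly causal. (The d³ tensor here is a naive choice; the paper's "compact prefix sufficient statistics" are what keep the real mechanism efficient.)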

Why it matters?

HLA is important because it offers a promising balance between speed and accuracy for language models. It could enable models to process much longer texts, like entire books or articles, more effectively, leading to better performance in tasks like translation, summarization, and question answering. It provides a new building block for creating more powerful and efficient language processing systems.

Abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
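The chunk-parallel claim rests on the state update being associative: when each step adds a fixed-size update to the state, per-chunk prefix sums can be computed independently and stitched together with a scan over chunk totals, reproducing the serial recurrence exactly. Here is a minimal sketch of that principle under my own assumptions, using plain addition as the associative operator rather than the paper's full scheme:

```python
import numpy as np

def serial_states(updates):
    """Serial recurrence S_t = S_{t-1} + u_t, returning every intermediate state."""
    S = np.zeros_like(updates[0])
    states = []
    for u in updates:
        S = S + u
        states.append(S.copy())
    return states

def chunk_parallel_states(updates, chunk=4):
    """Chunk-parallel scan: local cumsums per chunk (parallelizable), then a
    scan over chunk totals to carry the prefix across chunk boundaries.
    Associativity of + guarantees exact agreement with the serial recurrence."""
    U = np.stack(updates)
    chunks = [U[i:i + chunk] for i in range(0, len(updates), chunk)]
    local = [np.cumsum(c, axis=0) for c in chunks]   # within-chunk prefix sums
    carry = np.zeros_like(U[0])
    states = []
    for lc in local:
        states.extend(list(lc + carry))              # shift by carried-in prefix
        carry = carry + lc[-1]                       # accumulate this chunk's total
    return states
```

The same pattern is what lets associative-scan-based training match a token-by-token recurrence bit-for-bit while exposing within-chunk parallelism to the hardware.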