Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
2025-01-14

Summary
This paper introduces a new way for AI language models to work well while using much less computer memory. The researchers created a technique called Tensor Product Attention (TPA) that helps models understand and process longer pieces of text without needing as much memory space.
What's the problem?
Big AI language models are really good at understanding and creating text, but while they generate a response they keep a running record, called a key-value (KV) cache, of every token they have seen so far. For long pieces of writing this cache takes up a lot of computer memory, which makes it hard to use these models for tasks that involve lots of text, like analyzing long documents or having long conversations.
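To get a feel for the numbers, here is a rough back-of-the-envelope estimate of the KV cache that a standard Transformer keeps during generation. The model sizes below are illustrative assumptions, not figures from the paper.

```python
# Rough KV-cache size estimate for standard multi-head attention (MHA).
# All model dimensions below are illustrative assumptions, not values from the paper.
n_layers  = 32        # transformer layers
n_heads   = 32        # attention heads per layer
head_dim  = 128       # dimension of each head
seq_len   = 32_768    # tokens kept in the context window
bytes_per = 2         # fp16 / bf16 storage

# Both keys and values are cached for every layer, head, and token.
kv_cache_bytes = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per
print(f"KV cache per sequence: {kv_cache_bytes / 1e9:.1f} GB")  # ~17.2 GB
```

Under these assumptions a single long sequence needs tens of gigabytes of cache, which is why shrinking the per-token cache footprint matters.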
What's the solution?
The researchers came up with a clever math trick called Tensor Product Attention (TPA). It works like a compact way of folding information so it takes up less space. They used it to build a new type of AI model called T6, which can handle long pieces of text using much less memory than older designs. They tested T6 on a range of language modeling tasks and found that it outperforms standard Transformer attention variants (such as MHA, MQA, GQA, and MLA) while using less memory, as sketched in the code below.
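The sketch below shows the general idea of contextual low-rank factorization in PyTorch: instead of caching a full key and value vector for every head, the model caches a few small per-token factors and rebuilds the keys and values from them when needed (queries are factorized the same way but do not need to be cached). This is a simplified illustration; the layer names, rank, and sizes are assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class TPAKeyValueSketch(nn.Module):
    """Simplified sketch of contextual low-rank (tensor-product) K/V factorization.
    Hyperparameters and layer names are illustrative assumptions."""
    def __init__(self, d_model=1024, n_heads=16, head_dim=64, rank=2):
        super().__init__()
        self.n_heads, self.head_dim, self.rank = n_heads, head_dim, rank
        # Per-token factors: a head-dimension factor and a feature-dimension factor.
        self.k_head_proj = nn.Linear(d_model, rank * n_heads)    # -> A_K(x)
        self.k_feat_proj = nn.Linear(d_model, rank * head_dim)   # -> B_K(x)
        self.v_head_proj = nn.Linear(d_model, rank * n_heads)    # -> A_V(x)
        self.v_feat_proj = nn.Linear(d_model, rank * head_dim)   # -> B_V(x)

    def factors(self, x):
        """Return the small per-token factors that would go into the KV cache."""
        B, T, _ = x.shape
        a_k = self.k_head_proj(x).view(B, T, self.rank, self.n_heads)
        b_k = self.k_feat_proj(x).view(B, T, self.rank, self.head_dim)
        a_v = self.v_head_proj(x).view(B, T, self.rank, self.n_heads)
        b_v = self.v_feat_proj(x).view(B, T, self.rank, self.head_dim)
        return a_k, b_k, a_v, b_v

    def reconstruct(self, a, b):
        """Rebuild full per-head keys or values as an average of rank-1 outer products."""
        # (B, T, rank, n_heads) x (B, T, rank, head_dim) -> (B, T, n_heads, head_dim)
        return torch.einsum("btrh,btrd->bthd", a, b) / self.rank
```

With these illustrative sizes, standard attention would cache 2 × 16 × 64 = 2,048 numbers per token, while the factored form caches 2 × 2 × (16 + 64) = 320, more than a 6× reduction; the actual savings reported in the paper depend on the chosen ranks.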
Why it matters?
This research matters because it could make AI language models much more useful in real-world situations. With less memory needed, these AIs could work on phones or smaller computers, not just big servers. It also means AIs could handle much longer texts, which is important for things like analyzing big documents, having long conversations, or understanding complex stories. This could lead to smarter AI assistants, better research tools, and more advanced language understanding in many fields.
Abstract
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
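The abstract notes that TPA integrates "seamlessly" with RoPE. One way to see why: rotary embeddings act linearly on the feature dimension, so they can be applied to the small cached feature factors, and the result matches rotating the fully reconstructed per-head keys. The snippet below is a numerical check of that property using a minimal RoPE; it is an illustration under simplified assumptions, not the paper's code.

```python
import torch

def rope(x, pos, base=10000.0):
    """Minimal rotary position embedding applied along the last dimension (assumed even)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = pos * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

torch.manual_seed(0)
rank, n_heads, head_dim, pos = 2, 4, 8, 5
a = torch.randn(rank, n_heads)   # head-dimension factors (illustrative)
b = torch.randn(rank, head_dim)  # feature-dimension factors (illustrative)

# Rotate the small cached factor first, then reconstruct the per-head keys...
k_from_factors = torch.einsum("rh,rd->hd", a, rope(b, pos)) / rank
# ...or reconstruct first and rotate each head's key directly.
k_direct = rope(torch.einsum("rh,rd->hd", a, b) / rank, pos)

print(torch.allclose(k_from_factors, k_direct))  # True: RoPE commutes with the factorization
```

Because the two orders of operations agree, positional rotation can be folded into the compact cached factors without ever materializing the full keys, which is what allows the smaller cache to be used directly at inference time.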