Cottention: Linear Transformers With Cosine Attention
Gabriel Mongaras, Trevor Dohm, Eric C. Larson
2024-10-01

Summary
This paper presents Cottention, a new attention mechanism for transformers that replaces the softmax operation in standard attention with cosine similarity to improve memory efficiency.
What's the problem?
Traditional softmax attention in transformer models uses memory that grows quadratically with sequence length, because it materializes a score matrix comparing every token with every other token. This quadratic memory usage makes it difficult for these models to handle longer inputs effectively, limiting their performance.
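As a point of reference, standard scaled dot-product softmax attention materializes a full sequence-by-sequence score matrix. The minimal PyTorch sketch below (textbook attention, not the paper's code; names are illustrative) makes that quadratic term visible:

```python
import torch
import torch.nn.functional as F

def softmax_attention(Q, K, V):
    """Standard softmax attention, shown only to illustrate the memory bottleneck.

    Q, K, V: (batch, seq_len, d). The score matrix below has shape
    (batch, seq_len, seq_len), so memory grows quadratically with seq_len.
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (batch, seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ V          # (batch, seq_len, d)
```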
What's the solution?
Cottention addresses this by replacing the softmax operation with cosine similarity between queries and keys. Because cosine similarity requires no row-wise softmax normalization, the attention equation can be rearranged (multiplying keys and values first) so that memory grows linearly with sequence length rather than quadratically; a sketch of this rearrangement is shown below. The authors also show that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, which keeps memory usage constant during inference. Their experiments show that Cottention performs comparably to softmax attention while using significantly less memory.
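To make the rearrangement concrete, here is a minimal sketch in PyTorch. It assumes L2-normalized queries and keys so that their dot products are cosine similarities; the function name, the use of F.normalize, and the eps value are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cosine_attention(Q, K, V, eps=1e-6):
    """Sketch of cosine-similarity attention computed in the linear-memory order:
    Q_hat @ (K_hat^T @ V) instead of (Q_hat @ K_hat^T) @ V.

    Q, K, V: (batch, seq_len, d) tensors.
    """
    # L2-normalize queries and keys so their dot products are cosine similarities.
    Qh = F.normalize(Q, dim=-1, eps=eps)
    Kh = F.normalize(K, dim=-1, eps=eps)

    # By associativity, K_hat^T @ V is a (d x d) matrix independent of sequence
    # length, so memory grows linearly in seq_len instead of quadratically.
    KV = Kh.transpose(-2, -1) @ V           # (batch, d, d)
    return Qh @ KV                          # (batch, seq_len, d)
```

Note that no seq_len-by-seq_len matrix is ever formed; the only intermediate that depends on the full sequence is the fixed-size key-value product.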
Why it matters?
This research is important because it enables transformer models to process longer sequences without running into memory issues. By improving memory efficiency, Cottention could enhance the capabilities of AI systems in tasks like natural language processing, allowing them to work with more complex and lengthy inputs.
Abstract
Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.
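The RNN reformulation mentioned in the abstract can be sketched as follows: a running d-by-d state of key-value outer products is updated one token at a time, so causal inference needs only constant memory in sequence length. This is a sketch of the general idea under assumed causal masking and PyTorch tensors; the exact recurrence and any stabilization terms in the paper may differ.

```python
import torch
import torch.nn.functional as F

def cosine_attention_recurrent(Q, K, V, eps=1e-6):
    """Illustrative recurrent (causal) form of cosine attention.

    A running state S_t = sum_{i <= t} k_i v_i^T of shape (d, d) replaces the
    growing attention matrix, so per-step memory is constant in seq_len.
    """
    Qh = F.normalize(Q, dim=-1, eps=eps)    # (batch, seq_len, d)
    Kh = F.normalize(K, dim=-1, eps=eps)
    batch, seq_len, d = V.shape

    S = torch.zeros(batch, d, d, dtype=V.dtype, device=V.device)
    outputs = []
    for t in range(seq_len):
        # Accumulate the outer product k_t v_t^T into the fixed-size state.
        S = S + Kh[:, t].unsqueeze(-1) * V[:, t].unsqueeze(-2)   # (batch, d, d)
        # Output for token t attends only to tokens 0..t via the state.
        outputs.append(Qh[:, t].unsqueeze(-2) @ S)               # (batch, 1, d)
    return torch.cat(outputs, dim=-2)                            # (batch, seq_len, d)
```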