
Scalable-Softmax Is Superior for Attention

Ken M. Nakanishi

2025-02-03


Summary

This paper introduces Scalable-Softmax (SSMax), a new method that helps AI language models keep their attention on the important information, especially when dealing with long pieces of text.

What's the problem?

Current AI language models use a function called Softmax to decide which parts of a text deserve attention. But as texts get longer, the attention weights Softmax produces get spread thinner and thinner, until everything looks roughly equally important and the model can no longer focus on the key information. It's like trying to find the main point in a really long essay where every sentence looks equally important.
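To make the flattening concrete, here is a small, self-contained Python/NumPy sketch (not from the paper) showing that the largest Softmax weight shrinks toward zero as the number of scores grows, even when one score is clearly the biggest.

```python
import numpy as np

def softmax(z):
    """Standard Softmax over a vector of attention scores."""
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# One "key" score stands out (5.0); all other scores are ordinary (1.0).
for n in [16, 256, 4096, 65536]:
    scores = np.ones(n)
    scores[0] = 5.0
    weight_on_key = softmax(scores)[0]
    print(f"n = {n:6d}  ->  attention on the key item: {weight_on_key:.4f}")
```

With 16 scores the key item still receives most of the attention (around 78%), but with 65,536 scores it receives well under 1%, even though its score never changed. This is the flattening the paper is addressing.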

What's the solution?

The researchers created SSMax as a drop-in replacement for Softmax. Unlike Softmax, SSMax takes the length of the input into account, so the model can keep its attention focused on the most important parts even as the text grows. When they trained language models with SSMax, the models learned faster during pretraining and did better at understanding long texts and retrieving key information. They also found that models which had already been trained with Softmax could gain some of these benefits by switching to SSMax.
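The summary above does not spell out the SSMax formula, so the sketch below is only an illustrative guess: it assumes SSMax rescales the attention scores by a factor proportional to the logarithm of the input length n (with a scaling parameter s) before normalizing, which is a natural way to counteract the flattening. The exact definition in the paper may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ssmax(z, s=1.0):
    """Sketch of a Scalable-Softmax-style normalization (assumed form):
    multiply scores by s * log(n) so the distribution stays peaked as n grows.
    In a real model, s would be a learned parameter."""
    n = z.shape[-1]
    return softmax(s * np.log(n) * z)

# Same setup as before: one key score of 5.0 among ordinary scores of 1.0.
for n in [16, 256, 4096, 65536]:
    scores = np.ones(n)
    scores[0] = 5.0
    print(f"n = {n:6d}  Softmax: {softmax(scores)[0]:.4f}   SSMax: {ssmax(scores)[0]:.4f}")
```

Because the scores are multiplied by a factor that grows with the input length, the gap between the key score and the rest widens as the text gets longer, so the key item keeps nearly all of the attention instead of losing it.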

Why it matters?

This matters because it could make AI language models much better at understanding and working with long pieces of text. This could help in many areas, like improving search engines, creating better virtual assistants, or helping with research that involves analyzing lots of documents. It's a step towards AI that can handle more complex and lengthy information, which is crucial as we want AI to help with more sophisticated tasks in the real world.

Abstract

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.
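As a rough illustration of the "seamlessly integrated" claim, the sketch below makes the normalization step of a standard scaled dot-product attention function pluggable, so Softmax can be swapped for an SSMax-style function. The `ssmax` helper and its `s` parameter follow the same assumed log-n scaling as above and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ssmax(z, s=1.0):
    # Assumed form: scale logits by s * log(n) so long contexts stay peaked.
    n = z.shape[-1]
    return softmax(s * np.log(n) * z)

def scaled_dot_product_attention(Q, K, V, normalize=softmax, **kwargs):
    """Standard single-head attention with a pluggable normalization step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k) logits
    weights = np.apply_along_axis(normalize, -1, scores, **kwargs)
    return weights @ V

# Toy example: one query attending over 1000 keys.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 64))
K = rng.normal(size=(1000, 64))
V = rng.normal(size=(1000, 64))

out_softmax = scaled_dot_product_attention(Q, K, V, normalize=softmax)
out_ssmax = scaled_dot_product_attention(Q, K, V, normalize=ssmax, s=1.0)
print(out_softmax.shape, out_ssmax.shape)   # both (1, 64)
```

Since only the normalization function changes, the surrounding Transformer architecture (projections, heads, residual connections) can stay exactly as it is, which is consistent with the abstract's claim that SSMax can be dropped into existing models, including ones that have already started pretraining.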