
Differential Transformer

Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei

2024-10-08


Summary

This paper introduces the Differential Transformer, a new architecture that sharpens attention on relevant information while suppressing irrelevant details, leading to better performance across a range of language tasks.

What's the problem?

Traditional Transformer models often pay too much attention to irrelevant information in the input data, which can confuse them and lead to poorer performance. This issue is especially problematic when trying to understand long texts or complex questions, as it can cause the model to generate inaccurate or nonsensical responses.

What's the solution?

To address this, the authors developed the Differential Transformer, which uses a differential attention mechanism. This method computes two separate softmax attention maps and subtracts one from the other to obtain the final attention scores. Because attention that both maps assign to irrelevant tokens cancels in the subtraction, the model is pushed toward sparser attention that focuses on the important context. The researchers found that this approach not only improves accuracy in tasks like question answering and summarization but also makes in-context learning more robust to the order in which examples are presented.
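To make the idea concrete, here is a minimal sketch of a single differential-attention head in PyTorch. It follows the description above (two softmax attention maps over the same tokens, one subtracted from the other); the weight matrices, the fixed scalar lam, and the tensor shapes are illustrative placeholders rather than the paper's exact parameterization (in the paper, the subtraction weight is learned).

```python
import torch
import torch.nn.functional as F

def differential_attention(x, W_q, W_k, W_v, lam=0.5):
    """Sketch of one differential-attention head.

    Two independent query/key projections produce two softmax attention
    maps over the same tokens; subtracting the second (scaled by `lam`)
    from the first cancels attention they assign in common, leaving a
    sparser pattern. `lam` is a placeholder constant here; the paper
    learns this weight.
    """
    d = W_q[0].shape[-1]  # per-map head dimension

    # Two independent query/key projections, one shared value projection.
    q1, q2 = x @ W_q[0], x @ W_q[1]
    k1, k2 = x @ W_k[0], x @ W_k[1]
    v = x @ W_v

    # Two ordinary scaled-dot-product softmax attention maps.
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d ** 0.5, dim=-1)

    # Differential attention: subtract the second map from the first.
    return (a1 - lam * a2) @ v

# Toy usage with random weights (illustration only).
seq_len, d_model, d_head = 8, 32, 16
x = torch.randn(seq_len, d_model)
W_q = [torch.randn(d_model, d_head) for _ in range(2)]
W_k = [torch.randn(d_model, d_head) for _ in range(2)]
W_v = torch.randn(d_model, d_head)
out = differential_attention(x, W_q, W_k, W_v)  # shape (8, 16)
```

Because both maps attend to the same tokens, the attention mass they assign in common (often the diffuse, noisy part) cancels in the subtraction, which is what drives the sparser attention patterns the paper reports.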

Why it matters?

This research is important because it shows a way to enhance the capabilities of large language models, making them more effective at understanding and generating text. By reducing distractions from irrelevant context, the Differential Transformer can provide clearer and more accurate results in real-world applications, such as virtual assistants and automated customer support.

Abstract

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
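In schematic form, the differential attention described in the abstract can be written as follows (single head, omitting multi-head and normalization details; the scalar lambda that weights the second map is learned in the paper):

$$
\operatorname{DiffAttn}(X) = \left(\operatorname{softmax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{d}}\right) - \lambda\,\operatorname{softmax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{d}}\right)\right) V
$$

Here Q1, K1 and Q2, K2 are two independent query/key projections of the input X, and V is a shared value projection; the subtraction of the two softmax maps is what cancels noise attention.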