Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou

2024-12-25

Summary

This paper introduces Fourier Position Embedding (FoPE), a new method designed to help language models understand and process longer texts by improving the frequency-domain behavior of their attention mechanism.

What's the problem?

Language models often struggle with texts longer than those they were trained on because the technique they use to keep track of word order (called a position embedding) becomes less reliable at those lengths. Current techniques, like Rotary Position Embedding (RoPE), have limitations that hurt how well these models generalize and maintain accuracy when handling longer sequences.
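
To make the idea of a rotary position embedding concrete, here is a minimal NumPy sketch of RoPE in general, not this paper's code; the base value of 10000 and the split-half pairing follow common open-source conventions and are assumptions here.

```python
# Minimal sketch of Rotary Position Embedding (RoPE), for illustration only.
import numpy as np

def rope(x, base=10000.0):
    """Rotate pairs of features in x (shape [seq_len, dim]) by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per feature pair; later pairs rotate more slowly.
    freqs = base ** (-2.0 * np.arange(half) / dim)        # [half]
    angles = np.outer(np.arange(seq_len), freqs)          # [seq_len, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each (x1, x2) feature pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Attention scores between rotated queries and keys then depend on the
# relative distance between positions, which is how RoPE encodes order.
q = np.random.randn(8, 64)
q_rotated = rope(q)
```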

What's the solution?

The authors propose FoPE, which builds on existing methods using concepts from signal processing. Instead of assigning a single frequency to each part of the model, FoPE treats each dimension as carrying a Fourier series of several frequencies, which lets it capture positional information more robustly in longer texts. It also zeroes out harmful, undertrained frequency components that could confuse the model. By doing this, FoPE helps the language model maintain stable performance and accuracy even when processing long or complex inputs (see the sketch below).
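
The following is a rough sketch of the core FoPE idea, written for illustration rather than taken from the paper's implementation: each feature pair mixes several frequencies instead of one, and frequencies too low to be adequately trained within the training window are clipped to zero. The values of train_len, n_extra, and coef_scale, and the real-valued cosine formulation, are our assumptions.

```python
# Hedged sketch of the FoPE idea; not the authors' implementation.
import numpy as np

def fope_signal(seq_len, dim, base=10000.0, train_len=512, n_extra=4, coef_scale=0.1, seed=0):
    """Positional signal per position and feature pair: a short Fourier series
    whose undertrained (too-low) frequencies are clipped to zero."""
    rng = np.random.default_rng(seed)
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)    # RoPE-style frequency per feature pair
    floor = 2.0 * np.pi / train_len                   # slowest frequency with a full period in training
    freqs = np.where(freqs < floor, 0.0, freqs)       # clip undertrained components to zero frequency
    pos = np.arange(seq_len)

    # Dominant term: the usual single-frequency component of each pair.
    dominant = np.cos(pos[:, None] * freqs[None, :])                      # [seq_len, half]

    # Extra terms: a few more frequencies per pair with small coefficients,
    # turning each pair's signal into a truncated Fourier series.
    usable = freqs[freqs > 0]
    extra_freqs = rng.choice(usable, size=(half, n_extra))
    extra_coefs = coef_scale * rng.standard_normal((half, n_extra))
    extra = (extra_coefs[None] * np.cos(pos[:, None, None] * extra_freqs[None])).sum(axis=-1)

    return dominant + extra                                               # [seq_len, half]

signal = fope_signal(seq_len=1024, dim=64)
```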

Why it matters?

This research is important because it addresses a key challenge in making AI language models more effective for real-world applications, where they often need to handle long documents or conversations. By improving how these models track position across long inputs, FoPE could lead to better performance in tasks like reading comprehension, summarization, and more, making AI tools more useful for students, researchers, and professionals.

Abstract

Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought about by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against spectral damage. Experiments across various model scales show that, within varying context windows, FoPE maintains a more stable perplexity and more consistent accuracy on a needle-in-a-haystack task than RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.
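
As a rough illustration of the frequency-domain view described in the abstract (the notation is ours, not quoted from the paper): under RoPE the attention logit is a sum of single-frequency terms in the relative distance between positions, and FoPE replaces each such term with a short Fourier series while clipping undertrained low frequencies to the zero-frequency (constant) component.

```latex
% Illustrative notation (ours), not the paper's exact formulation.
% RoPE: each feature pair t contributes one frequency \omega_t, so the attention
% logit is a Fourier-series-like function of the relative distance (m - n):
\[
  q_m^{\top} k_n \;=\; \sum_{t=0}^{d/2-1}
  \operatorname{Re}\!\left[\, \tilde q_t\, \overline{\tilde k_t}\;
  e^{\,i (m-n)\,\omega_t} \right],
  \qquad \omega_t = b^{-2t/d},
  \quad \tilde q_t = q_{2t} + i\, q_{2t+1}.
\]
% FoPE (sketch): each single-frequency term becomes a short Fourier series, and
% frequencies below a floor \sigma (undertrained at the training length) are
% clipped to the zero-frequency, i.e. constant, component:
\[
  e^{\,i x\,\omega_t} \;\longrightarrow\;
  \begin{cases}
    e^{\,i x\,\omega_t} + \sum_{m \neq t} a_m\, e^{\,i x\,\omega_m}, & \omega_t \ge \sigma,\\[2pt]
    1, & \omega_t < \sigma .
  \end{cases}
\]
```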