
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang

2024-11-21


Summary

This paper shows that training large language models (LLMs) on long sequences of text runs into trouble when the BFloat16 number format is combined with Rotary Positional Embedding (RoPE), and proposes a fix that improves long-context performance.

What's the problem?

As LLMs are trained to handle longer contexts (more text at once), the combination of RoPE and BFloat16 introduces numerical errors. Because of BFloat16's limited precision, RoPE drifts away from its intended behavior of encoding only the relative distance between tokens, and the error accumulates as the context length grows, with the first token contributing most to the problem. These errors make the model less effective at understanding and generating long text.
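To make the drift concrete, here is a minimal sketch (not the paper's setup): it computes the same attention score at nearby versus distant absolute positions with the same relative distance, once with the sin/cos tables and vectors cast to float32 and once to bfloat16. The helper `rope_rotate`, the vector dimension, and the positions are illustrative choices, not the authors' code.

```python
import torch


def rope_rotate(x: torch.Tensor, pos: int, dtype: torch.dtype) -> torch.Tensor:
    """Rotate vector x by the RoPE angles for absolute position `pos`.

    Angles are built in float64; the sin/cos tables and x are then cast to
    `dtype`, so the element-wise rotation runs at the target precision.
    """
    half = x.shape[-1] // 2
    inv_freq = 1.0 / (10000.0 ** (torch.arange(half, dtype=torch.float64) / half))
    angles = pos * inv_freq
    cos, sin = torch.cos(angles).to(dtype), torch.sin(angles).to(dtype)
    x1, x2 = x.to(dtype).chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


torch.manual_seed(0)
q, k = torch.randn(128), torch.randn(128)

# Same relative distance (8 tokens) at two very different absolute positions.
# In exact arithmetic the two scores are identical, because RoPE scores are
# supposed to depend only on the distance m - n, not on absolute positions.
for dtype in (torch.float32, torch.bfloat16):
    near = (rope_rotate(q, 8, dtype) * rope_rotate(k, 0, dtype)).sum()
    far = (rope_rotate(q, 100_008, dtype) * rope_rotate(k, 100_000, dtype)).sum()
    print(dtype, f"|near - far| = {(near - far).abs().item():.6f}")
```

Running this, the gap between the two scores is tiny in float32 but much larger in bfloat16, which is the kind of deviation from relative positional encoding the paper identifies.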

What's the solution?

To address this, the authors propose AnchorAttention, a plug-and-play attention method. Instead of computing full attention across everything packed into a long training context, it treats the first token as a shared anchor with a consistent position ID that every document in the context can attend to, while skipping unnecessary cross-document attention. This keeps the model's semantic understanding intact and cuts training time by more than half compared to standard full attention.
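The following sketch illustrates the masking idea under simple assumptions: several documents are packed into one training context, `doc_ids` marks which document each token belongs to, and token 0 serves as the shared anchor. The helper `anchor_attention_mask` and the example layout are hypothetical; the authors' actual implementation is in the linked repository.

```python
import torch


def anchor_attention_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask for a packed sequence: True = attention allowed.

    Token i may attend to token j (j <= i) if both tokens belong to the same
    document, or if j is the shared anchor token at index 0. Cross-document
    attention is otherwise masked out. Illustrative sketch only.
    """
    n = doc_ids.shape[0]
    causal = torch.tril(torch.ones(n, n)).bool()          # j <= i
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    anchor = torch.zeros(n, n, dtype=torch.bool)
    anchor[:, 0] = True                                    # everyone sees the anchor
    return causal & (same_doc | anchor)


# Packed context with three documents; token 0 doubles as the shared anchor.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
print(anchor_attention_mask(doc_ids).int())
```

Because tokens outside a document never need its keys and values (except the anchor's), large blocks of the attention matrix can be skipped, which is where the training-time savings come from.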

Why it matters?

This research matters because it lets LLMs process longer pieces of text more accurately and efficiently. By working around the numerical issues that arise when BFloat16 and RoPE are combined, AnchorAttention strengthens long-context capabilities, making these models better suited for complex real-world applications such as chatbots, content generation, and more.

Abstract

Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.