Hyper-Connections

Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou

2024-10-01

Summary

This paper introduces hyper-connections, a new way of connecting layers in neural networks, proposed as a drop-in alternative to the residual connections used in most modern architectures.

What's the problem?

Residual connections are used in almost every modern neural network, but their common variants face a trade-off. Some suffer from gradient vanishing, where gradients shrink as they flow backward through many layers and early layers learn very slowly; others suffer from representation collapse, where the representations in neighboring layers become nearly identical, so added depth contributes little new information. This seesaw between the two problems makes training large models less effective.

What's the solution?

Hyper-connections tackle these issues by replacing the single fixed residual stream with learnable connections: the network learns how strongly features at different depths are connected and can effectively rearrange layers as needed, rather than being locked into one wiring. The authors tested hyper-connections in the pre-training of large language models, both dense and sparse, where they significantly outperformed traditional residual connections, and they observed similar benefits on vision tasks.
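
To make the mechanism concrete, here is a minimal PyTorch sketch of a static hyper-connection wrapper. This is our illustration of the idea described above, not the authors' implementation: the class name, the choice of n parallel hidden-state copies, and the alpha/beta/gamma parameterization are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Minimal sketch of a static hyper-connection (illustrative, not the
    authors' code). The residual stream is widened to n parallel copies;
    learnable weights control how the copies feed the wrapped layer and
    how its output is written back."""

    def __init__(self, layer: nn.Module, n: int = 2):
        super().__init__()
        self.layer = layer
        # Depth connections: mix the n copies into one layer input.
        self.alpha = nn.Parameter(torch.full((n,), 1.0 / n))
        # Output weights: how strongly the layer output feeds each copy.
        self.beta = nn.Parameter(torch.ones(n))
        # Width connections: remix the old copies among themselves.
        self.gamma = nn.Parameter(torch.eye(n))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n, batch, seq, d_model) -- n parallel hidden-state copies
        layer_in = torch.einsum("n,nbsd->bsd", self.alpha, streams)
        layer_out = self.layer(layer_in)  # (batch, seq, d_model)
        mixed = torch.einsum("mn,nbsd->mbsd", self.gamma, streams)
        return mixed + self.beta.view(-1, 1, 1, 1) * layer_out

# Usage: replicate the embedding into n copies, run the wrapped blocks,
# then collapse the copies (e.g. by summing) at the end of the stack.
block = HyperConnection(nn.TransformerEncoderLayer(64, 4, batch_first=True), n=2)
x = torch.randn(8, 16, 64)                      # (batch, seq, d_model)
streams = block(x.unsqueeze(0).repeat(2, 1, 1, 1))
y = streams.sum(dim=0)                          # back to (batch, seq, d_model)
```

A plain residual connection is recovered as the special case n = 1 with alpha, beta, and gamma fixed to 1; with n > 1 and learnable weights, each layer can decide how strongly features at different depths feed one another.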

Why it matters?

This research is important because it provides a new way to enhance neural network training, potentially leading to better performance in various AI applications. By improving how models learn and represent information, hyper-connections could help advance fields like natural language processing and computer vision.

Abstract

We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.
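
The abstract's "dynamically rearrange layers" points to a variant whose connection strengths depend on the input rather than being fixed parameters. Below is a hedged sketch of one way that could look, continuing the assumptions from the earlier block; the linear-head parameterization here is ours, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class DynamicHyperConnection(nn.Module):
    """Sketch of a dynamic hyper-connection: small linear heads predict the
    mixing weights from the hidden states themselves, so connectivity can
    vary per token. The exact parameterization is our assumption."""

    def __init__(self, layer: nn.Module, d_model: int, n: int = 2):
        super().__init__()
        self.layer = layer
        self.to_alpha = nn.Linear(d_model, 1)    # depth weight per copy, per token
        self.to_beta = nn.Linear(d_model, 1)     # output weight per copy, per token
        self.gamma = nn.Parameter(torch.eye(n))  # width connections kept static here

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n, batch, seq, d_model)
        alpha = self.to_alpha(streams).squeeze(-1).softmax(dim=0)  # (n, batch, seq)
        layer_in = torch.einsum("nbs,nbsd->bsd", alpha, streams)
        layer_out = self.layer(layer_in)
        beta = torch.sigmoid(self.to_beta(streams))                # (n, batch, seq, 1)
        mixed = torch.einsum("mn,nbsd->mbsd", self.gamma, streams)
        return mixed + beta * layer_out
```

Because the weights are predicted per token, the effective connectivity pattern can differ across positions and inputs, which is what would let the network "rearrange" its layers on the fly.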