Addition is All You Need for Energy-efficient Language Models
Hongyin Luo, Wei Sun
2024-10-07

Summary
This paper introduces the L-Mul algorithm, which improves the energy efficiency of large neural networks by replacing costly floating-point multiplications with simple integer additions, cutting energy consumption substantially while maintaining high precision.
What's the problem?
Large neural networks, which power modern artificial intelligence, consume a great deal of energy, primarily because of the floating-point multiplications that dominate their computation. This demand is a growing concern as AI technology scales, driving up operational costs and environmental impact. Finding ways to reduce energy consumption without sacrificing performance is therefore essential.
What's the solution?
The authors propose the L-Mul algorithm, which approximates floating-point multiplication using integer addition and therefore requires far fewer computational resources than conventional floating-point operations. Applied in tensor processing hardware, L-Mul can cut energy costs by up to 95% for element-wise tensor multiplications and about 80% for dot products. It also maintains, and in some cases improves, model accuracy, making it a practical way to improve the energy efficiency of AI computation.
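To make the mechanism concrete, the sketch below shows how a floating-point multiply can be approximated with a single integer addition on IEEE-754 bit patterns: adding the raw bit patterns adds the exponents and approximates the mantissa product. The bias subtraction and the small mantissa correction constant are illustrative assumptions; the exact offset and rounding used in the paper's L-Mul may differ.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b: int) -> float:
    """Interpret an unsigned 32-bit integer as an IEEE-754 float32."""
    return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def lmul_approx(x: float, y: float) -> float:
    """Approximate x * y with one integer addition on the bit patterns.

    Adding the exponent/mantissa fields of two floats adds their exponents
    and replaces the mantissa product (1 + mx)(1 + my) with roughly
    1 + mx + my. The correction constant below is an assumption, not the
    paper's exact offset.
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = -1.0 if (x < 0) != (y < 0) else 1.0
    xi, yi = f32_bits(abs(x)), f32_bits(abs(y))
    # One integer addition replaces the floating-point multiply.
    # (127 << 23) removes the duplicated exponent bias; (1 << 19) adds an
    # assumed mantissa correction of about 2**-4 for the dropped mx*my term.
    zi = xi + yi - (127 << 23) + (1 << 19)
    return sign * bits_f32(zi)

# Quick check against the exact product.
for a, b in [(1.5, 2.25), (3.14159, 0.001), (-7.0, 0.125)]:
    approx, exact = lmul_approx(a, b), a * b
    print(f"{a} * {b}: exact={exact:.6g}, approx={approx:.6g}, "
          f"rel. err={(approx - exact) / exact:.2%}")
```

The integer add carries any mantissa overflow into the exponent field automatically, which is why a single adder can stand in for the multiplier; denormals, infinities, and exponent overflow would need extra handling in real hardware.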
Why it matters?
This research is important because it addresses the growing energy crisis associated with AI technologies. By providing a more efficient way to perform necessary calculations, L-Mul not only helps reduce operational costs but also contributes to making AI development more sustainable. As AI continues to expand into various fields, solutions like L-Mul will be crucial for ensuring that progress does not come at an unsustainable environmental cost.
Abstract
Large neural networks spend most of their computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication (L-Mul) algorithm, which approximates floating point multiplication with integer addition operations. Compared to 8-bit floating point multiplication, the proposed method achieves higher precision while consuming significantly less bit-level computation. Since multiplying floating point numbers requires substantially more energy than integer addition, applying the L-Mul operation in tensor processing hardware can potentially reduce the energy cost of element-wise floating point tensor multiplications by 95% and of dot products by 80%. We calculate the theoretical error expectation of L-Mul and evaluate the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, indicating that L-Mul with a 4-bit mantissa achieves precision comparable to float8_e4m3 multiplication, and L-Mul with a 3-bit mantissa outperforms float8_e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves precision equivalent to using float8_e4m3 as the accumulation precision in both fine-tuning and inference.
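As a rough, self-contained illustration of how precision varies with mantissa width, the snippet below models an L-Mul-style product as (1 + mx + my + 2^-l) * 2^(ex + ey) with mantissas truncated to k bits and measures relative error on random operands. The offset rule for l and the random-operand test are assumptions made for illustration; they are not the paper's error analysis or benchmarks.

```python
import math
import random

def lmul_k_bits(x: float, y: float, k: int) -> float:
    """Model an L-Mul-style product with a k-bit mantissa.

    Assumes the approximation (1 + mx + my + 2**-l) * 2**(ex + ey); the
    offset rule for l below is an illustrative assumption.
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    fx, ex = math.frexp(abs(x))          # abs(x) = fx * 2**ex, fx in [0.5, 1)
    fy, ey = math.frexp(abs(y))
    # Rewrite as the IEEE-style form (1 + m) * 2**e with m in [0, 1).
    mx, my = fx * 2.0 - 1.0, fy * 2.0 - 1.0
    ex, ey = ex - 1, ey - 1
    # Truncate mantissas to k bits, as a low-precision format would store them.
    scale = 1 << k
    mx = math.floor(mx * scale) / scale
    my = math.floor(my * scale) / scale
    l = k if k <= 3 else (3 if k == 4 else 4)   # assumed offset rule
    return sign * (1.0 + mx + my + 2.0 ** -l) * 2.0 ** (ex + ey)

# Mean relative error over random operands (illustration only).
random.seed(0)
for k in (3, 4, 8):
    errs = []
    for _ in range(10_000):
        a, b = random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)
        errs.append(abs(lmul_k_bits(a, b, k) - a * b) / (a * b))
    print(f"k={k}: mean relative error = {sum(errs) / len(errs):.4f}")
```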