
AERO: Softmax-Only LLMs for Efficient Private Inference

Nandan Kumar Jha, Brandon Reagen

2024-10-18


Summary

This paper introduces AERO, a framework that makes language models more efficient when they run inference directly on encrypted data, making it easier to keep user information private.

What's the problem?

As language models become more common, there are growing concerns about privacy because these models often handle sensitive user data. Current methods for private inference (PI), where data is processed without ever being revealed, are slow and require heavy communication, mainly because nonlinear operations such as LayerNorm, GELU, and Softmax are very expensive to compute on encrypted data. This makes it difficult to use these models effectively in real-world applications.

What's the solution?

To tackle these issues, the authors developed AERO, a four-step framework that simplifies the transformer architecture by removing nonlinear components such as LayerNorm and GELU. What remains is a Softmax-only design with far fewer FLOPs (floating-point operations), so the model can be evaluated on encrypted data much more quickly while maintaining good performance. They also introduce a new technique called entropy regularization to further improve the performance of the Softmax-only model. Compared with the state-of-the-art baseline, AERO reduces communication by up to 4.23× and latency by up to 1.94×.
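To make the architectural change concrete, here is a minimal PyTorch sketch of what a decoder block might look like once LayerNorm and GELU are removed and only the Softmax inside attention remains. This is not the authors' code; the class name `SoftmaxOnlyBlock` and the purely linear feed-forward layer are illustrative assumptions based on the description above.

```python
# Minimal sketch (not the authors' code): a decoder block with the
# PI-unfriendly nonlinearities stripped out, keeping only Softmax.
import torch.nn as nn


class SoftmaxOnlyBlock(nn.Module):
    """Hypothetical transformer block: no LayerNorm, no GELU."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward kept purely linear: the GELU between the two
        # projections has been removed.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x, attn_mask=None):
        # The Softmax inside scaled dot-product attention is the only
        # nonlinearity left in the block.
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask,
                                need_weights=False)
        x = x + attn_out           # residual connection, no LayerNorm
        return x + self.ffn(x)     # residual connection, activation-free FFN
```

Every nonlinearity that is dropped removes a round of expensive encrypted computation, which is why stripping the block down to Softmax alone cuts both communication and latency under private inference.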

Why it matters?

This research is important because it helps make language models more efficient and better suited for handling sensitive information without compromising performance. By improving private inference, AERO can facilitate safer use of AI technologies in various applications, such as healthcare and finance, where protecting user data is crucial.

Abstract

The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively higher communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOPs counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23× communication and 1.94× latency reduction. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.
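As a rough illustration of the entropy regularization idea mentioned in the abstract, the sketch below adds an entropy-based penalty on the attention weights to the training loss. The exact formulation in the paper may differ; the `target` hyperparameter and the function names here are assumptions for illustration only.

```python
# Hedged sketch of an entropy-based penalty on attention weights; the paper's
# exact regularizer may be formulated differently.
import torch


def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Entropy of each attention distribution.
    attn: (batch, heads, query_len, key_len), each row summing to 1."""
    return -(attn * (attn + eps).log()).sum(dim=-1)


def entropy_penalty(attn: torch.Tensor, target: float) -> torch.Tensor:
    """Penalize deviation of the mean attention entropy from a target value.
    `target` is a hypothetical hyperparameter, not taken from the paper."""
    return (attention_entropy(attn).mean() - target) ** 2


# Hypothetical usage inside a training step:
#   loss = lm_loss + lambda_ent * entropy_penalty(attn_weights, target=2.0)
```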