EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search
Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
2024-10-23
Summary
This paper presents EvoPress, a new method for compressing large language models (LLMs) to make them smaller and more efficient while maintaining their performance.
What's the problem?
Large language models can be very expensive to run because they require a lot of computing power. Current compression methods often rely on heuristic rules for deciding how much to compress each part of the model, and these rules don't always hold, so the compressed model can lose more accuracy than expected.
What's the solution?
The authors propose EvoPress, which applies evolutionary search to decide how strongly each part of the model should be compressed. Rather than compressing every layer equally, the method assigns different compression levels to different layers while respecting an overall compression budget, so that the most important parts of the model are preserved. EvoPress has been shown to work effectively across various LLMs, achieving better accuracy than previous methods at the same overall compression level.
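The idea of searching over per-layer compression levels under a fixed global budget can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the layer "importances", the fitness function, and the simple (1+1) selection scheme below are hypothetical stand-ins for the calibration-based model evaluation EvoPress actually uses.

```python
import random

LEVELS = [0.3, 0.5, 0.7]   # candidate per-layer sparsity levels (illustrative)
NUM_LAYERS = 8
# Hypothetical sensitivity of each layer to compression.
IMPORTANCE = [3.0, 1.0, 2.0, 0.5, 4.0, 1.5, 0.8, 2.5]

def fitness(candidate):
    # Lower is better: a toy proxy for accuracy loss, pretending that
    # loss grows when important layers are compressed more heavily.
    return sum(imp * s for imp, s in zip(IMPORTANCE, candidate))

def mutate(candidate):
    # Swap the levels of two random layers: the average sparsity,
    # i.e. the global compression budget, stays exactly the same.
    child = list(candidate)
    i, j = random.sample(range(NUM_LAYERS), 2)
    child[i], child[j] = child[j], child[i]
    return child

def evolve(generations=200, seed=0):
    random.seed(seed)
    # Start from a round-robin assignment that meets the budget.
    parent = [LEVELS[k % len(LEVELS)] for k in range(NUM_LAYERS)]
    for _ in range(generations):
        child = mutate(parent)
        if fitness(child) <= fitness(parent):  # keep the better candidate
            parent = child
    return parent
```

Because mutation only permutes an existing assignment, every candidate satisfies the same global budget by construction; the search merely redistributes compression away from sensitive layers. The real method evaluates candidates on actual model outputs over a small calibration set instead of this closed-form proxy.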
Why it matters?
This research is important because it helps make advanced AI models more accessible by reducing their resource needs. By improving how we compress these models, we can deploy them in more applications, making powerful AI technology available to a wider range of users and devices.
Abstract
The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as error monotonicity, i.e. that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area, and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, error monotonicity does not hold for LLMs: compressed models with lower sum of per-layer errors can perform worse than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence, and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, as well as quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.