EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Shih-Yang Liu, Huck Yang, Chien-Yi Wang, Nai Chit Fung, Hongxu Yin, Charbel Sakr, Saurav Muralidharan, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen

2024-10-29

Summary

This paper introduces EoRA, a new method that helps improve the performance of compressed large language models (LLMs) without needing additional training.

What's the problem?

When large language models are compressed (for example, by quantization or pruning) to save space and run faster, they often lose some of their ability to generate accurate or coherent responses. This loss in performance can be a major issue, especially when deploying these models in real-world applications where efficiency matters. Traditional fixes usually require retraining the model, which is time-consuming and resource-intensive.

What's the solution?

EoRA (Training-free Eigenspace Low-Rank Approximation) reformulates model compression as a compensation problem. Instead of retraining, EoRA adds residual low-rank paths that correct the errors introduced by compression. Rather than naively applying SVD to the raw error, it projects the compression error into the eigenspace of the model's input activations, using the eigenvalues to prioritize reconstructing the error components that matter most for the model's outputs. This requires no gradient-based training, only a small amount of calibration data, so the adjustment takes minutes rather than hours.
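The core idea can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the authors' released code: the function name `eora_compensate` and the exact scaling choices are assumptions, but it follows the description above — compute the compression error, rescale it by the eigenvectors and eigenvalues of the calibration-activation covariance, truncate with SVD in that weighted space, and map back to get a rank-r residual path.

```python
import numpy as np

def eora_compensate(W, W_c, X, rank):
    """Sketch of eigenspace-projected low-rank error compensation (hypothetical).

    W:    original weight matrix, shape (out_dim, in_dim)
    W_c:  compressed (quantized/pruned) weight, same shape
    X:    calibration input activations, shape (in_dim, n_samples)
    rank: rank of the residual correction path
    Returns B (out_dim, rank) and A (rank, in_dim) so that B @ A ~= W - W_c,
    with the approximation weighted toward high-eigenvalue activation directions.
    """
    dW = W - W_c                                  # compression error
    cov = X @ X.T                                 # activation covariance
    eigvals, Q = np.linalg.eigh(cov)              # eigenspace of the inputs
    eigvals = np.clip(eigvals, 1e-8, None)        # guard against tiny/negative values
    S = Q * np.sqrt(eigvals)                      # Q @ diag(sqrt(eigvals))
    S_inv = Q / np.sqrt(eigvals)                  # Q @ diag(1/sqrt(eigvals))
    # SVD of the error in the eigenvalue-weighted space, truncated to `rank`
    U, s, Vt = np.linalg.svd(dW @ S, full_matrices=False)
    B = U[:, :rank] * s[:rank]
    A = Vt[:rank] @ S_inv.T                       # undo the scaling: S @ S_inv.T = I
    return B, A
```

At inference the compensated layer simply computes `W_c @ x + B @ (A @ x)`, so the correction is a residual path whose rank (and hence cost) can be chosen per task or compression ratio, independently of the compression format.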

Why it matters?

This research matters because it lets compressed large language models retain much of their original performance. EoRA makes it practical to deploy these models where computational resources are limited, such as on mobile devices or in environments with strict efficiency requirements. By improving how compressed models are repaired after the fact, EoRA can make many AI applications faster and more accessible.

Abstract

In this work, we re-formulate the model compression problem into the customized compensation problem: Given a compressed model, we aim to introduce residual low-rank paths to compensate for compression errors under customized requirements from users (e.g., tasks, compression ratios), resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats. However, naively applying SVD to derive residual paths causes suboptimal utilization of the low-rank representation capacity. Instead, we propose Training-free Eigenspace Low-Rank Approximation (EoRA), a method that directly minimizes compression-induced errors without requiring gradient-based training, achieving fast optimization in minutes using a small amount of calibration data. EoRA projects compression errors into the eigenspace of input activations, leveraging eigenvalues to effectively prioritize the reconstruction of high-importance error components. Moreover, EoRA can be seamlessly integrated with fine-tuning and quantization to further improve effectiveness and efficiency. EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks, such as language generation, commonsense reasoning, and math reasoning tasks (e.g., 31.31%/12.88% and 9.69% improvements on ARC-Easy/ARC-Challenge and MathQA when compensating LLaMA3-8B that is quantized to 4-bit and pruned to 2:4 sparsity). EoRA offers a scalable, training-free solution to compensate for compression errors, making it a powerful tool to deploy LLMs in various capacity and efficiency requirements.