One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter

2024-10-10

Summary

This paper introduces Explained Variance Adaptation (EVA), a method for fine-tuning foundation models, including large language models (LLMs), more effectively by initializing the newly added low-rank adapter weights from data instead of at random.

What's the problem?

When foundation models (FMs) pre-trained on large datasets are fine-tuned with low-rank adaptation (LoRA), the newly added weight matrices are usually initialized at random, and every adapted weight matrix receives the same rank. Recent work either initializes these weights from the pre-trained weights or learns adaptive ranks during training, but the two ideas have only been studied in isolation. The result is slow convergence or a uniform rank distribution, and in turn sub-optimal performance.

What's the solution?

The authors propose EVA, which initializes the new LoRA weights from data rather than at random. They compute a singular value decomposition (SVD) on minibatches of activation vectors and initialize the LoRA matrices with the resulting right-singular vectors. Ranks are then redistributed among all weight matrices so that they explain the maximal amount of variance, after which standard LoRA fine-tuning proceeds. According to the authors, EVA converges faster than competing methods and attains the highest average score per domain on tasks spanning language generation and understanding, image classification, and reinforcement learning.
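To make the initialization step concrete, here is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes a single batch of activations instead of the minibatch-wise SVD the paper describes, skips any centering of the activations, and sets the second LoRA matrix to zero as in standard LoRA. The function name `svd_init_lora` and its parameters are hypothetical.

```python
import numpy as np

def svd_init_lora(activations, out_features, rank):
    """Sketch of a data-driven LoRA initialization (assumed, simplified).

    activations: array of shape (n_samples, in_features), the inputs that
                 reach one weight matrix on a batch of downstream data.
    Returns (A, B, explained) where A has shape (rank, in_features),
    B has shape (out_features, rank), and explained holds the share of
    squared singular values captured by each kept component.
    """
    n_samples, in_features = activations.shape
    if rank > min(n_samples, in_features):
        raise ValueError("rank exceeds the number of available components")

    # Rows of vt are the right-singular vectors, ordered by singular value.
    _, s, vt = np.linalg.svd(activations, full_matrices=False)

    # A projects inputs onto the top-`rank` directions of the activations.
    A = vt[:rank].copy()

    # Assumption: B starts at zero (standard LoRA), so B @ A is zero and the
    # adapted model initially matches the pre-trained one.
    B = np.zeros((out_features, rank))

    # Share of total squared singular values per kept component
    # (uncentered, so this is only a proxy for explained variance).
    explained = s[:rank] ** 2 / np.sum(s ** 2)
    return A, B, explained
```

Calling `svd_init_lora(x, out_features=32, rank=4)` on a batch `x` of shape `(64, 16)` would return an `A` of shape `(4, 16)` with orthonormal rows and a zero `B` of shape `(32, 4)`.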

Why does it matter?

This research is important because it offers a more efficient way to fine-tune foundation models, which underpin applications such as chatbots and image recognition systems. By improving the initialization step, EVA can help these models adapt to new tasks more quickly, which the authors report as faster convergence and higher average scores across the evaluated tasks.

Abstract

Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across model weights. Recent works focus on weight-driven initialization or learning of adaptive ranks during training. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, in turn leading to sub-optimal performance. We propose to enhance LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition on minibatches of activation vectors. Then, we initialize the LoRA matrices with the obtained right-singular vectors and re-distribute ranks among all weight matrices to explain the maximal amount of variance and continue the standard LoRA fine-tuning procedure. This results in our new method Explained Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and attains the highest average score across a multitude of tasks per domain.
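The rank redistribution step mentioned in the abstract can be illustrated with a short sketch. This is one plausible reading, offered as an assumption and not as the authors' procedure: pool the candidate components of all weight matrices, keep the ones with the largest explained variance ratio until a global rank budget is used up, and give each matrix as many ranks as it has components in that set. The function `redistribute_ranks`, the layer names, and the example ratios are hypothetical.

```python
def redistribute_ranks(explained_by_layer, rank_budget):
    """Sketch of a global top-k rank allocation (assumed, simplified).

    explained_by_layer: dict mapping a layer name to a sequence of
                        explained variance ratios, one per candidate
                        component of that layer.
    rank_budget:        total number of ranks to distribute over all layers.
    Returns a dict mapping each layer name to its assigned rank.
    """
    # Pool every candidate component together with the layer it belongs to.
    candidates = [
        (float(ratio), name)
        for name, ratios in explained_by_layer.items()
        for ratio in ratios
    ]

    # Components explaining the most variance come first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    # Each selected component adds one rank to its layer.
    ranks = {name: 0 for name in explained_by_layer}
    for _, name in candidates[:rank_budget]:
        ranks[name] += 1
    return ranks


# Hypothetical ratios for two weight matrices and a budget of 4 ranks.
example = {"q_proj": [0.5, 0.3, 0.1], "v_proj": [0.4, 0.05, 0.01]}
allocation = redistribute_ranks(example, rank_budget=4)
```

In this toy example the budget of four ranks is no longer split evenly: `q_proj` receives three ranks and `v_proj` one, because three of the four strongest components belong to `q_proj`.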
