DarwinLM: Evolutionary Structured Pruning of Large Language Models
Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh
2025-02-17
Summary
This paper introduces DarwinLM, a new method for making large AI language models smaller and faster without losing much of their ability to understand and generate text. It's named after Charles Darwin because it uses ideas similar to natural selection to figure out which parts of the AI model to keep and which to remove.
What's the problem?
Big AI language models are really good at understanding and creating text, but they're also huge and need a lot of computing power to run. This makes them hard to use in real-world applications, especially when you need quick responses. Also, some parts of these models are more important than others, so just cutting everything equally doesn't work well.
What's the solution?
The researchers created DarwinLM, which works like a smart trimming tool for AI models. It creates many smaller versions of the original model, tests them to see which ones work best, and then combines the best parts to make even better versions. This process repeats over several 'generations,' just like in evolution. DarwinLM also does a quick training check on these smaller models to make sure they can still learn and improve after being trimmed down.
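The generate-test-combine loop described above can be sketched as a simple evolutionary search. This is an illustrative toy, not the paper's implementation: the configuration encoding (one sparsity level per layer), the mutation rule, and all function names are assumptions made for the example.

```python
import random

def mutate(parent_config, num_levels=8):
    """Create one 'offspring' by nudging the sparsity level of a single
    randomly chosen layer up or down (illustrative mutation rule)."""
    child = list(parent_config)
    i = random.randrange(len(child))
    child[i] = min(num_levels - 1, max(0, child[i] + random.choice([-1, 1])))
    return child

def evolve(fitness, generations=10, offspring=8, num_layers=32):
    """Toy evolutionary search: each generation, spawn several mutated
    offspring and keep whichever candidate (parent included) scores best
    under `fitness` (higher = better)."""
    parent = [4] * num_layers  # start from a uniform sparsity profile
    for _ in range(generations):
        candidates = [mutate(parent) for _ in range(offspring)]
        parent = max(candidates + [parent], key=fitness)
    return parent
```

In the real method the fitness of a candidate is its quality after a short training check; here any scoring function can be plugged in to see the selection mechanics.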
Why does it matter?
This matters because it could make powerful AI language models much more practical to use in everyday applications. By making these models smaller and faster without losing much of their ability, DarwinLM could help bring advanced AI text understanding and generation to more devices and situations where quick responses are needed. It's also more efficient, needing less data to train the trimmed-down models, which could save time and resources in developing AI technologies.
Abstract
Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5× less training data during post-compression training.
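The multistep selection described in the abstract, where the token budget grows at each stage while weaker candidates are eliminated, resembles a successive-halving schedule. The sketch below is an assumption-laden illustration: the token schedule, the halving ratio, and the `finetune_and_score` callback are all hypothetical stand-ins, not the paper's actual values.

```python
def training_aware_select(candidates, finetune_and_score,
                          token_schedule=(10_000, 50_000, 250_000)):
    """Successive-halving-style selection (illustrative): at each stage,
    briefly fine-tune every surviving candidate on a growing token budget,
    rank them by score (higher = better), and keep the better half, so
    most training compute is spent on the most promising pruned models."""
    survivors = list(candidates)
    for tokens in token_schedule:
        ranked = sorted(survivors,
                        key=lambda m: finetune_and_score(m, tokens),
                        reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors[0]  # the single fittest candidate
```

The design point this captures is that early, cheap training stages act as a coarse filter, and only a shrinking pool of candidates ever sees the larger token budgets.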