Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou
2025-02-20
Summary
This paper introduces LoRAM, a new way to train large AI language models using less computer memory. The idea is like teaching a smaller version of the AI and then transferring what it learned back to the big version, without needing as much expensive computer equipment.
What's the problem?
Big AI language models are really good at understanding and generating text, but they need a lot of computer memory to be trained or improved. Even when using a method called LoRA to make training cheaper, the original huge AI model still takes up a lot of memory, which makes it hard for researchers without access to powerful computers to work on these AIs.
What's the solution?
The researchers created LoRAM, which works by first making a smaller version of the big AI model by removing parts that matter less during training. They train this smaller version, and then use what it learned to improve the full-size AI. They also do a small amount of extra pre-training so that the smaller and larger versions stay consistent with each other. This method lets them train a huge AI model with 70 billion parameters on a much less powerful computer than is normally needed.
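To make the "train small, infer large" step concrete, here is a minimal sketch of the recovery idea in NumPy. All dimensions, the `kept` index list, and the variable names are illustrative assumptions, not the paper's actual implementation: a LoRA matrix trained against a pruned layer is embedded back into full-size positions (pruned rows stay zero) so the learned update can be applied to the original large model.

```python
import numpy as np

# Hypothetical tiny layer: the full model has 8 neurons; pruning keeps 5 of them.
full_dim, pruned_dim, rank = 8, 5, 2
kept = np.array([0, 2, 3, 5, 7])  # assumed indices of neurons kept after pruning

rng = np.random.default_rng(0)

# LoRA factors trained on the pruned (small) model: delta_W_pruned = B_pruned @ A.
# In this sketch only the output dimension of B is pruned.
B_pruned = rng.normal(scale=0.01, size=(pruned_dim, rank))
A = rng.normal(scale=0.01, size=(rank, full_dim))

# Recovery: scatter the trained rows of B back into a full-size matrix,
# leaving rows for pruned-away neurons at zero so shapes match the big model.
B_full = np.zeros((full_dim, rank))
B_full[kept] = B_pruned

# The recovered low-rank update can now be added to the original large model.
delta_W = B_full @ A
print(delta_W.shape)  # (8, 8)

# Neurons that were pruned during training receive no update at inference:
pruned_away = np.setdiff1d(np.arange(full_dim), kept)
print(np.allclose(delta_W[pruned_away], 0.0))  # True
```

The design point this illustrates is that recovery is purely structural: no retraining is needed on the large model, only an index-based scatter of the small trained matrices back to full size.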
Why it matters?
This matters because it makes working with cutting-edge AI more accessible to researchers who don't have super expensive computers. It could lead to more people being able to improve and customize these powerful AI models, potentially speeding up AI research and making it possible to create better AI assistants, translation tools, and other language-based technologies without needing a room full of high-end computers.
Abstract
Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, aligns the knowledge discrepancy between pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20G HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM, implemented by structured pruning combined with 4-bit quantization, for LLaMA-3.1-70B (LLaMA-2-70B) reduces the parameter storage cost that dominates the memory usage in low-rank matrix training by 15.81× (16.95×), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
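The abstract's memory claim can be sanity-checked with back-of-envelope arithmetic. This sketch assumes 16-bit (2-byte) weights for the baseline, which is a common but not stated assumption; the 16.95× factor is the reduction the abstract reports for LLaMA-2-70B:

```python
# Rough parameter-storage arithmetic (illustrative, assuming fp16 weights).
params = 70e9                          # 70 billion parameters
fp16_gb = params * 2 / 1e9             # ~140 GB to hold the weights in fp16
reduction = 16.95                      # QLoRAM reduction factor for LLaMA-2-70B
qloram_gb = fp16_gb / reduction        # ~8.3 GB after pruning + 4-bit quantization

print(round(fp16_gb))                  # 140
print(round(qloram_gb, 1))             # 8.3
print(qloram_gb < 20)                  # True: fits within a 20G-HBM GPU
```

This is consistent with the abstract's claim that a 70B model becomes trainable on a GPU with only 20G of HBM, since the dominant weight-storage term drops well below that budget (activations, optimizer state for the adapters, and other overheads are ignored here).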