VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
Ying Nie, Kai Han, Hongguang Li, Hang Zhou, Tianyu Guo, Enhua Wu, Xinghao Chen, Yunhe Wang
2025-12-17
Summary
This paper introduces VersatileFFN, a new way to make large language models more memory-efficient without sacrificing their ability to understand and generate text.
What's the problem?
Large language models are incredibly powerful, but they require a huge amount of memory to run, making them expensive and difficult to deploy. Existing methods for reducing memory usage, such as pruning and quantization, mostly shrink an existing model, which caps what it can represent. They compress capacity rather than *add* to it.
What's the solution?
VersatileFFN tackles this by redesigning the 'feed-forward network' (FFN), a key component of these models, to be more flexible. It reuses the same core computation in two different ways. One pathway quickly processes simpler parts of the text, like a fast lane. The other repeatedly applies the same computation to more complex parts, like taking a more detailed route. A 'gate' decides which path each token should take based on how difficult it is. Importantly, this doesn't add more parameters (the weights the model learns); it just spends computation more intelligently.
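The two-pathway idea can be sketched in code. This is a minimal illustration, not the paper's implementation (which is not shown here): the ReLU activation, sigmoid gate, residual recursion, and fixed number of depth steps are all assumed details chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, depth_steps = 16, 64, 2

# One shared set of FFN weights, reused by both pathways.
W_up = rng.standard_normal((d_model, d_hidden)) * 0.1
W_down = rng.standard_normal((d_hidden, d_model)) * 0.1
w_gate = rng.standard_normal((d_model, 1)) * 0.1

def ffn(x):
    # Standard two-layer FFN (activation choice is illustrative).
    return np.maximum(x @ W_up, 0.0) @ W_down

def versatile_ffn(x):
    # Width path: a single pass through the shared FFN ("fast lane").
    wide = ffn(x)
    # Depth path: the same FFN applied recursively with a residual,
    # emulating deeper processing for harder tokens.
    deep = x
    for _ in range(depth_steps):
        deep = deep + ffn(deep)
    # Difficulty-aware gate: a per-token scalar in (0, 1)
    # mixing the two pathways.
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))
    return g * wide + (1.0 - g) * deep

tokens = rng.standard_normal((5, d_model))  # 5 tokens
out = versatile_ffn(tokens)
print(out.shape)  # (5, 16)
```

Note that both pathways call the same `ffn`, so the extra capacity of the deep route costs compute, not memory, which is the core of the method.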
Why it matters?
This research is important because it offers a way to improve the efficiency of large language models without reducing their performance. This could make these powerful tools more accessible and affordable, allowing more people to use them and potentially leading to even more advanced AI applications.
Abstract
The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering "easy" tokens through the efficient width-wise route and allocating deeper iterative refinement to "hard" tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.
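The width-versatile path, which "generates a mixture of sub-experts from a single shared FFN," can be illustrated with a sketch. The details below are assumptions for illustration only: sub-experts are taken as contiguous slices of the shared FFN's hidden dimension, and a dense softmax router mixes them (the paper describes mimicking *sparse* expert routing; a real implementation would likely use top-k selection).

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden, n_experts = 16, 64, 4
chunk = d_hidden // n_experts

# One shared FFN; its hidden dimension is partitioned into sub-experts.
W_up = rng.standard_normal((d_model, d_hidden)) * 0.1
W_down = rng.standard_normal((d_hidden, d_model)) * 0.1
w_route = rng.standard_normal((d_model, n_experts)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def width_versatile(x):
    # Router assigns each token a weight over sub-experts.
    scores = softmax(x @ w_route)              # (tokens, n_experts)
    h = np.maximum(x @ W_up, 0.0)              # shared hidden activations
    out = np.zeros_like(x)
    for e in range(n_experts):
        sl = slice(e * chunk, (e + 1) * chunk)
        # Each "sub-expert" is just a slice of the shared FFN's weights,
        # so the mixture adds no parameters beyond the single FFN.
        out += scores[:, e:e + 1] * (h[:, sl] @ W_down[sl, :])
    return out

tokens = rng.standard_normal((3, d_model))
out = width_versatile(tokens)
print(out.shape)  # (3, 16)
```

Because every sub-expert is carved from the same weight matrices, expert routing comes for free parameter-wise; only the routing vector `w_route` is extra, and it is negligible next to the FFN itself.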