Return of the Encoder: Maximizing Parameter Efficiency for SLMs
Mohamed Elfeki, Rui Liu, Chad Voegele
2025-01-28
Summary
This paper argues that encoder-decoder language models, often seen as outdated next to today's decoder-only models, are actually a better fit for small language models (SLMs) of 1 billion parameters or fewer. The researchers show that on edge devices the encoder-decoder design responds faster and generates more text per second, and they introduce a knowledge distillation method that lets these small models learn from much larger decoder-only teacher models.
What's the problem?
Most recent language models use a decoder-only design, and the field has largely assumed that small models should simply be shrunken versions of large ones. But small models often need to run on hardware with limited computing power, such as phones and other edge devices, where speed and memory matter a great deal. The researchers wanted to know whether the older encoder-decoder design, which reads the whole input once and then generates the output, is actually more efficient at small scale, and whether its capability gap could be closed by learning from large decoder-only models.
What's the solution?
The researchers systematically benchmarked encoder-decoder and decoder-only models of 1 billion parameters or fewer across GPU, CPU, and NPU platforms. On edge devices, the encoder-decoder models achieved 47% lower first-token latency and 4.7x higher throughput, gains they attribute to processing the input a single time and keeping the understanding and generation phases separate. To close the capability gap, they built a knowledge distillation framework in which a small encoder-decoder student learns from a large decoder-only teacher, improving results by up to 6 average performance points across diverse tasks, with the largest gains on tasks where the input and output differ in shape. They also combined the architecture with modern components such as Rotary Positional Embeddings (RoPE) and vision encoders.
Why it matters?
This research matters because it shows that the best architecture for a small model is not necessarily a scaled-down version of the best architecture for a huge one. If encoder-decoder models run faster and more cheaply on resource-constrained hardware, developers can deploy capable language models directly on devices rather than relying on the cloud. The study suggests that architectural choice becomes more important as parameter budgets shrink, challenging the current default of simply scaling down decoder-only designs for on-device and edge use.
Abstract
The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. These gains may be attributed to the encoder-decoder's one-time input processing and efficient separation of the understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large, scalable decoder-only teachers while preserving their architectural advantages, achieving up to a 6-point average performance improvement across diverse tasks, with significant gains in asymmetric sequence tasks where input and output distributions can benefit from different processing approaches. When combined with modern advances like Rotary Positional Embeddings (RoPE) and vision encoders, our systematic investigation demonstrates that encoder-decoder architectures provide a more practical path toward deploying capable language models in resource-constrained environments. Our findings challenge the prevailing trend toward decoder-only scaling, showing that architectural choices become increasingly crucial as parameter budgets decrease, particularly for on-device and edge deployments where computational efficiency is paramount.
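The intuition behind "one-time input processing" can be made concrete with a back-of-envelope attention-cost model. This is purely illustrative: the paper's 47% latency and 4.7x throughput figures come from hardware measurements, and the layer counts and proportional cost assumptions below are hypothetical, not taken from the paper.

```python
# Back-of-envelope attention-cost model (illustrative assumptions only).

def prefill_cost(n_input, n_layers):
    # Quadratic self-attention over the prompt, paid before the first
    # output token appears: this dominates first-token latency.
    return n_layers * n_input * n_input

def decode_step_cost(n_input, t, n_layers):
    # Generating output token t attends over all n_input + t cached
    # positions in each layer the token passes through (for the
    # encoder-decoder, the n_input term is cross-attention over the
    # encoder states, which were computed once).
    return n_layers * (n_input + t)

# Parameter-matched comparison: 24 decoder-only layers vs. a
# hypothetical 12-layer encoder feeding a 12-layer decoder.
n_input = 512
dec_only_first = prefill_cost(n_input, 24)
enc_dec_first = prefill_cost(n_input, 12)   # input crosses only the encoder

dec_only_step = decode_step_cost(n_input, 100, 24)
enc_dec_step = decode_step_cost(n_input, 100, 12)  # output crosses only the decoder

print(enc_dec_first / dec_only_first)  # 0.5: input processed once, in half the stack
print(enc_dec_step / dec_only_step)    # 0.5: each generated token touches fewer layers
```

Under this crude model, splitting a fixed parameter budget into an encoder and a decoder halves both the prefill work before the first token and the per-token generation work, which is the qualitative shape of the advantage the abstract reports; the real ratios depend on hidden sizes, hardware, and caching details.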