Improving Recursive Transformers with Mixture of LoRAs

Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian

2025-12-19

Summary

This paper introduces a new way to build smaller, more efficient language models called Mixture of LoRAs, or MoL, and applies it to a revamped version of a recursive transformer architecture named ModernALBERT.

What's the problem?

Recursive transformers shrink a model by reusing the same parameters across layers, but this efficiency comes at a cost: when every layer applies identical weights, each layer loses its unique contribution to understanding the data, making the model less expressive and limiting its ability to learn complex patterns.

What's the solution?

The researchers tackled this problem by inserting small, adaptable modules called LoRA 'experts' into the shared feed-forward parts of the transformer. A router weights these experts differently for each input token, allowing the model to adjust its behavior without untying the main, shared parameters. They also built a new, improved transformer architecture, ModernALBERT, which incorporates several modern techniques to boost performance. Finally, they developed a way to merge these experts into a single module for faster use after training.
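The idea of token-conditional weight-space modulation can be illustrated with a small sketch. Note that the dimensions, the softmax router, and the use of a plain linear layer in place of the paper's full FFN are all illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 4  # hidden size, LoRA rank, expert count (toy values)

# Shared FFN weight, tied across all recursive layers.
W = rng.normal(size=(d, d)) * 0.1

# LoRA experts: expert i contributes a low-rank update B[i] @ A[i].
A = rng.normal(size=(n_experts, r, d)) * 0.1
B = rng.normal(size=(n_experts, d, r)) * 0.1

# Hypothetical router mapping a token vector to per-expert gates.
W_router = rng.normal(size=(n_experts, d)) * 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mol_ffn(x):
    """Apply the shared weight plus a token-conditional low-rank update."""
    g = softmax(W_router @ x)  # gates depend on this token
    delta = sum(g[i] * (B[i] @ A[i]) for i in range(n_experts))
    return (W + delta) @ x     # shared backbone stays untied

x = rng.normal(size=d)
y = mol_ffn(x)
```

The key point is that the shared matrix `W` is never duplicated per layer; only the cheap low-rank experts and the router vary the computation per token.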

Why it matters?

This work is important because it shows how to create powerful language models that are small enough to run on less powerful hardware. By restoring expressivity to parameter-shared transformers, MoL allows for state-of-the-art performance with significantly fewer parameters than larger, more resource-intensive models, making advanced language processing more accessible and efficient.

Abstract

Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M--120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
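The expert-merging step described in the abstract can be sketched as follows. This is a minimal illustration assuming the gates are averaged over a calibration set and the resulting dense update is re-factorised with a truncated SVD; the paper's actual merging procedure may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_experts = 8, 2, 4  # toy dimensions

A = rng.normal(size=(n_experts, r, d))
B = rng.normal(size=(n_experts, d, r))

# Hypothetical average gate activations collected over a calibration set.
g_bar = np.array([0.4, 0.3, 0.2, 0.1])

# Merged dense update: sum_i g_bar[i] * B[i] @ A[i].
delta = sum(g_bar[i] * (B[i] @ A[i]) for i in range(n_experts))

# Re-factorise into a single low-rank adapter via SVD.
# A sum of n rank-r terms has rank at most n*r, so this is exact here.
U, S, Vt = np.linalg.svd(delta)
r_merged = n_experts * r
B_m = U[:, :r_merged] * S[:r_merged]
A_m = Vt[:r_merged]
```

At inference, only the single pair `(B_m, A_m)` is kept, so the per-token routing cost disappears while the averaged mixture is preserved.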