Dr.LLM: Dynamic Layer Routing in LLMs

Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh

2025-10-15

Summary

This paper introduces Dr.LLM, a new way to make large language models (LLMs) more efficient without sacrificing their accuracy. It focuses on intelligently deciding which parts of the LLM need to be used for each specific question or task.

What's the problem?

Currently, LLMs process every single piece of information (called a token) through *all* of their layers, even when the task is simple. This wastes computing power. Some methods try to adapt how many layers are used, but they typically require costly extra work, such as searching for the best setup at inference time, changing the model's architecture, or retraining it at large scale, and in practice they often lose accuracy even as they gain efficiency.

What's the solution?

Dr.LLM adds a small 'router' to each layer of an existing, pre-trained LLM. Each router decides whether to skip its layer, run it once, or repeat it. The routers are trained with explicit supervision: Monte Carlo Tree Search (MCTS) is used to find layer configurations that preserve or improve accuracy while staying within a compute budget, and the routers learn to reproduce those choices. The researchers also added specific design choices, such as windowed pooling and a class-balanced focal loss, to keep the routers reliable on imbalanced decisions and long inputs.

Why it matters?

This work is important because it shows you can make LLMs significantly faster and more efficient *without* needing to change the core model or retrain it from scratch. Dr.LLM improves accuracy on some tasks, maintains accuracy on others, and works well even when faced with new, unseen problems. This means we can potentially use powerful LLMs more easily and affordably.

Abstract

Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
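The abstract mentions training the routers with focal loss and class balancing, which matters because the three actions (skip, execute, repeat) are not equally common in the MCTS-derived supervision. A minimal numpy sketch of that loss, following the standard focal loss formulation FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the per-class weights and gamma value here are illustrative, not the paper's settings:

```python
import numpy as np

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Class-balanced focal loss over router actions.

    logits:  (N, 3) raw scores for skip / execute / repeat
    targets: (N,) integer action labels
    alpha:   per-class weights countering class imbalance
    gamma:   focusing parameter; gamma=0 recovers weighted cross-entropy
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=int)
    # numerically stable softmax
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    pt = probs[np.arange(len(targets)), targets]   # prob of the true action
    at = np.asarray(alpha, dtype=float)[targets]   # per-class weight
    return float(np.mean(-at * (1.0 - pt) ** gamma * np.log(pt + 1e-12)))

# hypothetical usage: easy (confident) examples are down-weighted by gamma
logits = np.array([[6.0, 0.0, 0.0],   # confidently "skip"
                   [0.2, 0.1, 0.0]])  # uncertain
targets = np.array([0, 2])
loss = focal_loss(logits, targets, alpha=[0.5, 1.0, 2.0])
```

With gamma > 0, confidently correct predictions contribute almost nothing to the loss, so training focuses on the hard or rare routing decisions; alpha additionally up-weights the minority actions.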