Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence
Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum
2025-11-11
Summary
This paper explores a way to make large language models more efficient at inference time without losing their ability to perform well, focusing specifically on math problems.
What's the problem?
Large language models are incredibly powerful, but they require a huge amount of computing power both when they are being trained and when they are being used. This makes them expensive and limits who can access and utilize them. The goal is to reduce the number of parameters a model needs while still allowing it to spend more computation at test time, without sacrificing accuracy.
What's the solution?
The researchers took existing pretrained language models that apply each of their layers exactly once and converted them into 'depth-recurrent' models. A depth-recurrent model reuses a shared block of layers multiple times in a loop, so it can effectively 'think deeper' at test time without needing more parameters. Rather than switching the model over all at once, they used a curriculum that gradually increased the number of recurrences over the course of training, which helped preserve performance while reducing the total computational cost of the conversion.
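The idea of looping a shared block and gradually raising the loop count can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the function names, the toy block, and the linear schedule are all hypothetical stand-ins.

```python
def shared_block(h):
    # Stand-in for a shared transformer layer block; here just a toy update.
    return [0.5 * x + 1.0 for x in h]

def depth_recurrent_forward(h, num_recurrences):
    # Instead of N distinct layers, apply one shared block r times,
    # so effective depth scales with r rather than parameter count.
    for _ in range(num_recurrences):
        h = shared_block(h)
    return h

def recurrence_curriculum(step, total_steps, min_r=1, max_r=8):
    # Hypothetical linear schedule: start shallow and increase the
    # number of recurrences over training so effective depth grows gradually.
    frac = step / max(total_steps, 1)
    return min_r + round(frac * (max_r - min_r))
```

At test time, `num_recurrences` can be raised beyond its training value, which is what decouples test-time compute from the trained parameter count.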
Why it matters?
This work is important because it shows a promising method for making powerful language models more accessible and affordable. By reducing the computing power needed to run these models, especially for tasks like mathematics where remembering previous steps is crucial, it opens the door for wider use and further advancements in the field.
Abstract
Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.