TEMPO: Scaling Test-time Training for Large Reasoning Models
Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang
2026-04-22
Summary
This paper introduces a new method called TEMPO that helps large language models (LLMs) keep learning while they're being *used* – not just during their initial training. The goal is for these models to keep getting better at tasks even after they've left the lab, using the unlabeled data they encounter in the real world.
What's the problem?
Current methods for letting LLMs learn on the fly, called 'test-time training,' quickly stop improving. They get stuck because the system used to judge how well the model is doing becomes unreliable as the model changes. Imagine trying to improve your basketball shot, but the person giving you feedback keeps changing their mind about what a good shot looks like – you wouldn't get much better! This also causes the model to start giving very similar answers to everything, losing its ability to be creative or handle different situations.
What's the solution?
TEMPO solves this by alternating between two steps. First, it lets the LLM practice and improve its answers on new, unlabeled questions. Second, and crucially, it periodically checks back in with a smaller, reliable set of *labeled* data to make sure the judging system (the 'critic') is still accurate. This is like occasionally asking a trusted coach to review your form and give you consistent feedback. The researchers also show that this back-and-forth process is an instance of a well-known statistical technique called Expectation-Maximization (EM), and that previous methods were effectively running EM while skipping this important 'check-in' step.
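The alternating structure can be sketched in a few lines of toy Python. This is a hypothetical illustration, not the authors' implementation: the policy and critic are reduced to scalar values (`policy`, `critic_bias`) so that only the control flow matters – refine the policy on unlabeled batches using the self-generated reward, and every few steps recalibrate the drifting critic against a labeled anchor.

```python
# Toy sketch of TEMPO's alternating loop (hypothetical names, not the paper's code).
# The policy and critic are scalars so the control flow is visible without ML machinery.

def tempo_loop(unlabeled_batches, labeled_anchor, recalib_every=4):
    policy, critic_bias = 0.0, 0.0
    history = []
    for step, batch in enumerate(unlabeled_batches, start=1):
        # Policy-refinement step (E-step analogue): score self-generated
        # answers with the current, possibly drifting, critic and nudge
        # the policy toward higher reward.
        reward = batch + critic_bias          # self-generated reward signal
        policy += 0.1 * reward
        critic_bias += 0.05                   # left unchecked, the critic drifts

        # Periodic recalibration (M-step analogue): pull the critic back
        # to the labeled anchor so its judgments stay trustworthy.
        if step % recalib_every == 0:
            critic_bias = labeled_anchor

        history.append((policy, critic_bias))
    return history
```

Dropping the `if step % recalib_every == 0` branch recovers the "incomplete" variants the paper describes: `critic_bias` then grows without bound, which is the drift that causes the performance plateau.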
Why it matters?
This research is important because it allows LLMs to keep learning and improving throughout their lifespan, without needing constant retraining from scratch. This is especially useful in situations where you don't have a lot of labeled data, or where the task is constantly changing. The results show significant improvements on challenging reasoning tasks, meaning models can become much better at complex problem-solving, and they maintain a wider range of responses instead of getting stuck in a rut.
Abstract
Test-time training (TTT) adapts model parameters on unlabeled test instances during inference, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.