Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt

2025-10-07

Summary

This paper explores whether an AI model can improve its skills by learning *while* it is being used, much as humans learn on the job.

What's the problem?

Improving a model typically requires large amounts of carefully prepared training data, which takes considerable human time and effort. The question this paper addresses is how to get a model to keep learning and improving *without* people having to constantly create new training examples for it.

What's the solution?

The researchers built a system in which the model itself decides which practice problems are most helpful *during* the task it is trying to solve. In effect, the model assembles its own personalized curriculum, choosing the most relevant examples from a larger pool of available data to focus on. This approach is called 'test-time curriculum reinforcement learning', or TTC-RL. The model then uses these selected examples to continue training itself, via reinforcement learning, as it works on the main task.
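To make the loop concrete, here is a minimal, self-contained Python sketch of the general idea: pick the pool examples most similar to the target task, practice on them with a reward signal, then attempt the real task. Every piece of it (the hash-based toy embedding, the ToyPolicy class, the scalar reward update) is a hypothetical stand-in for illustration, not the paper's actual models, data selection, or RL algorithm.

```python
import hashlib
import random
import numpy as np

def embed(text, dim=64):
    # Toy deterministic "embedding": a hash-seeded random unit vector.
    # A real system would use a learned text encoder here.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def select_curriculum(target_task, pool, k=3):
    # Curriculum selection: keep the k pool problems whose embeddings
    # are most similar to the target task's embedding.
    t = embed(target_task)
    return sorted(pool, key=lambda p: -float(embed(p) @ t))[:k]

class ToyPolicy:
    # Stand-in for an LLM policy: a single scalar "skill" parameter.
    def __init__(self):
        self.skill = 0.0

    def attempt(self, problem):
        # Success probability rises with skill (a real policy would
        # actually condition on `problem` and generate a solution).
        return random.random() < 1.0 / (1.0 + np.exp(-self.skill))

    def rl_update(self, reward, lr=0.1):
        # Crude reward-following update, standing in for a
        # policy-gradient step on verifiable rewards.
        self.skill += lr * (reward - 0.5)

def ttc_rl(policy, target_task, pool, steps=200):
    curriculum = select_curriculum(target_task, pool)
    for _ in range(steps):
        problem = random.choice(curriculum)            # practice problem
        reward = 1.0 if policy.attempt(problem) else 0.0
        policy.rl_update(reward)                       # learn at test time
    return policy.attempt(target_task)                 # answer the real task

pool = [f"practice problem {i}" for i in range(100)]
print("solved target:", ttc_rl(ToyPolicy(), "target math problem", pool))
```

The key structural point is that selection happens per target task, so the curriculum is rebuilt for every new problem the model faces.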

Why does it matter?

This research matters because it shows a way to make models more adaptable and capable without constant human intervention. The experiments showed large gains on difficult math and coding benchmarks (roughly doubling Qwen3-8B's pass@1 on AIME25 and CodeElo), suggesting that this 'learn-as-you-go' approach could help models reach a higher level of skill and handle more complex tasks.

Abstract

Humans are good at learning on the job: We learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences during test-time.
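For readers unfamiliar with the pass@1 and pass@8 numbers above: pass@k is the probability that at least one of k sampled attempts solves a problem. A common way to compute it is the unbiased estimator introduced with the HumanEval benchmark (Chen et al., 2021); the snippet below shows that estimator only, not this paper's evaluation harness.

```python
import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate: the chance that at least one of k
    # samples drawn (without replacement) from n attempts is correct,
    # given that c of the n attempts were correct.
    if n - c < k:
        return 1.0  # too few failures for k draws to all miss
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 16 attempts per problem, 7 of them correct.
print(pass_at_k(16, 7, 1))  # 0.4375   (= 7/16)
print(pass_at_k(16, 7, 8))  # ~0.9993  (8 tries almost surely hit a success)
```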