TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng

2026-04-29

Summary

This paper investigates a method for teaching smaller AI models to perform complex tasks by learning from larger, more capable models, specifically in multi-turn settings where the AI must interact with an environment over many steps to achieve a goal.

What's the problem?

When trying to transfer knowledge from a powerful AI to a smaller one in multi-step tasks, a key issue arises: the smaller AI's learning process becomes unstable. As it makes mistakes over multiple steps, it drifts further away from the guidance provided by the larger AI, making the learning signal less and less reliable. This leads to compounding errors and, ultimately, poor performance. The researchers call this 'Trajectory-Level KL Instability'.
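To make the instability concrete, here is a minimal sketch of how one might measure trajectory-level KL divergence between a student and teacher over the student's own rollout. This is an illustrative implementation, not the paper's code; the function names and tensor shapes are assumptions.

```python
import torch

def trajectory_kl(student_logits, teacher_logits, mask):
    """Per-trajectory KL(student || teacher), averaged over generated tokens.

    student_logits, teacher_logits: [T, vocab] logits scored on the
    student's own on-policy rollout; mask: [T], 1.0 for tokens the
    student generated, 0.0 otherwise. Names/shapes are illustrative.
    """
    s_logp = torch.log_softmax(student_logits, dim=-1)
    t_logp = torch.log_softmax(teacher_logits, dim=-1)
    # Token-level KL summed over the vocabulary: sum_v p(v) * (log p(v) - log q(v))
    token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    # Average over the student's generated tokens in this trajectory.
    return (token_kl * mask).sum() / mask.sum()
```

The paper's observation is that a quantity like this grows as trajectories get longer: once the student's rollout drifts outside the teacher's effective support, the teacher's log-probabilities stop being a reliable training signal.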

What's the solution?

To fix this, the researchers developed a technique called TCOD, which stands for Temporal Curriculum On-Policy Distillation. Essentially, TCOD doesn't throw the smaller AI into complex, multi-step tasks right away. Instead, it starts with very short sequences of actions and gradually increases the complexity, allowing the smaller AI to learn more reliably and avoid compounding errors. It's like learning to walk before you run.
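A rough sketch of the curriculum idea is shown below, assuming a simple linear schedule and a generic environment interface; the paper's exact schedule, turn budgets, and rollout loop are not specified here, so treat every name as hypothetical.

```python
def depth_cap(step, total_steps, min_turns=1, max_turns=10):
    """Linearly grow the maximum trajectory depth over training.

    The linear schedule and the turn counts are illustrative
    assumptions; the paper's actual schedule may differ.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return min_turns + int(frac * (max_turns - min_turns))

def rollout_with_cap(env, student, cap):
    """Roll out the student but truncate the trajectory at `cap` turns,
    so distillation only sees prefixes the teacher can still supervise.
    `env` and `student` are hypothetical interfaces for illustration."""
    obs = env.reset()
    trajectory = []
    for _ in range(cap):
        action = student.act(obs)
        obs, done = env.step(action)
        trajectory.append((obs, action))
        if done:
            break
    return trajectory
```

Early in training, `depth_cap` keeps rollouts to a turn or two, where the teacher's supervision is most reliable; as the student stabilizes, the cap grows until full-length trajectories are used.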

Why it matters?

This research is important because it makes it easier to create smaller, more efficient AI agents that can still perform complex tasks. By stabilizing the learning process, TCOD allows these smaller AIs to not only match the performance of their larger teachers but, in some cases, even exceed it and handle situations the teacher couldn't. This has implications for deploying AI in real-world applications where resources are limited.

Abstract

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.