Privileged Information Distillation for Language Models
Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
2026-02-06
Summary
This paper explores how to teach AI agents to perform complex tasks even when we can't directly show them *how* to think, only *what* actions to take. It focuses on a technique called 'distillation', where a smart, already-trained AI (the teacher) passes its knowledge on to a new AI (the student).
What's the problem?
Normally, when you train an AI to be an agent, you give it examples of both the actions it should take *and* the reasoning behind those actions. But often, powerful AI systems are 'black boxes' – we can see what they do, but not *why* they do it. This makes it hard to train a new AI to mimic their behavior because we're missing crucial information about the thought process. Specifically, the paper tackles the issue of transferring knowledge from an AI that *has* access to extra helpful information during training (called 'privileged information') to one that doesn't have that information when it's actually used in the real world.
What's the solution?
The researchers developed two new methods to solve this problem. The first, called π-Distill, trains a 'teacher' version of the model (which sees the extra privileged information) and a 'student' version (which doesn't) at the same time, sharing the same underlying weights. The second, On-Policy Self-Distillation (OPSD), uses a reinforcement learning approach in which the student tries to act like the PI-conditioned teacher and is penalized when it strays too far from the teacher's behavior. Both methods rely only on the actions taken by the frontier AI, not its hidden reasoning, as the extra training signal; a rough sketch of the idea follows below.
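To make this concrete, here is a minimal, hypothetical sketch (in PyTorch / Hugging Face style) of what a joint π-Distill-style loss could look like. The function names, prompt formatting, and the mixing weight `alpha` are illustrative assumptions, not the paper's exact recipe:

```python
# Hypothetical sketch of a joint pi-Distill-style loss. One shared causal LM scores
# the same action tokens twice: once conditioned on privileged information (PI) plus
# the context (the "teacher" pass), and once on the context alone (the "student" pass).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def action_nll(model, tokenizer, prompt: str, action: str) -> torch.Tensor:
    """Negative log-likelihood of the action tokens given the prompt.

    Simplification: assumes the prompt tokenizes identically as a prefix of
    prompt + action, which holds for the toy example below.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + action, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    return model(input_ids=full_ids, labels=labels).loss


def pi_distill_style_loss(model, tokenizer, context: str, privileged_info: str,
                          teacher_action: str, alpha: float = 0.5) -> torch.Tensor:
    """Joint objective on one shared model; `alpha` is a hypothetical mixing weight."""
    # "Teacher" pass: the model sees the privileged information plus the context.
    teacher_loss = action_nll(model, tokenizer, privileged_info + "\n" + context, teacher_action)
    # "Student" pass: the same model sees only the context.
    student_loss = action_nll(model, tokenizer, context, teacher_action)
    return alpha * teacher_loss + (1.0 - alpha) * student_loss


if __name__ == "__main__":
    # Any causal LM works here; "gpt2" is just a small, convenient stand-in.
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    loss = pi_distill_style_loss(
        model, tokenizer,
        context="Observation: report.txt is missing.\nAction:",
        privileged_info="Hint taken from a successful frontier-agent run: create the file first.",
        teacher_action=" create_file('report.txt')",
    )
    loss.backward()  # one shared set of parameters receives both gradients
```

The point the sketch illustrates is simply that a single set of parameters receives gradients from both the PI-conditioned (teacher) pass and the unconditioned (student) pass, so the ability learned with the extra information is pushed into the policy that must act without it.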
Why it matters?
These techniques are important because they let us learn from the best AI systems even when those systems are complex and don't reveal their inner workings. This is especially useful for building AI agents that can handle long, complicated tasks in settings where step-by-step reasoning traces simply aren't available. The new methods outperform the standard pipeline of supervised fine-tuning followed by reinforcement learning, meaning we can build more capable AI agents more efficiently.
Abstract
Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. To address this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. We also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains with reinforcement learning (RL) using a reverse-KL penalty between the student and the PI-conditioned teacher. We show that both algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill, and in some cases OPSD, outperforms the industry-standard practice of supervised fine-tuning followed by RL, which assumes access to full chain-of-thought supervision, across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis of the factors that enable effective learning with PI, focusing primarily on π-Distill and identifying when OPSD is competitive.
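For readers who prefer equations, here is one plausible way to write down the two objectives sketched in the abstract. The notation is illustrative, not taken from the paper: x is the task context, z the privileged information, a an action trajectory, r a task reward, and α, β are assumed weighting coefficients; writing the OPSD teacher with the same parameters θ follows from the name 'self-distillation' but is also an assumption.

```latex
% Illustrative notation only; the paper's exact losses and weights are not reproduced here.

% pi-Distill: one set of parameters theta is supervised on the same actions twice,
% with and without the privileged information z.
\[
\mathcal{L}_{\pi\text{-Distill}}(\theta)
  = \mathbb{E}_{(x,\,z,\,a)}\!\left[
      -\alpha \log \pi_\theta(a \mid x, z)
      \;-\; (1-\alpha) \log \pi_\theta(a \mid x)
    \right]
\]

% OPSD: reinforcement learning on the student policy with a reverse-KL penalty
% toward the PI-conditioned teacher.
\[
\mathcal{J}_{\mathrm{OPSD}}(\theta)
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, a) \right]
  \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_\theta(\cdot \mid x, z) \right)
\]
```

Here the KL term is written in the 'reverse' direction mentioned in the abstract: the student's distribution is measured against the PI-conditioned teacher's, so the penalty grows when the student assigns probability to actions the teacher would not take.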