Co-Evolving Policy Distillation

Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

2026-05-01

Summary

This paper investigates how to best combine different skills, such as reasoning over text, images, and videos, into a single AI model. It examines two common post-training methods, reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD), shows how each loses capability along the way, and proposes a new, improved approach called Co-Evolving Policy Distillation (CoPD).

What's the problem?

Currently, merging several specialized AI 'experts' into one model goes wrong in one of two ways. Training a single model with RLVR on a mixture of tasks causes the different capabilities to interfere and drift apart, so the model loses the strengths each expert would have had. Alternatively, fully training each expert first and then using OPD to distill them into one model avoids that drift, but the combined model never absorbs everything the experts knew, because by then the experts' behavior differs too much from the student's for knowledge to transfer cleanly. Either way, there is a gap in transferring knowledge effectively; a rough sketch of both recipes follows below.
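
To make the contrast concrete, here is a rough Python sketch of the two recipes. The structure follows the description above, but every function name (`rlvr_update`, `train_expert`, `distill_update`) is a hypothetical stand-in, not code from the paper.

```python
# Illustrative shapes of the two baseline recipes; the update functions
# are hypothetical stand-ins supplied by the caller.

def mixed_rlvr(model, domains, rlvr_update):
    # Recipe 1: one model, RLVR on all domains mixed together.
    # Failure mode: gradients from unlike tasks interfere, so the
    # model's capabilities diverge and degrade.
    for domain in domains:
        model = rlvr_update(model, domain)
    return model

def experts_then_opd(student, domains, train_expert, distill_update):
    # Recipe 2: fully train one expert per domain, then distill.
    # Failure mode: the finished experts behave too differently from
    # the student, so on-policy distillation transfers their knowledge
    # only partially.
    experts = [train_expert(domain) for domain in domains]  # stage 1: RLVR per domain
    for teacher in experts:                                 # stage 2: OPD into student
        student = distill_update(student, teacher)          # on the student's own rollouts
    return student
```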

What's the solution?

The researchers developed Co-Evolving Policy Distillation, or CoPD. Instead of training the experts separately and combining them afterwards, CoPD trains all the experts in parallel and applies OPD throughout training rather than only at the end. Crucially, the distillation runs in both directions: each expert serves as a teacher for the others while also learning from them. This constant exchange keeps the experts' behavioral patterns aligned while each still contributes complementary knowledge, so the combined model ends up capturing what every individual expert knows (see the sketch after this paragraph).
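
As a rough illustration of how such a training loop might look, below is a minimal PyTorch sketch of one CoPD-style step. This is a toy formulation under stated assumptions, not the authors' released code: the `generate`, `logprobs`, and `logits` methods on each expert are hypothetical, and the REINFORCE-style reward update and KL-based distillation loss are simplified stand-ins for whatever objectives the paper actually uses.

```python
# Toy single-step sketch of co-evolving policy distillation (hypothetical
# expert API; simplified losses). Each expert takes an RLVR step on its
# own domain, then distills from every OTHER expert on its own rollouts,
# which is what makes the distillation bidirectional across the group.

import torch
import torch.nn.functional as F

def rlvr_step(expert, optimizer, prompts, verify):
    """REINFORCE-style update with a verifiable 0/1 reward."""
    rollout = expert.generate(prompts)   # hypothetical: sampled responses
    reward = verify(rollout)             # tensor of 0/1 rewards per response
    loss = -(reward * expert.logprobs(rollout)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rollout

def copd_step(experts, optimizers, domain_prompts, verifiers, kd_weight=1.0):
    """One co-evolving step over all experts (assumes >= 2 experts)."""
    # 1) Each expert takes an RLVR step on its own domain.
    rollouts = [
        rlvr_step(e, opt, prompts, verify)
        for e, opt, prompts, verify in zip(experts, optimizers, domain_prompts, verifiers)
    ]
    # 2) Bidirectional OPD: expert i is the student on its own rollouts,
    #    with every other expert acting as a frozen teacher for this step.
    for i, student in enumerate(experts):
        student_logp = F.log_softmax(student.logits(rollouts[i]), dim=-1)
        kd_loss = 0.0
        for j, teacher in enumerate(experts):
            if j == i:
                continue
            with torch.no_grad():
                teacher_p = F.softmax(teacher.logits(rollouts[i]), dim=-1)
            kd_loss = kd_loss + F.kl_div(student_logp, teacher_p, reduction="batchmean")
        loss = kd_weight * kd_loss / (len(experts) - 1)
        optimizers[i].zero_grad()
        loss.backward()
        optimizers[i].step()
```

The point of the sketch is the loop structure rather than the exact losses: no expert ever finishes training before distillation begins, and every expert acts as both teacher and student at every step.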

Why it matters?

This research is important because it provides a more effective way to build AI systems that can handle a variety of tasks. CoPD significantly outperforms existing methods in combining different reasoning abilities, even surpassing experts trained specifically for a single task. Furthermore, the way CoPD trains models in parallel could lead to new strategies for scaling up AI training in the future, allowing us to build even more powerful and versatile AI systems.

Abstract

RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mixed RLVR suffers from inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though avoiding divergence, fails to fully absorb teacher capabilities due to large behavioral pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which encourages parallel training of experts and introduces OPD during each expert's ongoing RLVR training rather than after complete expert training, with experts serving as mutual teachers (making OPD bidirectional) to co-evolve. This enables more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The model parallel training pattern offered by CoPD may inspire a novel training scaling paradigm.