
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye

2025-12-26


Summary

This paper introduces a new method, GTR-Turbo, to improve how artificial intelligence agents learn from both images and language. It focuses on making these agents better at completing tasks over multiple steps, like following instructions in a visual environment.

What's the problem?

Teaching AI agents to perform complex tasks that require many steps is difficult because they often don't get feedback until the very end, making it hard to know which actions were good or bad. Existing solutions try to get more frequent feedback by asking a powerful 'teacher' AI model at each step, but these teacher models are expensive to query and not everyone has access to them, which hinders research and practical applications. In addition, previous methods sometimes caused "entropy collapse," where the agent becomes overly cautious and stops exploring different strategies.

What's the solution?

GTR-Turbo solves this by creating a 'free' teacher model directly from the agent's own learning process. Instead of relying on a separate, expensive AI, it merges the weights of checkpoints, the snapshots of the agent saved as reinforcement learning progresses, into a single combined model. This merged model then guides the agent's further training, via supervised fine-tuning or soft logit distillation, much like a tutor providing hints. This approach removes the need for a privileged teacher model, prevents the agent from becoming too cautious, and keeps training stable.
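
For intuition, here is a minimal sketch of the checkpoint-merging step in PyTorch. The uniform averaging, the function name, and the checkpoint paths are illustrative assumptions; the paper's exact merging scheme is not reproduced here.

```python
# Minimal sketch (assumption): merge checkpoints from the same RL run by
# uniformly averaging their parameters to build a "free" teacher model.
import torch

def merge_checkpoints(state_dicts):
    """Average corresponding parameters across several checkpoints."""
    merged = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in state_dicts], dim=0)
        merged[name] = stacked.mean(dim=0)
    return merged

# Hypothetical usage: load snapshots saved during training and average them.
# paths = ["ckpt_step_1000.pt", "ckpt_step_2000.pt", "ckpt_step_3000.pt"]
# state_dicts = [torch.load(p, map_location="cpu") for p in paths]
# teacher_state = merge_checkpoints(state_dicts)
```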

Why it matters?

This research is important because it makes multi-step AI learning more accessible and efficient. By removing the dependency on costly teacher models, more researchers and developers can work on building intelligent agents. The reported gains, roughly 10-30% higher accuracy at about half the wall-clock training time, mean these agents can learn complex tasks more effectively, potentially leading to advancements in areas like robotics and virtual assistants.

Abstract

Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
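
As a rough illustration of the "soft logit distillation" option mentioned in the abstract, the sketch below computes a KL-divergence loss from the merged teacher's token distribution to the student policy's. The temperature, function name, and T-squared scaling follow standard distillation practice and are assumptions, not the authors' exact objective.

```python
# Minimal sketch (assumption): soft logit distillation from a merged-checkpoint
# teacher to the student policy over next-token logits.
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over token distributions, averaged per batch."""
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # log_target=True lets kl_div take the teacher distribution in log space;
    # the T^2 factor is the conventional scaling in temperature distillation.
    return F.kl_div(s_log_probs, t_log_probs, log_target=True,
                    reduction="batchmean") * (temperature ** 2)
```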