Unified Reinforcement and Imitation Learning for Vision-Language Models

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu

2025-10-23

Summary

This paper introduces a new way to train smaller, more efficient vision-language models (VLMs), AI systems that can understand both images and text, so that they perform almost as well as much larger, more powerful models.

What's the problem?

Large VLMs are very good at tasks involving images and text, but they require a lot of computing power and memory to run, making them impractical for devices with limited resources, such as phones or embedded systems. Essentially, they're too big and demanding to be used widely.

What's the solution?

The researchers developed a training method called Unified Reinforcement and Imitation Learning (RIL). It combines two learning approaches. First, the smaller 'student' VLM learns by trying to copy the outputs of larger, more capable 'teacher' VLMs. Second, it improves through a reward system, similar to how you might train a dog with treats, where it gets 'rewarded' for generating good text. A special component, an LLM-based discriminator, judges how closely the student's answers resemble the teachers', and multiple teachers provide diverse examples to learn from. A simplified sketch of how these two signals can be combined is shown below.
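
To make the idea concrete, here is a minimal, illustrative sketch of a single RIL-style training step in Python. This is not the authors' code: every name in it (teacher_generate, student_generate, discriminator_score, discriminator_update, task_reward, imitation_weight) is a hypothetical stand-in, and the "models" are trivial stubs so the example runs on its own.

```python
# Minimal, illustrative sketch of one RIL-style training step (not the authors' code).
# All function names and the imitation_weight parameter are hypothetical stand-ins.

import random

def teacher_generate(image, question):
    # Stand-in for a large teacher VLM producing a reference-quality answer.
    return f"teacher answer to '{question}'"

def student_generate(image, question):
    # Stand-in for the small student VLM being trained.
    return f"student answer to '{question}'"

def discriminator_score(answer):
    # Stand-in for the LLM-based discriminator: how teacher-like does the
    # student's answer look? (Random here, purely for illustration.)
    return random.random()

def discriminator_update(teacher_answer, student_answer):
    # Stand-in for training the discriminator to tell teacher outputs
    # apart from student outputs (the adversarial part; omitted here).
    pass

def task_reward(answer, reference):
    # Stand-in for a reinforcement signal, e.g. whether the answer is correct.
    return 1.0 if answer == reference else 0.0

def ril_step(image, question, reference, imitation_weight=0.5):
    """One illustrative step: blend an imitation reward with a task reward."""
    teacher_answer = teacher_generate(image, question)
    student_answer = student_generate(image, question)

    discriminator_update(teacher_answer, student_answer)

    r_imitation = discriminator_score(student_answer)  # "sounds like the teacher"
    r_task = task_reward(student_answer, reference)    # "is actually good"
    reward = imitation_weight * r_imitation + (1 - imitation_weight) * r_task

    # In the real method this reward would drive a policy-gradient update of
    # the student VLM; here we simply return it.
    return reward

print(ril_step(image=None, question="What is in the picture?", reference="a cat"))
```

The key point is the combined reward: part of it comes from producing answers the discriminator judges to be teacher-like (imitation), and part from the answers actually being good (reinforcement).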

Why it matters?

This work is important because it allows for the creation of VLMs that are almost as good as the best available models, but much smaller and more efficient. This means these models can be used in more places, like on your phone or in applications where powerful computers aren't available, opening up a wider range of possibilities for AI applications.

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.