Self-Improvement in Multimodal Large Language Models: A Survey

Shijian Deng, Kai Wang, Tianyu Yang, Harsh Singh, Yapeng Tian

2025-10-06

Summary

This paper reviews how Large Language Models that can understand both text and images are learning to improve themselves without needing a ton of extra work from people.

What's the problem?

Large Language Models are getting really good, but they usually need humans to keep feeding them new information or tweaking how they work. Models have gotten better at improving themselves using *just* text, but no one had pulled together a big-picture view of how this works when a model also handles images. That's a missed opportunity, because images are a different kind of data that could help these models learn and become more generally intelligent.

What's the solution?

The researchers looked at all the existing work on self-improvement for these text-and-image models. They broke the different approaches down into three main areas: how the models *collect* new data, how that data is *organized* to be useful, and how the model itself is *optimized* to learn from it (a loop sketched below). They also covered how people evaluate these models and what they're being used for.
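
To make that three-step loop concrete, here's a minimal Python sketch. It is not code from the survey; the `StubMLLM` class and its `answer`, `critique`, and `finetune` methods are hypothetical placeholders standing in for a real multimodal model, a filtering step, and a fine-tuning routine.

```python
# Hypothetical sketch of one self-improvement round for a multimodal model.
# Every model call below is a placeholder, not an API from the paper.

from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    image_path: str   # input image
    question: str     # prompt about the image
    answer: str       # model-generated answer
    score: float      # self-assigned quality score


class StubMLLM:
    """Trivial stand-in so the sketch runs end to end; a real MLLM goes here."""

    def answer(self, image_path: str, question: str) -> str:
        return "a placeholder answer"

    def critique(self, image_path: str, question: str, answer: str) -> float:
        return 0.9  # a real system would score grounding, factuality, etc.

    def finetune(self, triples: List[tuple]) -> None:
        print(f"fine-tuning on {len(triples)} curated examples")


def collect(model, image_path: str, question: str, n: int = 4) -> List[str]:
    """Step 1, data collection: the model generates its own candidate answers."""
    return [model.answer(image_path, question) for _ in range(n)]


def organize(model, image_path: str, question: str,
             candidates: List[str]) -> List[Sample]:
    """Step 2, data organization: score candidates and keep only the good ones."""
    samples = [
        Sample(image_path, question, c, model.critique(image_path, question, c))
        for c in candidates
    ]
    return [s for s in samples if s.score >= 0.8]  # threshold is illustrative


def optimize(model, dataset: List[Sample]) -> None:
    """Step 3, model optimization: fine-tune on the curated self-generated data."""
    model.finetune([(s.image_path, s.question, s.answer) for s in dataset])


def self_improvement_round(model, tasks) -> None:
    """One full loop: collect -> organize -> optimize, with no human labels."""
    curated: List[Sample] = []
    for image_path, question in tasks:
        candidates = collect(model, image_path, question)
        curated.extend(organize(model, image_path, question, candidates))
    optimize(model, curated)


if __name__ == "__main__":
    self_improvement_round(StubMLLM(), [("cat.jpg", "What animal is this?")])
```

Real systems differ a lot at each step (for example, candidates might be filtered by an external verifier model instead of self-critique), but most of the methods the survey organizes fit some version of this three-part loop.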

Why it matters?

This research is important because it's the first comprehensive look at self-improvement in models that handle both text and images. By organizing the current methods, it helps researchers understand what's already been done and where to focus their efforts to build even more powerful and adaptable AI systems that can learn from a wider range of information.

Abstract

Recent advancements in self-improvement for Large Language Models (LLMs) have efficiently enhanced model capabilities without significantly increasing costs, particularly in terms of human effort. While this area is still relatively young, its extension to the multimodal domain holds immense potential for leveraging diverse data sources and developing more general self-improving models. This survey is the first to provide a comprehensive overview of self-improvement in Multimodal LLMs (MLLMs). We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications. Finally, we conclude by outlining open challenges and future research directions.