HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
2025-02-18
Summary
This paper introduces HermesFlow, a framework for multimodal AI models that handle both text and images. The key idea is to help these models' image-generation skills catch up with their image-understanding skills, which the authors find are consistently stronger.
What's the problem?
Unified multimodal models like Show-o, Transfusion, and Emu3 can both understand images and generate them. But the researchers uncovered a common pattern: these models are typically much better at understanding images than at generating them, leaving a significant gap between the two capabilities within the same model.
What's the solution?
The researchers built HermesFlow, a simple and general framework for closing this gap. Starting from the same (homologous) input data, it curates matched preference data for both understanding and generation. It then trains the model with a technique called Pair-DPO, combined with self-play iterative optimization, so that understanding and generation are aligned with each other and the stronger understanding side helps pull up the weaker generation side.
Why it matters?
This matters because a unified multimodal model is only as useful as its weakest skill. Experiments show HermesFlow clearly outperforms prior methods at narrowing the understanding-generation gap, suggesting it could serve as a general alignment recipe for next-generation multimodal foundation models that both interpret and create images reliably.
Abstract
The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion, and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using this homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
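The abstract does not spell out the Pair-DPO objective, but one plausible reading is that it jointly applies a standard DPO-style preference loss to an understanding pair and a generation pair curated from the same homologous input. The sketch below illustrates that reading; the function names `dpo_loss` and `pair_dpo_loss`, the weighting factor `lam`, and the specific numbers are illustrative assumptions, not the paper's actual formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred sample.
    ref_logp_w / ref_logp_l: the same log-probs under the frozen reference model.
    The loss shrinks as the policy widens its preference margin over the reference.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

def pair_dpo_loss(und_pair, gen_pair, beta=0.1, lam=1.0):
    """Hypothetical Pair-DPO: sum DPO losses over an understanding pair and a
    generation pair curated from the same (homologous) input, so the two
    capabilities are optimized together rather than separately."""
    return dpo_loss(*und_pair, beta=beta) + lam * dpo_loss(*gen_pair, beta=beta)

# Toy log-probs: (policy_win, policy_lose, ref_win, ref_lose) for each modality.
understanding_pair = (-1.0, -2.0, -1.5, -1.5)
generation_pair = (-0.5, -1.2, -0.8, -0.8)
loss = pair_dpo_loss(understanding_pair, generation_pair)
```

In self-play iterative optimization, the model's own outputs would be re-scored each round to curate fresh preference pairs, and this joint loss would be minimized again on the updated data.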