
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen

2025-02-26


Summary

This paper introduces OmniAlign-V, a new dataset designed to make AI models that understand both text and images (called multi-modal large language models, or MLLMs) better at responding in ways that humans prefer.

What's the problem?

Current AI models that work with both text and images are very good at basic tasks, but they struggle to understand and respond in ways that align with human preferences and values. As a result, they may give answers that are technically correct but not what people actually want or find helpful.

What's the solution?

The researchers created OmniAlign-V, a large dataset of 200,000 high-quality examples featuring diverse images, complex questions, and varied response formats. They also built MM-AlignBench, a human-annotated benchmark for checking how well AI models match human values. By training AI models on this new dataset, the researchers made the models better at aligning with human preferences without losing their ability to perform well on standard tasks.
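One of the training methods the paper uses on this data is Direct Preference Optimization (DPO), which learns from pairs of a preferred and a rejected response. As a rough illustration of the idea (not the paper's actual implementation), the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities under the model being trained and a frozen reference model; the function name and inputs are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of the chosen (preferred)
    or rejected response under the trainable policy or the frozen
    reference model. beta scales the implicit reward margin.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): the loss falls as the policy assigns
    # relatively more probability to the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; as the policy shifts probability toward the preferred response, the loss decreases. In practice this is computed over batches with automatic differentiation, but the per-pair math is the same.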

Why it matters?

This matters because, as AI becomes more common in daily life, we need it to understand and respond to us in ways that feel natural and align with our values. Models that are better at this can become more helpful and trustworthy assistants for complex situations involving both text and images, leading to better AI tools for education, customer service, and other areas where understanding human preferences is crucial.

Abstract

Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.