MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch

2024-10-01

Summary

This paper presents MM1.5, a new family of multimodal large language models (MLLMs) that improve how machines understand and reason about images and text together.

What's the problem?

Current multimodal models struggle with tasks that require understanding complex, text-rich images, such as reading documents, interpreting charts, or pointing to specific regions of a picture. Performing well on these tasks requires better training methods and more carefully chosen training data.

What's the solution?

MM1.5 addresses these issues with a data-centric approach: building on the MM1 architecture, it carefully selects and combines different types of data at each stage of training. High-quality optical character recognition (OCR) data and synthetic image captions are used during continual pre-training, and an optimized mix of visual instruction-tuning data is used during supervised fine-tuning, which improves the model's ability to understand text-rich images. The family spans smaller models (1 billion parameters) to larger ones (30 billion parameters), in both dense and mixture-of-experts forms, and includes specialized versions for video understanding (MM1.5-Video) and mobile user interface understanding (MM1.5-UI).
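To make the idea of a "data mixture" concrete, here is a minimal sketch of weighted sampling across data categories during instruction tuning. The category names and weights below are illustrative assumptions, not the actual ratios used in MM1.5.

```python
import random

# Illustrative data categories and mixture weights (hypothetical values,
# not the ratios reported in the MM1.5 paper).
mixture = {
    "text_rich_ocr": 0.3,        # documents, charts, text-heavy images
    "general_vqa": 0.4,          # natural-image question answering
    "referring_grounding": 0.2,  # region-level referring and grounding data
    "text_only": 0.1,            # plain-text instruction data
}

def sample_category(mixture):
    """Pick a data category in proportion to its mixture weight."""
    categories = list(mixture.keys())
    weights = list(mixture.values())
    return random.choices(categories, weights=weights, k=1)[0]

# Each training step draws its next example from the sampled category,
# so the mixture ratios directly shape what the model sees during training.
counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_category(mixture)] += 1
print(counts)  # roughly proportional to the weights above
```

Tuning these proportions for each training stage is the kind of decision the paper studies systematically through ablations.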

Why it matters?

This research matters because it strengthens the ability of multimodal models to interpret and analyze visual information alongside text. Better handling of complex visual data, from documents and charts to videos and app screens, could improve real-world applications in areas such as education, healthcare, and content creation.

Abstract

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.
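The abstract mentions both dense and mixture-of-experts (MoE) variants. As background, here is a generic, simplified sketch of a top-k MoE feed-forward block, in which a router sends each token to a small subset of expert networks; the layer sizes are arbitrary and this is not MM1.5's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Generic top-k mixture-of-experts feed-forward block (illustrative only)."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(MoEFeedForward()(tokens).shape)  # torch.Size([16, 512])
```

Because only the top-k experts run for each token, an MoE model can hold many more parameters than a dense one at a similar per-token compute cost, which is why the paper reports both kinds of variants.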