
On Domain-Specific Post-Training for Multimodal Large Language Models

Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang

2024-12-02


Summary

This paper develops a method to adapt multimodal large language models (MLLMs) to specific fields, such as scientific and industrial domains, through a process called post-training.

What's the problem?

While general MLLMs have made significant progress, they often struggle in specialized areas because they are not tailored to the unique terminology and requirements of those fields. This limits their effectiveness on tasks that demand domain-specific knowledge.

What's the solution?

The authors introduce a systematic approach covering three key areas: data synthesis, the training pipeline, and task evaluation. Using open-source models, they build a visual instruction synthesizer that generates diverse instruction tasks from domain-specific image-caption pairs; these synthetic tasks improve domain performance more than tasks produced by manual rules, GPT-4, or GPT-4V. Instead of the common two-stage training process (first on image-caption pairs, then on visual instruction tasks), they use a single-stage training pipeline that increases task diversity during domain-specific post-training. Finally, they post-train MLLMs of different sources and scales and evaluate them on specialized tasks in two domains, biomedicine and food.
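To make the single-stage idea concrete, here is a minimal sketch of how captioning data and synthesized instruction tasks could be merged into one shuffled training set rather than trained on in two separate stages. The `caption_to_task` helper, the conversation-style field names, and the fixed prompt are illustrative assumptions, not the paper's actual schema or synthesizer.

```python
import random

def caption_to_task(pair):
    """Wrap an image-caption pair as a simple captioning instruction.

    This is a stand-in for the richer tasks the paper's visual
    instruction synthesizer would generate from the same pair.
    """
    return {
        "image": pair["image"],
        "conversations": [
            {"role": "user", "content": "Describe this image in detail."},
            {"role": "assistant", "content": pair["caption"]},
        ],
    }

def build_single_stage_dataset(caption_pairs, synthesized_tasks, seed=0):
    """Merge captioning examples and synthesized instruction tasks into
    ONE shuffled training set (single-stage), instead of training on
    captions first and instructions second (two-stage).
    """
    data = [caption_to_task(p) for p in caption_pairs] + list(synthesized_tasks)
    rng = random.Random(seed)  # fixed seed for a reproducible ordering
    rng.shuffle(data)
    return data
```

Because both data types share one format, every training batch can mix captioning and instruction examples, which is how the single-stage pipeline raises task diversity.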

Why it matters?

This research matters because it makes AI models more useful in professional settings where specialized knowledge is crucial. Improving how these models adapt to specific fields can lead to better support for healthcare providers, scientists, and other professionals who rely on accurate, domain-tailored information.

Abstract

Recent years have witnessed the rapid development of general multimodal large language models (MLLMs). However, adapting general MLLMs to specific domains, such as scientific fields and industrial applications, remains less explored. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs. (2) Training Pipeline: While the two-stage training--initially on image-caption pairs followed by visual instruction tasks--is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks. To support further research in MLLM domain adaptation, we will open-source our implementations.