
Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Weiyun Wang, Zhe Chen, Wenhai Wang, Wei Li, Shufei Zhang, Mao Su, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou

2024-08-15


Summary

This paper introduces ChemVLM, an open-source multimodal model that combines visual understanding with chemical knowledge to analyze and answer questions about chemical images and text.

What's the problem?

In chemistry, understanding both images (like molecular structures) and text (like reaction descriptions) is important, but existing models struggle to analyze the two kinds of information together. This makes it hard for researchers to extract insights from chemical data that mixes visual and textual content.

What's the solution?

The authors developed ChemVLM, which pairs a chemistry-specialized large language model (ChemLLM-20B) for text with a powerful image encoder (InternViT-6B) for visual data, connected in a ViT-MLP-LLM architecture (a minimal sketch of this pattern follows below). They trained the model on a curated bilingual multimodal question-answering dataset covering molecules, reaction formulas, and chemistry examination data, allowing it to answer questions that involve both images and text. The model was tested on multiple open-source benchmarks and three custom evaluation sets, achieving state-of-the-art results in five of the six tasks.
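To make the ViT-MLP-LLM pattern concrete: vision tokens from the image encoder are passed through an MLP projector into the language model's embedding space and then concatenated with the text embeddings, so a decoder-only LLM can attend over both modalities. The toy sketch below illustrates only this general pattern; the class name, dimensions, and random tensors are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps image-encoder features into the LLM's embedding space.

    Mirrors the generic ViT-MLP-LLM connector the paper builds on
    (InternViT-6B -> MLP -> ChemLLM-20B); dimensions here are toy
    values, not the real model's.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim)
        return self.mlp(vision_tokens)

# Project "image tokens" and prepend them to text embeddings so the
# language model sees one combined sequence.
projector = VisionProjector()
vision_tokens = torch.randn(1, 256, 1024)  # stand-in for image-encoder output
text_embeds = torch.randn(1, 32, 4096)     # stand-in for LLM token embeddings
multimodal_input = torch.cat([projector(vision_tokens), text_embeds], dim=1)
print(multimodal_input.shape)  # torch.Size([1, 288, 4096])
```

A lightweight MLP connector like this is the standard glue in this family of models: it lets a pretrained vision tower and a pretrained language model be combined without redesigning either one.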

Why it matters?

This research matters because it bridges the gap between visual and textual information in chemistry, making complex data easier for scientists to analyze. By providing an effective tool for interpreting chemical information, ChemVLM can enhance both research and education in chemistry.

Abstract

In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the fields of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. Built upon the VIT-MLP-LLM architecture, we leverage ChemLLM-20B as the foundational large model, endowing our model with robust capabilities in understanding and utilizing chemical text knowledge. Additionally, we employ InternVIT-6B as a powerful image encoder. We have curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled these into a bilingual multimodal question-answering dataset. We test the performance of our model on multiple open-source benchmarks and three custom evaluation sets. Experimental results demonstrate that our model achieves excellent performance, securing state-of-the-art results in five out of six involved tasks. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.
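As a rough illustration, the checkpoint linked above can typically be loaded with Hugging Face transformers. This is only a loading sketch under assumptions: repositories like this ship custom modeling code (hence trust_remote_code=True), the choice of AutoModelForCausalLM is a guess, and the actual inference interface (in particular, how images are passed in) is defined by the repository's own code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "AI4Chem/ChemVLM-26B"

# trust_remote_code=True lets transformers run the repo's custom
# modeling code; which Auto class applies is determined by that code,
# so AutoModelForCausalLM here is an assumption.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,  # a 26B model needs reduced precision
    trust_remote_code=True,
).eval()
```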