CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu
2024-08-30

Summary
This paper introduces CogVLM2, a new family of visual language models designed to help computers understand images and videos by combining visual and language information.
What's the problem?
As technology advances, we need models that can not only process text but also understand images and videos in a meaningful way. Existing models often struggle to effectively combine visual data with language, making it hard for them to perform well in tasks that require both types of information.
What's the solution?
CogVLM2 improves on previous models by inheriting the visual expert architecture with better pre-training and post-training recipes, supporting high-resolution images (up to 1344x1344 pixels) and analyzing multiple video frames together with their timestamps. The family includes CogVLM2 and GLM-4V for images and CogVLM2-Video for videos. These models have been trained extensively and achieve state-of-the-art results on multiple benchmarks; a minimal usage sketch is shown below.
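Since the models are open-sourced, here is a minimal image-understanding inference sketch in the spirit of the demos in the THUDM/CogVLM2 repository. The checkpoint name (`THUDM/cogvlm2-llama3-chat-19B`) and the `build_conversation_input_ids` helper exposed by the model's remote code are assumptions based on that repository, so treat this as an illustration rather than the definitive usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name from the THUDM/CogVLM2 repository (an assumption).
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()

image = Image.open("example.jpg").convert("RGB")
query = "Describe this image."

# build_conversation_input_ids is provided by the model's remote code in the
# CogVLM/CogVLM2 repositories (assumed here); it packs the text and image
# into the format the visual-expert architecture expects.
packed = model.build_conversation_input_ids(
    tokenizer, query=query, images=[image], template_version="chat"
)
inputs = {
    "input_ids": packed["input_ids"].unsqueeze(0).to(DEVICE),
    "token_type_ids": packed["token_type_ids"].unsqueeze(0).to(DEVICE),
    "attention_mask": packed["attention_mask"].unsqueeze(0).to(DEVICE),
    "images": [[packed["images"][0].to(DEVICE).to(torch.bfloat16)]],
}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
    output = output[:, inputs["input_ids"].shape[1]:]  # keep only the new tokens
print(tokenizer.decode(output[0], skip_special_tokens=True))
```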
Why it matters?
This research is significant because it enhances how machines interpret complex visual information alongside language, which can improve applications in areas like automated image captioning, video analysis, and interactive AI systems. By making these tools more effective, we can create smarter technologies that better understand the world around us.
Abstract
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344×1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
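To illustrate the "multi-frame input with timestamps" idea from the abstract, the sketch below uniformly samples frames from a video and records each frame's timestamp so the two can be passed to a video model together. This is a generic OpenCV-based preparation step written for this summary, not the paper's actual pipeline; the frame count and the timestamp formatting are arbitrary choices.

```python
import cv2  # OpenCV for video decoding


def sample_frames_with_timestamps(video_path: str, num_frames: int = 24):
    """Uniformly sample frames and return (frames, timestamps_in_seconds).

    A generic sketch of timestamped multi-frame input preparation; the real
    CogVLM2-Video pipeline may sample and encode frames differently.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]

    frames, timestamps = [], []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # convert to RGB
        timestamps.append(idx / fps)  # seconds from the start of the video
    cap.release()
    return frames, timestamps


frames, timestamps = sample_frames_with_timestamps("example.mp4")
# Timestamps can then accompany the frames, e.g. as text in the prompt
# (an assumption about the interface, not the paper's exact format).
print(f"Sampled {len(frames)} frames at: " + " ".join(f"{t:.1f}s" for t in timestamps))
```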