
VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

2024-06-19

Summary

This paper introduces VoCo-LLaMA, a new method for compressing visual information in vision-language models (VLMs). It aims to improve how these models handle images and videos by shrinking the number of vision tokens they must process while maintaining performance.

What's the problem?

Vision-language models have become very effective at tasks that involve both text and images, but they struggle with high-resolution images and videos: the context window is limited and the computational cost is high, which slows inference. Previous approaches compressed visual data with separate, external modules, and this often discarded important information because those modules did not take advantage of how the language model itself understands vision tokens.

What's the solution?

To address these issues, the authors developed VoCo-LLaMA, which compresses vision tokens directly inside the language model itself. Special Vision Compression (VoCo) tokens are introduced during vision instruction tuning, and the model learns to distill its own understanding of the vision tokens into these VoCo tokens. The result is a compression ratio of 576× with minimal performance loss, up to 94.8% fewer FLOPs, and nearly 70% faster inference. In addition, by continuing training on compressed token sequences of video frames, VoCo-LLaMA learns temporal correlations, which improves its ability to answer questions about videos.
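One way to picture the compression step is as an attention-masking constraint: text tokens are prevented from attending to the raw vision tokens and can only reach that information through the VoCo tokens, so the model is forced to pack what it knows about the image into them. The PyTorch sketch below is a minimal illustration of that idea under an assumed sequence layout; the function name and token counts are hypothetical and not taken from the paper's released code.

import torch

def build_voco_attention_mask(n_vis, n_voco, n_txt):
    # Assumed sequence layout: [vision tokens | VoCo tokens | text tokens].
    n = n_vis + n_voco + n_txt
    # Start from a standard causal (lower-triangular) mask; True = may attend.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Block text queries from attending to vision keys, so image content
    # can only reach the text positions through the VoCo tokens.
    txt_start = n_vis + n_voco
    mask[txt_start:, :n_vis] = False
    return mask

# Example: 576 vision tokens routed through a single VoCo token,
# followed by 32 text tokens.
mask = build_voco_attention_mask(n_vis=576, n_voco=1, n_txt=32)
print(mask.shape)  # torch.Size([609, 609])

Under this reading, the inference-time saving comes from the same structure: once the VoCo tokens' key/value entries are cached, the raw vision tokens can be dropped, so later text tokens attend to a handful of cached tokens instead of hundreds.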

Why it matters?

This research is important because it enhances the efficiency of vision-language models, making them faster and more capable of handling complex tasks involving images and videos. By reducing the amount of data needed without sacrificing quality, VoCo-LLaMA opens up new possibilities for using AI in applications like virtual reality, autonomous driving, and multimedia content creation. This advancement could lead to more accessible and powerful AI tools for various industries.

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves the computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss with a compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed via https://yxxxb.github.io/VoCo-LLaMA-page/.
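For intuition about where the 576× figure comes from, assuming a LLaVA-style setup with a CLIP ViT-L/14 image encoder at 336×336 resolution (an assumption, not something stated above): each image yields 336 / 14 = 24 patches per side, i.e. 24 × 24 = 576 vision tokens, and compressing them into a single VoCo token gives 576 / 1 = 576× compression.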