Inference Optimal VLMs Need Only One Visual Token but Larger Models
Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
2024-11-06

Summary
This paper studies how to run Vision Language Models (VLMs) efficiently under a fixed inference budget, showing that they can perform well with just one visual token when paired with a larger language model.
What's the problem?
Vision Language Models are powerful tools that combine visual and textual information, but they require a lot of compute, especially when processing many visual tokens from images. This compute demand increases inference latency, making them less practical for real-world applications.
What's the solution?
The authors studied how to balance the number of visual tokens against the size of the language model so as to minimize inference cost while maintaining accuracy. They found that, at a fixed compute budget, using a larger language model with only one visual token often outperforms a smaller model with many tokens. In other words, for visual reasoning tasks it is more efficient to compress the visual input aggressively and spend the saved compute on a larger model.
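As a rough, back-of-the-envelope illustration of why this trade-off can favor the larger model: the model sizes, token counts, and the FLOPs approximation below are assumptions for illustration only, not taken from the paper or its repository.

```python
# Back-of-the-envelope sketch (illustrative numbers, not from the paper):
# prefill compute for a decoder-only LLM is roughly 2 * params * tokens FLOPs.

def prefill_flops(llm_params: float, visual_tokens: int, text_tokens: int = 50) -> float:
    """Approximate prefill FLOPs: ~2 * parameters * total input tokens."""
    return 2 * llm_params * (visual_tokens + text_tokens)

# A small LLM reading a full 576-token image grid (LLaVA-style resolution)...
small = prefill_flops(llm_params=1.4e9, visual_tokens=576)
# ...versus a much larger LLM reading a single compressed visual token.
large = prefill_flops(llm_params=13e9, visual_tokens=1)

print(f"small LLM, 576 visual tokens: {small:.2e} FLOPs")
print(f"large LLM,   1 visual token : {large:.2e} FLOPs")
# Both land in the same ballpark, which is why the compute-optimal choice
# can favor the larger model fed an aggressively compressed visual input.
```

Since the two configurations cost roughly the same to run, the question becomes which one achieves lower error, and the paper's scaling-law analysis indicates it is usually the larger model with the heavily compressed visual input.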
Why it matters?
This research is important because it helps improve the efficiency of VLMs, making them faster and more effective for real-world tasks like image recognition and understanding. By optimizing how these models process information, we can enhance their usability in various applications, such as autonomous vehicles and smart assistants.
Abstract
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference due to the substantial compute required by the LLM to process the large number of input tokens (predominantly from the image). To reduce inference costs, one can either downsize the LLM or reduce the number of input image tokens, the latter of which has been the focus of many recent works on token compression. However, it is unclear what the optimal trade-off is, as both factors directly affect the VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved when using the largest LLM that fits within the inference budget while minimizing visual token count, often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., 5-10x), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.
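As a sketch of what such a scaling law can look like, the form below writes downstream error as a joint power law in LLM parameter count N and visual token count V, paired with a simple cost model. The functional form, the constants (A, B, alpha, beta, E_inf), and the text-token count T_text are illustrative assumptions, not the paper's fitted law or values.

```latex
% Illustrative scaling-law form (assumed, not the paper's fitted law):
% downstream error as a joint power law in LLM parameters N and visual
% token count V, with inference cost proportional to parameters times
% input tokens.
\[
  \mathrm{Err}(N, V) \;\approx\; E_{\infty} + \frac{A}{N^{\alpha}} + \frac{B}{V^{\beta}},
  \qquad
  \mathrm{Cost}(N, V) \;\propto\; N \,\bigl(V + T_{\text{text}}\bigr).
\]
% Inference-optimal regime: minimize Err(N, V) subject to a fixed
% Cost(N, V) budget. If error is much more sensitive to N than to V
% (the A/N^alpha term dominates), the constrained optimum pushes N up
% and V down, which matches the paper's finding that the best VLM under
% a fixed budget pairs the largest LLM with very few (often one) visual
% tokens.
```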