LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan
2024-11-18

Summary
This paper introduces LLaVA-o1, a new vision-language model designed to improve multimodal reasoning by breaking complex questions down into explicit, sequential reasoning stages.
What's the problem?
While large language models (LLMs) have made great strides in reasoning, current vision-language models (VLMs) often struggle with systematic, structured reasoning, especially when answering complex questions about images. As a result, they perform poorly on tasks that require combining visual and textual understanding across multiple reasoning steps.
What's the solution?
LLaVA-o1 addresses this issue with a structured approach that moves through a fixed sequence of stages: summarization of the question, visual interpretation, logical reasoning, and conclusion generation. Rather than relying on a single chain-of-thought prompt, the model works through each stage autonomously and in order, which keeps its reasoning organized. To train this behavior, the researchers built LLaVA-o1-100k, a dataset that combines samples from several visual question-answering sources with structured reasoning annotations. They also introduced stage-level beam search, an inference-time method that generates multiple candidates for each stage and keeps the best one before moving on (sketched below).
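Below is a minimal sketch of how stage-level beam search might be implemented. It is illustrative only: the four stage names follow the paper's design, but the generate and judge helpers, the tournament-style pairwise comparison, and the beam size are assumptions standing in for whatever the authors actually use.

```python
# Illustrative sketch of stage-level beam search -- not the authors' code.
# Assumptions: `generate` samples n candidate continuations for one stage, and
# `judge` compares two candidates and returns the better one; both are
# hypothetical stand-ins for the model calls the paper relies on.

STAGES = ["summary", "caption", "reasoning", "conclusion"]


def stage_level_beam_search(question, image, generate, judge, beam_size=4):
    """Build the answer one stage at a time, keeping the best candidate per stage."""
    context = ""  # accumulated text of the stages committed so far
    for stage in STAGES:
        # Sample several candidate completions for just this stage.
        candidates = generate(question, image, context, stage, n=beam_size)
        # Pairwise-compare candidates and keep the winner (tournament-style selection).
        best = candidates[0]
        for challenger in candidates[1:]:
            best = judge(question, best, challenger)
        context += best  # commit the winning stage, then move to the next one
    return context
```

The key difference from ordinary best-of-N sampling is that selection happens per stage rather than per full answer, so an early mistake can be discarded before it derails the rest of the response.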
Why it matters?
This research is significant because LLaVA-o1 demonstrates that structured, multi-stage reasoning can outperform conventional prompting on multimodal tasks. With an 8.9% average gain over its base model across multimodal reasoning benchmarks, and better results than larger and closed-source models such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct, LLaVA-o1 shows that open-source models can compete with proprietary ones, paving the way for AI systems that reason more reliably about the visual world.
Abstract
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
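To make the four-stage structure concrete, here is a small parsing sketch. The stage tags mirror the summarization, visual interpretation (caption), reasoning, and conclusion stages described in the abstract, but the exact tag names and the example response are assumptions for illustration, not verified model output.

```python
import re

# Hypothetical tagged response in the style of the four stages the paper describes.
example_response = """
<SUMMARY>The question asks how many apples remain after two are eaten.</SUMMARY>
<CAPTION>The image shows a bowl containing five apples.</CAPTION>
<REASONING>Five apples minus the two that were eaten leaves three.</REASONING>
<CONCLUSION>3</CONCLUSION>
"""


def parse_stages(text: str) -> dict:
    """Extract each stage's content from a tagged, LLaVA-o1-style response."""
    stages = {}
    for tag in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        stages[tag.lower()] = match.group(1).strip() if match else None
    return stages


print(parse_stages(example_response)["conclusion"])  # -> "3"
```

Separating the stages like this is also what makes stage-level beam search possible, since candidates can be generated and compared one tagged segment at a time.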