Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang
2024-11-11

Summary
This paper introduces LaTent Reasoning Optimization (LaTRO), a training-time method that improves the reasoning of large language models (LLMs) without external feedback or a reward model.
What's the problem?
LLMs have become very good at understanding and generating text, but they still struggle with complex reasoning tasks that require multiple steps. Prompt-based methods such as Chain-of-Thought (CoT) can improve reasoning at inference time, but improving reasoning during training remains challenging.
What's the solution?
LaTRO treats reasoning as sampling from a latent distribution and optimizes that distribution with a variational approach. This lets a model improve both its reasoning and its ability to evaluate reasoning quality at the same time, without external feedback or a reward model. The researchers tested LaTRO on the GSM8K and ARC-Challenge datasets. On GSM8K, it improved zero-shot accuracy by an average of 12.5% over the base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B.
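The idea of a model scoring its own sampled reasoning can be illustrated with a small sketch. It is not the authors' implementation: the function names, the example probabilities, and the leave-one-out baseline are assumptions chosen for illustration. Each sampled reasoning path is rewarded with the log-probability the model itself assigns to the correct answer after that path, and rewards are compared across samples for the same question.

```python
import math
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Centre each reward on the mean reward of the *other* samples.

    Assumption: a leave-one-out baseline is one common variance-reduction
    choice for sampled objectives; the summary does not name one.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two sampled reasoning paths")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def lower_bound_estimate(rewards: List[float], kl_penalty: float) -> float:
    """Monte Carlo estimate: mean self-reward minus a KL penalty term."""
    return sum(rewards) / len(rewards) - kl_penalty


# Hypothetical numbers: the probability the model assigns to the correct
# answer after each of three sampled reasoning paths for one question.
answer_probs = [0.9, 0.5, 0.1]

# Self-reward: log-likelihood of the correct answer given the reasoning path.
rewards = [math.log(p) for p in answer_probs]
advantages = leave_one_out_advantages(rewards)

for p, a in zip(answer_probs, advantages):
    print(f"p(answer | question, reasoning) = {p:.1f}  advantage = {a:+.3f}")
```

Reasoning paths that make the correct answer more likely than their siblings get a positive advantage and would be reinforced; the others would be discouraged. No external grader or reward model appears anywhere in the loop, which is the point of the self-rewarding setup.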
Why it matters?
This research is important because it shows that LLMs can be trained to reason more effectively on their own, which could lead to better performance in real-world applications like problem-solving, decision-making, and complex question answering. By unlocking these latent reasoning capabilities, LaTRO could enhance how AI systems assist users in various fields.
Abstract
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.
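To make "sampling from a latent distribution" and "variational approaches" concrete, the standard variational lower bound is sketched below. The notation is assumed for illustration and is not taken from the abstract: x is the question, y the answer, z a latent reasoning path, p(z | x) a prior over reasoning paths, and q(z | x) the distribution the paths are sampled from. The exact objective used by LaTRO is given in the paper and the linked repository.

```latex
\log p(y \mid x)
  = \log \sum_{z} p(z \mid x)\, p(y \mid x, z)
  \;\ge\;
  \mathbb{E}_{z \sim q(\cdot \mid x)}\big[ \log p(y \mid x, z) \big]
  - D_{\mathrm{KL}}\big( q(z \mid x) \,\|\, p(z \mid x) \big)
```

Under this reading, the first term acts as a self-generated reward: a sampled reasoning path is scored by how likely it makes the correct answer under the model itself. The KL term keeps the sampling distribution close to the prior.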