Thinking LLMs: General Instruction Following with Thought Generation
Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
2024-10-15

Summary
This paper introduces a new method for training large language models (LLMs) to think before answering questions, which improves their ability to follow instructions.
What's the problem?
Under the standard alignment framework, most LLMs are trained to respond to user instructions directly, without an explicit thinking step before the answer. This leads to mistakes, especially on complex questions that require reasoning and planning, and the lack of explicit thinking limits the quality of responses these models can produce.
What's the solution?
The authors propose a training method that teaches existing LLMs to write out their thoughts before producing an answer. For each instruction, the model samples several candidate thoughts and responses; a judge model scores only the responses (it never sees the thoughts), and the scores are used to optimize the model via preference optimization. Repeating this search-and-optimize loop lets the model learn how to think effectively without any additional human-annotated thought data.
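To make the loop concrete, here is a minimal sketch of the sampling-and-scoring step, not the authors' actual code. It assumes you supply two callables, `generate` (prompt string in, raw model output out) and `judge_score` ((instruction, response) in, numeric score out); `THOUGHT_PROMPT`, `split_thought_and_response`, and `collect_candidates` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical prompt that asks the model to think first, then answer.
THOUGHT_PROMPT = (
    "Respond to the instruction below. First write out your internal thoughts, "
    "then give your final answer after the line 'Response:'.\n\n"
    "Instruction: {instruction}"
)


def split_thought_and_response(output: str) -> tuple[str, str]:
    """Split raw model output into the hidden thought and the visible response."""
    thought, _, response = output.partition("Response:")
    return thought.strip(), response.strip()


def collect_candidates(
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    instruction: str,
    num_samples: int = 8,
) -> List[Dict]:
    """Sample several thought+response candidates and score only the responses."""
    candidates = []
    for _ in range(num_samples):
        output = generate(THOUGHT_PROMPT.format(instruction=instruction))
        thought, response = split_thought_and_response(output)
        # The judge never sees the thought: thoughts are rewarded purely
        # through the quality of the answers they lead to.
        score = judge_score(instruction, response)
        candidates.append({"thought": thought, "response": response, "score": score})
    return candidates
```

In this sketch the thought text is never shown to the judge, which is the key property the method relies on: thoughts are only useful insofar as they improve the final answer.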
Why it matters?
This research matters because it equips general-purpose LLMs with an explicit thinking step, improving performance on instruction-following benchmarks such as AlpacaEval and Arena-Hard. Notably, the gains are not limited to traditional reasoning and problem-solving tasks: thinking also helps on non-reasoning categories such as marketing, health, and general knowledge, making these models more useful in real-world applications.
Abstract
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning -- but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning & problem-solving tasks.
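The abstract's mention of preference optimization can be illustrated with a small sketch that turns the scored candidates from the sampling loop above into training pairs. The pairing rule shown here (best-scoring response as "chosen", worst-scoring as "rejected") and the function name `build_preference_pair` are assumptions for illustration, not necessarily the authors' exact recipe.

```python
from typing import Dict, List


def build_preference_pair(prompt: str, candidates: List[Dict]) -> Dict:
    """Turn scored candidates into one chosen/rejected pair for preference optimization."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    best, worst = ranked[0], ranked[-1]

    def full_text(candidate: Dict) -> str:
        # The whole thought+response text is preferred or dispreferred, so the
        # model is pushed toward styles of thinking that yield better answers.
        return candidate["thought"] + "\nResponse: " + candidate["response"]

    return {
        "prompt": prompt,
        "chosen": full_text(best),
        "rejected": full_text(worst),
    }
```

Pairs in this format can then be fed to a DPO-style preference optimizer, and the sample-score-optimize cycle repeated over several iterations as the paper describes.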