
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

2025-02-12


Summary

This paper shows that AI models can learn to solve complex math and coding problems by focusing on the structure of reasoning rather than on perfectly correct examples. The researchers found that with a specific training method and just 17,000 examples, they could dramatically improve an AI's problem-solving skills. Surprisingly, the exact content of the reasoning steps mattered less than their overall structure and order.

What's the problem?

Teaching AI to solve difficult problems usually requires them to follow long chains of logical steps. However, researchers weren't sure how to effectively train AIs to do this without using massive amounts of data or perfectly crafted examples.

What's the solution?

The team used a smart training approach that focused on teaching the AI the structure of problem-solving rather than memorizing specific solutions. They used efficient training techniques and a relatively small set of 17,000 examples. Importantly, they discovered that even when using examples with some mistakes, the AI still learned well as long as the overall reasoning structure was maintained.
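The distinction between content and structural perturbations can be illustrated with a small sketch. This is not the paper's code; the step delimiter, keyword list, and function names here are assumptions chosen for illustration. It shows the three kinds of edits described above: shuffling steps and deleting steps (structural, which the paper found harmful) versus stripping reflection keywords (content-level, which barely mattered).

```python
import random

def split_steps(cot: str) -> list[str]:
    """Split a chain-of-thought trace into steps.

    Uses blank lines as the step delimiter -- an assumption for this
    sketch; the paper may segment traces differently.
    """
    return [s.strip() for s in cot.split("\n\n") if s.strip()]

def shuffle_steps(steps: list[str], seed: int = 0) -> list[str]:
    """Structural perturbation: reorder steps, breaking logical flow."""
    rng = random.Random(seed)
    shuffled = steps[:]
    rng.shuffle(shuffled)
    return shuffled

def delete_steps(steps: list[str], fraction: float = 0.5,
                 seed: int = 0) -> list[str]:
    """Structural perturbation: drop a fraction of steps, keeping order."""
    rng = random.Random(seed)
    keep = max(1, int(len(steps) * (1 - fraction)))
    kept_idx = sorted(rng.sample(range(len(steps)), keep))
    return [steps[i] for i in kept_idx]

def remove_keywords(steps: list[str],
                    keywords: tuple[str, ...] = ("wait", "alternatively"),
                    ) -> list[str]:
    """Content perturbation: strip reflection keywords, keep step order."""
    out = []
    for step in steps:
        for kw in keywords:
            step = step.replace(kw, "").replace(kw.capitalize(), "")
        out.append(step.strip())
    return out
```

Per the paper's findings, training data passed through `remove_keywords` (or even containing wrong final answers) would still teach reasoning effectively, while data passed through `shuffle_steps` or `delete_steps` would significantly degrade the resulting model.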

Why it matters?

This research matters because it shows we can create smarter AI problem-solvers more efficiently. It means developers don't need perfect training data or huge computing power to make progress. This could lead to better AI tutors, coding assistants, and math-solving tools that are cheaper and faster to develop, potentially making advanced AI capabilities more accessible for various applications.

Abstract

Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers still achieves only 3.2% lower accuracy than one trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper of our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.