One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li, Yijia Luo, Wenbo Su, Bo Zheng, Pengfei Liu
2026-01-09
Summary
This paper explores how to improve the reasoning skills of large language models (LLMs), the AI systems behind tools like chatbots and writing assistants.
What's the problem?
Currently, getting these language models to *really* reason well requires a huge amount of training data – think thousands of carefully curated examples – which is expensive and time-consuming to collect. The researchers asked whether so much data is truly necessary, or whether a smarter approach could work with far less.
What's the solution?
The researchers developed a technique called 'polymath learning'. Instead of feeding the model thousands of examples, they carefully designed a *single* math-reasoning example that helps the model improve across many different subjects, such as physics, chemistry, and biology. This one well-crafted example – and even a synthetic one they engineered – was surprisingly effective, often outperforming training with much larger datasets.
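To give a flavour of how RL on a single training sample can work at all, here is a toy, self-contained sketch. It is *not* the paper's actual method (the paper trains an LLM; the exact algorithm, sample, and hyperparameters below are all illustrative assumptions): a softmax policy over a few candidate answers is repeatedly updated on one prompt using a REINFORCE-style update with a group-mean baseline.

```python
import math
import random

# Toy illustration (NOT the paper's algorithm): one-sample RL with a
# REINFORCE-style update and a group-mean baseline. The candidate answers,
# reward, and hyperparameters are invented for this sketch.

random.seed(0)

CANDIDATES = ["42", "41", "43", "44"]  # candidate answers for the one sample
CORRECT = "42"                         # gold answer of the single sample
logits = [0.0] * len(CANDIDATES)       # policy parameters (one per answer)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def reward(answer):
    # Binary correctness reward on the single training sample.
    return 1.0 if answer == CORRECT else 0.0

def train_step(lr=0.5, group_size=8):
    # Sample a group of answers for the SAME single prompt, then apply a
    # policy-gradient update with the group's mean reward as baseline.
    probs = softmax(logits)
    idxs = [random.choices(range(len(CANDIDATES)), probs)[0]
            for _ in range(group_size)]
    rewards = [reward(CANDIDATES[i]) for i in idxs]
    baseline = sum(rewards) / len(rewards)
    for i, r in zip(idxs, rewards):
        adv = r - baseline
        for j in range(len(logits)):
            # Gradient of log softmax: indicator minus probability.
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * adv * grad

p_before = softmax(logits)[CANDIDATES.index(CORRECT)]
for _ in range(200):
    train_step()
p_after = softmax(logits)[CANDIDATES.index(CORRECT)]
print(f"P(correct) before: {p_before:.2f}, after: {p_after:.2f}")
```

The point of the sketch is that repeated rollouts on one well-chosen sample still give a usable learning signal; the paper's contribution is showing that, for LLMs, a single strategically designed math sample can transfer that signal across many disciplines.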
Why does it matter?
This research suggests that the key to improving AI reasoning isn't just about throwing more data at the problem. It's about carefully *engineering* the training examples to be high-quality and strategically designed. This 'sample engineering' approach could make it much easier and cheaper to build AI systems that can think and reason more effectively.
Abstract
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on thousands of high-quality samples or more. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary impact. We present three key findings: (1) a single, strategically selected math reasoning sample can produce significant performance improvements with RL across multiple domains, including physics, chemistry, and biology; (2) the math skills salient to reasoning indicate the characteristics of the optimal polymath sample; and (3) an engineered synthetic sample that integrates multidisciplinary elements outperforms training with individual, naturally occurring samples. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed sample engineering, toward precision engineering of training samples rather than simply increasing data volume.