QueST: Incentivizing LLMs to Generate Difficult Problems
Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei
2025-10-21
Summary
This paper introduces QueST, a new method for generating difficult coding problems to train large language models (LLMs). It addresses the shortage of challenging coding data available for training such models.
What's the problem?
Large language models are getting really good at things like coding and math, but they need a lot of practice problems to improve. The problem is that there aren't enough *hard* coding problems for them to learn from. Existing competitive coding datasets contain only thousands to tens of thousands of problems, and previous attempts to create more either just rewrite existing easy problems or pick out hard ones that humans already wrote. This limits how far these models can scale and truly master complex coding tasks.
What's the solution?
The researchers developed QueST, a system that automatically generates challenging coding problems. It works in two main steps: first, it intelligently samples combinations of problem concepts to make the generated problems harder (difficulty-aware graph sampling), and second, it refines the problem generator itself by fine-tuning it only on the generated problems that turn out to be genuinely difficult (difficulty-aware rejection fine-tuning). They then used these generated problems to train smaller models in two ways: by having stronger teacher models demonstrate solutions for the smaller ones to imitate (distillation), or by letting the smaller models learn through trial and error (reinforcement learning).
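The second step, keeping only the problems that are actually difficult, can be sketched as a simple rejection filter. This is a minimal illustration under assumed details, not the paper's implementation: the `solver` callable, `measure_difficulty` estimator, and failure-rate threshold are all hypothetical stand-ins for how difficulty might be judged.

```python
def measure_difficulty(problem, solver, attempts=8):
    """Estimate difficulty as the fraction of solve attempts that fail.
    `solver` is a hypothetical callable that tries to solve `problem`
    and returns True if its solution passes the problem's tests."""
    failures = sum(1 for _ in range(attempts) if not solver(problem))
    return failures / attempts

def rejection_filter(problems, solver, threshold=0.75):
    """Keep only problems the solver fails on most of the time; only
    these 'hard' problems would be used to fine-tune the generator."""
    return [p for p in problems
            if measure_difficulty(p, solver) >= threshold]
```

In practice the filtered problems would then serve as fine-tuning targets for the generator, so that it learns to produce more problems like the ones that survived the filter.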
Why it matters?
This work is important because it provides a way to create a virtually unlimited supply of difficult coding problems. By training on these problems, even smaller models can match much larger ones: an 8-billion-parameter model trained this way reaches the performance of a 671-billion-parameter model. This means we can build powerful coding assistants without needing massive amounts of computing power or relying solely on human-created data, ultimately pushing the boundaries of what LLMs can do in competitive coding and reasoning.
Abstract
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.