When Can LLMs Learn to Reason with Weak Supervision?

Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov

2026-04-21

Summary

This paper investigates how well large language models can improve their reasoning skills when they're given limited or imperfect feedback during a learning process called reinforcement learning with verifiable rewards (RLVR). It focuses on figuring out *when* this learning method actually works, and what makes it successful.
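The "verifiable reward" idea can be sketched with a toy grader: instead of a learned reward model, a simple check against a known ground truth produces the training signal. The answer-extraction convention below (last non-empty line) is an illustrative assumption, not something the paper specifies.

```python
def extract_final_answer(completion: str) -> str:
    """Take the last non-empty line of the model's output as its final answer
    (an assumed output format, for illustration only)."""
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answer matches the
    reference answer exactly, else 0.0."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0
```

Because the reward is computed by a deterministic check rather than a model, it is cheap and exact when ground truth is available, which is precisely what becomes scarce in the weak-supervision settings the paper studies.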

What's the problem?

Training these powerful language models to reason effectively requires good 'reward signals' – basically, telling the model when it's doing well or poorly. Creating these signals is hard as models get more complex. The researchers wanted to understand if reinforcement learning can still be effective when the feedback isn't perfect, specifically when there's very little data to learn from, the feedback is often wrong, or the feedback comes from the model trying to evaluate itself. The core issue is figuring out how to get these models to truly *learn* to reason, instead of just memorizing answers.
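Two of the three weak-supervision settings can be sketched as simple transformations of an ideal reward signal. The noise model (random flips of a binary reward) and the self-supervised proxy (majority vote over the model's own samples) are common instantiations chosen here for illustration; the paper's exact constructions may differ.

```python
import random

def noisy_reward(true_reward: float, flip_prob: float, rng: random.Random) -> float:
    """'Noisy rewards' setting: flip a binary reward with probability flip_prob,
    simulating an imperfect grader."""
    return 1.0 - true_reward if rng.random() < flip_prob else true_reward

def majority_vote_proxy(sampled_answers: list[str]) -> str:
    """'Self-supervised proxy' setting: with no ground truth at all, treat the
    model's own most frequent answer across several samples as a pseudo-label
    to reward agreement against."""
    return max(set(sampled_answers), key=sampled_answers.count)
```

The scarce-data setting needs no wrapper: it simply trains on a small number of verified examples, testing whether the model generalizes beyond them.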

What's the solution?

The researchers ran extensive experiments with different language models and reasoning tasks, testing them under those three 'weak supervision' scenarios. They discovered that successful models show a pattern where their performance steadily improves *along with* the reward they receive during training. Models that quickly max out their reward tend to just memorize, not generalize. They also found that a key factor predicting success is 'reasoning faithfulness' – whether the steps the model takes to reach an answer actually support its final answer. To improve performance, they combined two techniques: continuing to pre-train the model on domain-specific data, and then fine-tuning it on examples that show the reasoning process step-by-step. This combination helped a smaller model, Llama3.2-3B-Base, succeed in all the challenging scenarios where it previously failed.

Why it matters?

This research is important because it helps us understand the limits and potential of using reinforcement learning to improve language models. It shows that we don't always need perfect feedback to get good results, but we need to focus on making sure the model is actually learning to reason, not just memorizing. The findings also highlight the importance of carefully preparing the model with both broad knowledge and specific examples of good reasoning, which can make these models more reliable and capable.

Abstract

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.