Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang

2025-05-06

Summary

This paper introduces GVM-RAFT, a new method for training AI models that solve problems by reasoning through each step, making their answers more accurate while using less compute.

What's the problem?

When an AI model is trained to reason step by step, a lot of computing power is spent sampling answers evenly across all problems, regardless of how much each one actually helps. This wastes resources, slows down learning, and can hurt accuracy.

What's the solution?

The researchers developed a method that dynamically adjusts how much sampling effort is spent on each problem, directing more compute to the problems where training signals are noisiest. By focusing resources where they reduce noise the most, the model learns faster and gives better answers.
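The core idea of spending more sampling effort where training is noisiest can be sketched in a few lines. The snippet below is an illustrative sketch, not the paper's exact GVM-RAFT algorithm: the function name, the square-root (Neyman-style) allocation rule, and the `min_samples` floor are all assumptions introduced for the example.

```python
import math

def allocate_samples(variance_estimates, total_budget, min_samples=1):
    """Split a sampling budget across problems in proportion to the
    square root of each problem's estimated gradient variance
    (a Neyman-style allocation sketch, not the paper's exact rule)."""
    weights = [math.sqrt(max(v, 0.0)) for v in variance_estimates]
    total_w = sum(weights)
    if total_w == 0.0:
        # No variance signal: fall back to an even split.
        return [total_budget // len(variance_estimates)] * len(variance_estimates)
    # Noisier problems get more rollouts; every problem keeps a minimum.
    return [max(min_samples, round(total_budget * w / total_w))
            for w in weights]

# A high-variance problem receives far more rollouts than a low-variance one.
budget = allocate_samples([4.0, 1.0, 0.0], total_budget=12)
```

Because of rounding and the per-problem minimum, the allocations may not sum exactly to the budget; a real implementation would re-normalize, but the intuition, noisy problems get more samples, is the same.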

Why it matters?

This matters because it makes AI models better at solving complex problems, saving time and energy, and helping them become more reliable for things like tutoring, research, and decision-making.

Abstract

GVM-RAFT is a dynamic sampling strategy for chain-of-thought reasoning in large language models that improves convergence speed and accuracy by adaptively allocating computational resources to minimize gradient variance.