Inverse Scaling in Test-Time Compute
Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
2025-07-22
Summary
This paper examines inverse scaling in large reasoning models: in some cases, giving a model more compute at test time makes its performance worse rather than better.
What's the problem?
We generally expect AI models to perform better with more computing resources, but in some cases adding extra reasoning steps or test-time compute causes models to repeat mistakes or fail at tasks they could otherwise handle.
What's the solution?
The authors evaluated models across a range of reasoning lengths and found that increasing test-time compute can degrade performance and amplify problematic reasoning patterns, showing that more thinking isn't always better for AI models.
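The evaluation idea described above can be sketched as a simple loop: query a model at several reasoning budgets and check whether accuracy falls as the budget grows. This is a minimal illustration, not the paper's actual code; `run_model` is a hypothetical stand-in for calling a real reasoning model, stubbed here with canned numbers purely to show the shape of the check.

```python
def run_model(task: str, reasoning_budget: int) -> float:
    """Stub for querying a reasoning model with a token budget for its
    chain of thought. The accuracies below are illustrative only, chosen
    to mimic an inverse-scaling trend; they are not results from the paper."""
    canned = {256: 0.82, 1024: 0.79, 4096: 0.71}
    return canned[reasoning_budget]

def shows_inverse_scaling(task: str, budgets: list[int]) -> bool:
    """Return True if accuracy strictly decreases as the reasoning budget grows."""
    accuracies = [run_model(task, b) for b in sorted(budgets)]
    return all(later < earlier
               for earlier, later in zip(accuracies, accuracies[1:]))

if __name__ == "__main__":
    print(shows_inverse_scaling("example task", [256, 1024, 4096]))  # → True
```

In practice one would average over many task instances per budget and look at the trend rather than requiring a strict monotone decrease, but the core comparison (performance as a function of test-time compute) is the same.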
Why it matters?
This matters because it challenges the assumption that simply giving AI models more computing power will always improve results, and it highlights the need for smarter ways to help models reason without inadvertently making them worse.
Abstract
Evaluation of Large Reasoning Models across different reasoning lengths reveals that increased test-time compute can lead to performance degradation and amplify problematic reasoning patterns.