S*: Test Time Scaling for Code Generation
Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
2025-02-21
Summary
This paper introduces S*, a new way to make AI models better at writing computer code. It's like giving the AI a smart proofreader that helps it write better code by checking and improving its work multiple times before submitting the final version.
What's the problem?
AI models get better at many tasks when given more time to think at answer time, and this has been studied extensively for math. But for writing code, we haven't really explored how to use this extra thinking time effectively. It's like we've focused on making calculators better at arithmetic while ignoring how to make them better at writing instructions for computers.
What's the solution?
The researchers created S*, which does two main things. First, it generates multiple versions of the code in parallel and then improves each version step by step, like brainstorming several ideas and then refining each one. Second, it uses a clever way to pick among the different versions: it has the AI invent test inputs that would tell the versions apart, then actually runs the code on those inputs to see which version behaves best. This is like having the AI not just write the code, but also test it thoroughly to pick the best version.
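The two stages above can be sketched in a few lines of Python. This is a toy illustration only, not the paper's implementation: the function names (`generate_candidates`, `refine`, `pick_by_execution`) are made up for this sketch, the "model" is stubbed out with hard-coded toy solutions, and the pairwise comparison is simplified to a majority vote over execution outputs.

```python
from collections import Counter

def generate_candidates(problem, n):
    # Stage 1a (parallel scaling): sample n independent drafts.
    # Stubbed here as toy "solutions" for the spec f(x) = 2x.
    return [lambda x: x * 2,   # correct
            lambda x: x + x,   # also correct
            lambda x: x * 3]   # buggy draft

def refine(candidate, public_tests):
    # Stage 1b (sequential scaling): debug each draft against the
    # public tests, feeding execution results back as a repair signal.
    # Stubbed: keep a passing candidate, otherwise pretend the model
    # produced a fixed version.
    if all(candidate(i) == o for i, o in public_tests):
        return candidate
    return lambda x: x * 2  # "repaired" draft

def pick_by_execution(cands, distinguishing_input):
    # Stage 2: adaptive, execution-grounded selection. Run every
    # candidate on an input chosen to tell them apart, and keep one
    # whose output matches the majority behavior (a simplification of
    # the paper's pairwise comparison).
    outputs = [c(distinguishing_input) for c in cands]
    majority = Counter(outputs).most_common(1)[0][0]
    return next(c for c, o in zip(cands, outputs) if o == majority)

public_tests = [(1, 2), (5, 10)]  # toy spec: f(x) = 2x
drafts = generate_candidates("double x", n=3)
refined = [refine(c, public_tests) for c in drafts]
best = pick_by_execution(refined, distinguishing_input=7)
print(best(7))  # 14
```

The key design idea this sketch preserves is that selection is grounded in running the code, not in asking a model to judge text: after refinement, the buggy draft has been repaired, and the distinguishing input exposes which behavior the surviving candidates agree on.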
Why it matters?
This matters because it makes AI much better at writing code, even allowing smaller AI models to outperform larger ones. It helps AI that isn't specifically designed for coding tasks to do better than specialized coding AI in some cases. This could lead to more powerful and efficient coding assistants, making it easier for people to create software and potentially speeding up technological progress in many fields that rely on computer programming.
Abstract
Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information, to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models - GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models - DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available at https://github.com/NovaSky-AI/SkyThought.