Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation
Jarrod Barnes
2026-02-11
Summary
This paper investigates the best way to get large language models to perform well on tasks where you can directly check if the answer is correct, like optimizing code for GPUs. It questions whether slightly adjusting the model during use (called test-time training) is the most effective approach.
What's the problem?
When using language models to solve tasks with clear right and wrong answers, a common strategy is to tweak the model’s parameters based on the feedback it receives. However, the researchers found that this tweaking often leads to the model becoming too focused on a single, potentially mediocre solution, and it actually performs worse than simply trying multiple different answers. Essentially, the model 'over-sharpens' its focus and misses better options.
What's the solution?
Instead of tweaking the model, the researchers discovered that it’s much better to simply generate many different possible solutions and then intelligently *select* the best one. They found that choosing the answer the model is *least* confident about – the one that 'surprises' it the most – but is still correct, leads to significantly better results. Picking the most surprising correct answer out of a set of possibilities dramatically improved performance, even matching the results of a perfect 'oracle' when considering the top three most surprising answers.
Why it matters?
This research shows that for tasks where you can directly verify the correctness of a solution, it’s more efficient to focus on exploring a diverse range of possibilities and smartly choosing from them, rather than trying to fine-tune the model itself. This is a valuable insight because it suggests a more cost-effective way to utilize powerful language models in areas like code optimization and other 'execution-grounded' tasks where clear feedback is available.
Abstract
Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks, domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps): Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set while TTT's best checkpoint reaches only 30.6% (3-seed mean), with TTT's "equivalent K" falling below 1, worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, a 30% improvement. Extending to surprisal-guided-top3 matches oracle performance at 100%. This zero-cost strategy, validated through length-controlled analysis, recovers oracle performance. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.