ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li

2026-04-08

Summary

This paper tackles the problem of automatically choosing the best code generated by large language models (LLMs) when multiple candidate solutions are available. It focuses on evaluating the code with tests that are themselves generated by LLMs, while recognizing that these tests aren't always reliable.

What's the problem?

When LLMs generate code, you often get several possible solutions. To pick the best one, people run tests to see which code works. However, if the tests themselves are flawed – sometimes rejecting correct code or accepting incorrect code – it's hard to know which candidate is *actually* the best. Previous methods either treated all tests as equally trustworthy or used simple rules to try to filter out bad tests. But judging whether a test is correct requires knowing which code is correct in the first place, creating a circular dependency.

What's the solution?

The researchers realized they didn't need to know whether a test was absolutely correct, only whether it was *consistent* with how the other tests rank the code candidates. Their procedure evaluates each test by temporarily removing it, ranking the code candidates by their aggregate scores on all the *other* tests, and then checking whether the removed test's pass/fail results agree with that ranking. This agreement is formalized as the leave-one-out AUC (LOO-AUC). Building on it, they developed two methods, ACES-C and ACES-O, that assign weights to tests, giving more importance to tests that consistently distinguish good code from bad. Both methods are efficient and only need to know which code passes or fails each test.
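To make the leave-one-out idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function name `loo_auc` is invented here, and the tie handling (ties count as 0.5) and the 0.5 default for uninformative tests are my choices.

```python
import numpy as np

def loo_auc(P):
    """Score each test's consistency via leave-one-out AUC.

    P is a binary pass matrix of shape (n_tests, n_codes):
    P[t, c] == 1 iff code candidate c passes test t.
    Returns one LOO-AUC value per test.
    """
    n_tests, _ = P.shape
    aucs = np.full(n_tests, 0.5)       # uninformative tests default to 0.5
    total = P.sum(axis=0)              # each code's total pass count
    for t in range(n_tests):
        scores = total - P[t]          # aggregate score on the remaining tests
        pos = scores[P[t] == 1]        # codes the held-out test passes
        neg = scores[P[t] == 0]        # codes the held-out test fails
        if len(pos) == 0 or len(neg) == 0:
            continue                   # test passes all or no codes: no signal
        # AUC = fraction of (pos, neg) pairs ranked in agreement, ties as 0.5
        wins = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        aucs[t] = (wins + 0.5 * ties) / (len(pos) * len(neg))
    return aucs
```

A test whose pass/fail pattern agrees with the consensus ranking gets an AUC near 1, while a test that contradicts it (e.g. one that only incorrect candidates pass) gets an AUC near 0 and would receive little weight.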

Why it matters?

This work is important because it provides a more reliable way to automatically select the best code generated by LLMs. By focusing on test consistency rather than absolute correctness, it breaks a fundamental circular dependency in automated code evaluation and improves performance on standard code generation benchmarks, meaning LLMs can produce more useful and accurate code with less human intervention.

Abstract

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
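Once each test carries a consistency weight, candidate selection reduces to a weighted vote over the same binary pass matrix. The sketch below is hypothetical: the helper name `select_with_weighted_tests` and the example weights are illustrative, not the paper's closed-form ACES-C weights.

```python
import numpy as np

def select_with_weighted_tests(P, weights, k=1):
    """Rank code candidates by weighted test votes and return the top k.

    P is a binary pass matrix (n_tests, n_codes); weights is one
    nonnegative consistency weight per test (e.g. derived from LOO-AUC).
    """
    scores = weights @ P          # per-code weighted pass count
    order = np.argsort(-scores)   # highest-scoring candidates first
    return order[:k]
```

The point of the weighting: with uniform weights, two flawed tests that agree with each other can outvote a single reliable test, while downweighting the inconsistent tests lets the reliable one decide.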