AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen

2025-08-25

Summary

This paper argues that we currently overestimate how good AI models are at programming, and introduces a new, harder benchmark to measure their abilities more accurately.

What's the problem?

Currently, the benchmarks used to measure how well AI models can code are too easy and don't cover the full range of skills a strong programmer needs. The test cases used to grade solutions aren't very good either: they don't always catch incorrect programs or properly challenge the AI. This leads to an overestimation of how well these models can really code, hiding the fact that they still lag behind skilled human programmers.

What's the solution?

The researchers created a new benchmark called AetherCode. It draws problems from premier programming competitions like the International Olympiad in Informatics (IOI) and the International Collegiate Programming Contest (ICPC). They also made sure the test cases for these problems are very thorough, combining automatically generated tests with tests written and checked by human experts, so that grading is both reliable and challenging.
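To make the idea of a "hybrid test suite" concrete, here is a minimal toy sketch (not the authors' actual pipeline; the problem, function names, and test values are all hypothetical). It judges a candidate solution against random auto-generated inputs plus hand-curated stress cases, showing why curated cases matter: a subtly buggy solution can pass every random test and still fail on an extreme input.

```python
import random

def reference(n):
    # Ground truth for a toy problem: the sum 1 + 2 + ... + n,
    # computed exactly with integer arithmetic.
    return n * (n + 1) // 2

def float_candidate(n):
    # A plausible model-written solution with a subtle bug: "/" produces a
    # float, which loses precision once the answer exceeds 2**53. It agrees
    # with the reference on all small inputs.
    return n * (n + 1) / 2

def build_suite(num_random=50, seed=0):
    # Hybrid suite: automatically generated random inputs, plus
    # hand-curated edge/stress cases that random sampling tends to miss.
    rng = random.Random(seed)
    generated = [rng.randint(1, 10**6) for _ in range(num_random)]
    curated = [0, 1, 10**9 + 1]  # boundary values and a huge stress input
    return generated + curated

def judge(candidate, suite):
    # A solution passes only if it matches the reference on every case.
    return all(candidate(n) == reference(n) for n in suite)

suite = build_suite()
print(judge(reference, suite))        # True
print(judge(float_candidate, suite))  # False: only the curated stress case
                                      # (n = 10**9 + 1) exposes the bug
```

The buggy candidate matches the reference on every randomly generated input (answers below 2**53 are exact even as floats), so a purely auto-generated suite would wrongly mark it correct; the expert-curated stress case is what catches it. This mirrors the paper's point that low-quality test cases inflate apparent model ability.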

Why it matters?

AetherCode provides a more realistic way to evaluate AI coding abilities. By using harder problems and better tests, it gives a clearer picture of where AI stands compared to human programmers, which is important for guiding future research and development in this field. It sets a new, higher standard for evaluating these models.

Abstract

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.