
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Bohan Lyu, Siqiao Huang, Zichen Liang

2025-02-18


Summary

This paper introduces SURGE, a new benchmark for testing whether large language models (LLMs) can predict what a piece of code will do without actually running it. It focuses on how well these models can act as 'surrogate code executors' by analyzing and reasoning about programs rather than executing them.

What's the problem?

While LLMs are good at generating and understanding code, we don't know how well they can predict the behavior or output of a program without running it. This is important because running code can be slow, expensive, or even risky in some cases. Current methods for testing this ability are limited and don't cover all the challenges that real-world programming presents.

What's the solution?

The researchers created SURGE, a benchmark that tests LLMs on eight types of programming tasks, including analyzing buggy code, reasoning about competition-level algorithms, and predicting the behavior of programs that depend on specific compilers or execution environments. They evaluated several open-source and proprietary LLMs on how well they could predict code behavior, and they studied how model size and the amount of training data affect accuracy. They also categorized common errors in the models' predictions and suggested ways to improve them.
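
To make the idea of surrogate execution concrete, here is a minimal sketch of how such an evaluation could be scored: the model is asked to predict a program's output, and the prediction is compared against the output obtained by actually running the code. This is an illustration under assumptions, not the paper's harness; the prompt wording and the predict_output stub are placeholders for whatever model and client you plug in.

```python
# Minimal sketch of scoring a surrogate code executor (illustrative only).
# The prompt text and predict_output() stub are assumptions, not SURGE's code.
import subprocess
import sys

PROMPT_TEMPLATE = (
    "You are a surrogate code executor. Without running it, predict the exact "
    "stdout of the following Python program.\n\n{code}\n\nOutput:"
)


def predict_output(code: str) -> str:
    """Placeholder for an LLM call that predicts the program's stdout.

    A real evaluation would send PROMPT_TEMPLATE.format(code=code) to an
    open-source or proprietary model and return its completion verbatim.
    """
    raise NotImplementedError("plug in your LLM client here")


def actual_output(code: str) -> str:
    """Ground truth: run the program in a subprocess and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=10
    )
    return result.stdout


def surrogate_matches(code: str) -> bool:
    """Score one example by exact match on normalized stdout."""
    return predict_output(code).strip() == actual_output(code).strip()


if __name__ == "__main__":
    sample = "print(sum(i * i for i in range(10)))"  # ground truth is 285
    print("ground truth:", actual_output(sample).strip())
```

Exact match on standard output is just one plausible metric; the benchmark's actual tasks (e.g., repository-level analysis or proof verification) would require task-specific checks.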

Why it matters?

This matters because if LLMs can reliably predict what code will do without running it, they could save time and resources for programmers. This capability could also make AI tools safer by reducing the risk of running harmful or incorrect code. By providing a detailed evaluation framework, this research helps pave the way for better AI systems that can assist with complex coding tasks.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at https://github.com/Imbernoulli/SURGE.