A Case Study of Web App Coding with OpenAI Reasoning Models
Yi Cui
2024-09-24

Summary
This paper presents a case study evaluating OpenAI's new reasoning models, o1-preview and o1-mini, on coding tasks in comparison with other frontier models. It introduces a new benchmark, WebApp1K-Duo, to evaluate these models under more demanding conditions.
What's the problem?
While OpenAI's reasoning models have shown promise on coding tasks, their performance across a wider range of web application tasks was not fully understood. Existing benchmarks such as WebApp1K cover only single tasks, so the new models needed to be tested against more challenging scenarios to see how well they adapt and perform.
What's the solution?
To address this, the researchers created the WebApp1K-Duo benchmark, which doubles the number of tasks and test cases compared to the earlier benchmark. They found that while the o1 models performed well on the simpler single-task benchmark, their performance dropped significantly on more complex or atypical cases, falling behind Claude 3.5. The study suggests that this variability stems from how well the models comprehend instructions, especially when key expectations are missed.
Why it matters?
This research is important because it helps identify the strengths and weaknesses of OpenAI's reasoning models in coding tasks. By understanding how these models perform under different conditions, developers can make better choices about which model to use for specific applications, ultimately leading to improved tools for coding and software development.
Abstract
This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To probe them further, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the o1 models' performance declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that this performance variability is due to instruction comprehension: the reasoning mechanism boosts performance when all expectations are captured, but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. We therefore argue that the coding success of reasoning models hinges on a top-notch base model and SFT that ensure meticulous adherence to instructions.