Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
Dylan Zhang, Justin Wang, Tianran Sun
2025-02-18
Summary
This paper introduces PoPilot, a new AI system designed for proof-oriented programming, a way of writing computer programs that includes mathematical proofs showing the program is correct. PoPilot is significantly better at this task than previous AI models, including the powerful GPT-4o.
What's the problem?
Current AI language models aren't very good at proof-oriented programming because there isn't enough data to train them properly. There aren't many examples of proof-oriented programming languages like F*, and there aren't many large-scale projects that use this type of programming. This makes it hard for AI to learn how to do this complex task.
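To give a feel for what proof-oriented programming looks like, here is a tiny illustrative F* snippet (our own example, not from the paper). The refinement type on the result forces the verifier to prove, at compile time, that the function always returns a non-negative number:

```fstar
(* Illustrative only: the refinement {r >= 0} is a machine-checked
   guarantee that abs never returns a negative value. *)
val abs : x:int -> r:int{r >= 0}
let abs x = if x >= 0 then x else -x
```

If the proof obligation cannot be discharged, the program simply does not compile, which is why writing such code well requires the intricate reasoning the paper is about.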
What's the solution?
The researchers created PoPilot, a 14-billion-parameter AI model, and trained it in a clever way. They synthesized new examples of proof-oriented programming problems, mixed in a variety of other coding data to help the model learn to reason, and generated new proofs and repairs for code in existing repositories. This taught PoPilot to both create and fix proofs for individual functions and for entire software projects.
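To make the repair-data idea concrete, here is a hypothetical Python sketch of how synthetic repair examples could be produced; this is not the authors' actual pipeline. The `check` function stands in for a real F* verifier, and the mutation (stripping a refinement annotation) is deliberately trivial:

```python
# Hypothetical sketch of repair-data synthesis (not the paper's implementation).
# Idea: start from a verified program, mutate it so verification fails, and
# record (broken program, error message, original fix) as a training triple.
from typing import Optional

def check(program: str) -> Optional[str]:
    """Stand-in for an F* verifier: returns an error message, or None if
    the program 'verifies'. Here we only look for the refinement annotation."""
    if "{r >= 0}" not in program:
        return "could not prove post-condition r >= 0"
    return None

def make_repair_example(verified_program: str) -> Optional[dict]:
    # Mutation: drop the refinement type to produce a failing variant.
    broken = verified_program.replace("{r >= 0}", "")
    error = check(broken)
    if error is None:
        return None  # mutation did not break verification; skip this candidate
    return {"broken": broken, "error": error, "fix": verified_program}

program = "val abs : x:int -> r:int{r >= 0}\nlet abs x = if x >= 0 then x else -x"
example = make_repair_example(program)
```

Each resulting triple pairs a failing program and its verifier feedback with a known-good fix, which is the kind of supervision a model needs to learn proof repair.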
Why it matters?
This matters because proof-oriented programming can make software more reliable and secure by mathematically proving it works correctly. Having an AI that's really good at this could help programmers create better, safer software more quickly. The fact that PoPilot is 64% better than previous top models and can even improve GPT-4o's performance by 54% shows a big leap forward in this area of AI and programming.
Abstract
Existing LMs struggle with proof-oriented programming due to data scarcity, which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process of proof-oriented programming. We present the first work on synthetic data augmentation for project-level proof-oriented programming, covering both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language, incorporating diverse coding data to elicit reasoning capability, and creating new proof and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B parameter model, PoPilot, outperforms GPT-4o in project-level proof-oriented programming by a 64% relative margin, and can improve GPT-4o's performance by 54% by repairing its outputs, compared with GPT-4o's self-repair.