Is Programming by Example solved by LLMs?

Wen-Ding Li, Kevin Ellis

2024-06-28

Summary

This paper examines how well large language models (LLMs) solve Programming by Example (PBE) tasks, in which an algorithm must be generated from input-output examples. The authors investigate whether LLMs can perform PBE effectively and how their performance can be improved.

What's the problem?

Programming by Example is an approach in which users provide examples of inputs and their expected outputs, and the system generates code that reproduces those outputs. While LLMs have shown success at generating code, it is unclear whether they can handle PBE tasks reliably, especially when the problems are unlike those seen during training. This raises concerns about their effectiveness in real-world applications, where examples can vary widely.
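
To make the task format concrete, here is a minimal, hypothetical string-manipulation PBE instance (the example pairs and the candidate program are illustrative only and do not come from the paper): the system is given input-output pairs and must produce a program that is consistent with all of them.

```python
# Hypothetical PBE task: each pair gives an input and the expected output.
examples = [
    ("john smith", "J.S."),
    ("ada lovelace", "A.L."),
    ("alan turing", "A.T."),
]

# One candidate program a PBE system might synthesize for these examples.
def candidate(name: str) -> str:
    first, last = name.split()
    return f"{first[0].upper()}.{last[0].upper()}."

# A candidate counts as a solution only if it reproduces every given output.
assert all(candidate(inp) == out for inp, out in examples)
```

The key point is that the user supplies only the input-output pairs; the program itself must be synthesized by the system.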

What's the solution?

The authors ran experiments with LLMs on classic PBE domains involving lists and strings, as well as a more challenging graphics programming domain that is poorly represented in typical pretraining data. They found that pretrained models struggled with PBE, but that fine-tuning improved performance substantially, provided the test problems were in-distribution (similar to the training data). They also analyzed empirically why these models succeed or fail and explored ways to improve generalization to new types of problems.
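
At a high level, applying an LLM to PBE usually means sampling many candidate programs and keeping only those that reproduce every example, since consistency with the examples can be checked mechanically. The sketch below illustrates that generate-and-check loop; `sample_candidates` is a hypothetical placeholder for querying a pretrained or fine-tuned model, not the authors' actual implementation.

```python
from typing import Callable, Iterable

Example = tuple[str, str]  # (input, expected output)

def solves(program: Callable[[str], str], examples: list[Example]) -> bool:
    """A candidate counts as correct only if it matches every example."""
    try:
        return all(program(inp) == out for inp, out in examples)
    except Exception:
        return False  # candidates that crash are simply rejected

def search(examples: list[Example],
           sample_candidates: Callable[[list[Example]], Iterable[Callable[[str], str]]],
           budget: int = 64) -> Callable[[str], str] | None:
    """Draw up to `budget` candidate programs from the model and return the
    first one that is consistent with all of the examples."""
    for i, program in enumerate(sample_candidates(examples)):
        if i >= budget:
            break
        if solves(program, examples):
            return program
    return None  # no consistent program found within the sampling budget
```

Because every sample can be validated against the examples, raising the fraction of consistent samples (for instance, through fine-tuning) translates directly into more tasks solved within a fixed sampling budget.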

Why it matters?

This research is important because it highlights both the progress and limitations of LLMs in solving PBE tasks. By understanding how LLMs can be improved through fine-tuning and identifying their weaknesses, researchers can develop better programming tools that are more flexible and applicable in various scenarios. This could lead to more user-friendly systems for coding that help people generate algorithms without needing deep programming knowledge.

Abstract

Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of Large Language Models (LLMs) in code-generation tasks, we investigate here the extent to which LLMs can be said to have 'solved' PBE. We experiment on classic domains such as lists and strings, and an uncommon graphics programming domain not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.