
Is Programming by Example solved by LLMs? (2406.08316v3)

Published 12 Jun 2024 in cs.CL, cs.AI, cs.LG, cs.PL, and cs.SE

Abstract: Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of LLMs in code-generation tasks, we investigate here the extent to which LLMs can be said to have "solved" PBE. We experiment on classic domains such as lists and strings, and an uncommon graphics programming domain not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.

An Examination of LLMs in Programming-by-Example

The paper "Is Programming by Example solved by LLMs?" authored by Wen-Ding Li and Kevin Ellis investigates the potential of LLMs to address Programming-by-Example (PBE), a domain that involves creating programs based solely on input-output pairs. PBE has practical applications, such as automating repetitive tasks for a wide array of non-programmer users, as well as theoretical importance within AI for tasks involving few-shot learning and inductive reasoning.

Scope of the Study

The authors evaluate LLMs pretrained for code generation across several PBE domains: list manipulations, string transformations, and graphics programs in a LOGO/Turtle environment. These domains range from common list operations found in typical programming exercises to graphical patterns expressed in a form poorly represented in pretraining data.
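
To make the task concrete, a PBE problem consists of nothing but input-output pairs, and a solution is any program that reproduces them. The following Python illustration is hypothetical and not drawn from the paper's benchmarks:

    # A hypothetical string-transformation PBE task: the system sees only
    # the input-output pairs and must synthesize a program consistent with them.
    examples = [
        ("jane doe", "J.D."),
        ("alan turing", "A.T."),
        ("grace hopper", "G.H."),
    ]

    # One program (written by hand here) that reproduces every example;
    # an LLM-based PBE system would be asked to generate such a function.
    def candidate(s: str) -> str:
        return ".".join(w[0].upper() for w in s.split()) + "."

    assert all(candidate(x) == y for x, y in examples)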

Key Findings

  • Pretrained Performance: In their off-the-shelf pretrained form, LLMs perform poorly on PBE tasks: they cannot reliably generalize from a few input-output examples to the underlying program logic.
  • Fine-tuning Success: Once fine-tuned on PBE data, LLMs improve dramatically, provided the test problems lie close to the fine-tuning distribution. The fine-tuned models surpass established baselines on list manipulation and string transformation tasks, as well as on graphics programs. (Success here means producing a program consistent with all given examples; a sample-and-filter sketch of that check follows this list.)
  • Generalization Limits: Out-of-distribution generalization remains a challenge: fine-tuned models do well on problems resembling the fine-tuning data but degrade when test problems differ significantly from it.
  • Adaptation Potential: The authors propose an adaptation strategy in which the model is iteratively fine-tuned on a small, unlabeled dataset of problems from the target domain. This markedly improves generalization beyond the initial fine-tuning distribution, though it does not close the gap entirely. (A schematic self-training loop appears after the sample-and-filter sketch below.)
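
Success in these evaluations is typically judged by sampling candidate programs from the model and checking each against the given examples. The sketch below captures that sample-and-filter loop; it is an illustration, not the paper's exact harness. sample_programs is a hypothetical helper wrapping LLM sampling, and each candidate is assumed to be Python source defining a function named f:

    from typing import Callable, Iterable, Optional

    def satisfies(program_src: str, examples) -> bool:
        """Check a candidate program (Python source defining a function `f`)
        against every input-output pair; any exception counts as failure."""
        scope: dict = {}
        try:
            exec(program_src, scope)  # NOTE: a real system should sandbox this
            f = scope["f"]
            return all(f(x) == y for x, y in examples)
        except Exception:
            return False

    def solve_pbe(examples,
                  sample_programs: Callable[[int], Iterable[str]],
                  budget: int = 64) -> Optional[str]:
        """Sample-and-filter search: draw up to `budget` candidate programs
        from the model and return the first one consistent with all examples."""
        for src in sample_programs(budget):  # hypothetical LLM sampling helper
            if satisfies(src, examples):
                return src
        return None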

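The adaptation strategy can be read as a self-training loop: the current model attempts unlabeled problems from the target domain, solutions that verify against their examples become pseudo-labels, and the model is fine-tuned on them. The sketch below is a schematic interpretation rather than the authors' exact procedure; generate, consistent, and finetune are hypothetical stand-ins for the model-specific machinery (consistent could reuse satisfies from the previous sketch):

    from typing import Callable, Iterable

    def adapt(model,
              unlabeled_problems: Iterable,
              generate: Callable,    # (model, examples, k) -> iterable of program sources
              consistent: Callable,  # (program_src, examples) -> bool
              finetune: Callable,    # (model, pairs) -> updated model
              rounds: int = 3,
              k: int = 32):
        """Schematic self-training loop: the model pseudo-labels problems it
        can already solve, and fine-tuning on those verified solutions
        nudges it toward the target domain."""
        problems = list(unlabeled_problems)  # allow reuse across rounds
        for _ in range(rounds):
            pseudo_labeled = []
            for examples in problems:
                for src in generate(model, examples, k):  # sample k candidates
                    if consistent(src, examples):         # keep only verified fits
                        pseudo_labeled.append((examples, src))
                        break
            model = finetune(model, pseudo_labeled)       # ordinary fine-tuning step
        return model
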
Implications and Future Directions

The gains from fine-tuning suggest that LLMs can serve as a viable foundation for flexible PBE systems across multiple domains. The research also underscores an important transition: inducing code in a Turing-complete language rather than in the constrained domain-specific languages of classic PBE systems, which broadens both the flexibility and the potential applicability of the approach.

This work suggests substantial future research paths:

  • Optimization of Fine-Tuning Processes: Further refinement of fine-tuning strategies, including better selection of seed datasets and more efficient adaptation methods, will be crucial for improving out-of-distribution performance.
  • Exploration of Smaller Models: Given the computational resources required by large models, there is value in studying how smaller or more efficient models could be used, possibly aided by techniques like model compression or distillation.
  • Real-world Applicability: Evaluating LLM-based PBE solutions in real-world tasks and exploring user interfaces for non-developers to leverage these systems effectively will be necessary steps toward integration into mainstream applications.
  • Theoretical Insights into LLM Behavior: Understanding what aspects of the problem domain and dataset influence LLM performance in PBE will provide insights into model interpretability and reliability.

This paper provides a valuable exploration of the capabilities and limitations of LLMs in PBE tasks, illustrating both promising advancements and areas requiring further research within the intersection of machine learning, program synthesis, and human-computer interaction.

Authors (2)
  1. Wen-Ding Li (19 papers)
  2. Kevin Ellis (31 papers)