
CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay (2402.04858v2)

Published 7 Feb 2024 in cs.AI, cs.CL, and cs.LG

Abstract: LLMs are increasingly solving tasks that are commonly believed to require human-level reasoning ability. However, these models still perform very poorly on benchmarks of general intelligence such as the Abstraction and Reasoning Corpus (ARC). In this paper, we approach ARC as a programming-by-examples problem, and introduce a novel and scalable method for LLM self-improvement called Code Iteration (CodeIt). Our method iterates between 1) program sampling and hindsight relabeling, and 2) learning from prioritized experience replay. By relabeling the goal of an episode (i.e., the target program output given input) to the realized output produced by the sampled program, our method effectively deals with the extreme sparsity of rewards in program synthesis. Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data-augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines. Our code is available at https://github.com/Qualcomm-AI-research/codeit .

Authors (7)
  1. Natasha Butt (2 papers)
  2. Blazej Manczak (3 papers)
  3. Auke Wiggers (13 papers)
  4. Corrado Rainone (25 papers)
  5. Michaël Defferrard (13 papers)
  6. Taco Cohen (36 papers)
  7. David W. Zhang (13 papers)
Citations (14)

Summary

  • The paper introduces CodeIt, a self-improving approach built on prioritized hindsight replay that solves 15% of ARC evaluation tasks, a new state of the art.
  • It utilizes a two-stage process of program sampling with hindsight relabeling and prioritized experience replay to refine model outputs.
  • Ablation studies confirm that components like the ExIt mechanism and replay buffer significantly boost the model's problem-solving efficiency.

Introduction

The domain of general AI often grapples with creating models that exhibit human-like intelligence across varied cognitive tasks. One benchmark for measuring general intelligence in AI systems is the Abstraction and Reasoning Corpus (ARC), a collection of tasks designed to probe the fluid intelligence and problem-solving capabilities of humans. Each ARC task presents a small set of input-output grid pairs, from which the solver must infer the rule governing the transformation of inputs into outputs and apply it to new inputs. The ARC challenge has proven formidable, with state-of-the-art approaches making only incremental progress, and the performance of AI systems on ARC has remained well below human capabilities.

CodeIt Methodology

In contrast to traditional methods, a recent approach, dubbed CodeIt, offers a scalable self-improvement strategy that lets LLMs tackle such complex tasks. CodeIt iterates between two stages: program sampling with hindsight relabeling, and learning from prioritized experience replay.
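
A minimal sketch of this two-stage loop follows; every name in it (`sample_programs`, `run_program`, `ReplayBuffer`, `train_on`) is an illustrative stand-in for this summary, not the paper's actual code:

```python
import random

def sample_programs(task, n=8):
    """Stub for the LLM policy proposing n candidate programs for a task."""
    return [lambda grid: grid for _ in range(n)]  # placeholder: identity programs

def run_program(program, grid):
    """Execute a candidate; return None on failure (stands in for a timeout)."""
    try:
        return program(grid)
    except Exception:
        return None

class ReplayBuffer:
    def __init__(self):
        self.episodes = []

    def add(self, inputs, outputs, program):
        self.episodes.append((inputs, outputs, program))

    def sample(self, k):
        return random.sample(self.episodes, min(k, len(self.episodes)))

def train_on(batch):
    """Stub: fine-tune the policy to predict each program from its I/O pairs."""
    pass

tasks = [{"inputs": [[[1, 0], [0, 1]]]}]  # toy stand-in for ARC tasks
buffer = ReplayBuffer()

for iteration in range(3):
    # Stage 1: sample programs and relabel each goal with the realized outputs,
    # so even "failed" samples become valid (input, output, program) examples.
    for task in tasks:
        for prog in sample_programs(task):
            realized = [run_program(prog, g) for g in task["inputs"]]
            if all(out is not None for out in realized):
                buffer.add(task["inputs"], realized, prog)  # hindsight goal
    # Stage 2: learn from replayed experience (prioritization omitted here).
    train_on(buffer.sample(k=32))
```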

The core idea is to reframe ARC tasks as programming-by-examples problems, letting the model generate programs whose outputs match the example outputs. Initial training uses ground-truth programs written in a domain-specific language (DSL); mutating these programs provides data augmentation and builds the model's familiarity with the DSL syntax.
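
To make the programming-by-examples framing concrete, here is a toy program built from hypothetical grid primitives in the spirit of an ARC-style DSL; the paper uses an existing ARC DSL whose primitives differ, so `rotate90`, `recolor`, and the mutation step below are illustrative only:

```python
import random

def rotate90(grid):
    """Rotate a grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def recolor(grid, src, dst):
    """Replace every cell of color src with color dst."""
    return [[dst if c == src else c for c in row] for row in grid]

def program(grid):
    """A candidate solution: rotate, then recolor 1 -> 2."""
    return recolor(rotate90(grid), 1, 2)

example_input = [[1, 0],
                 [0, 1]]
print(program(example_input))  # [[0, 2], [2, 0]]

# Mutation-based augmentation: perturb an argument to derive a new program,
# yielding extra (input, output, program) training triples in the same DSL.
def mutated(grid):
    return recolor(rotate90(grid), 1, random.choice([3, 4, 5]))
```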

During the sampling stage, candidate programs are written by a pretrained LLM policy. Samples that fail to execute or exceed a time budget are discarded, while the rest are stored in a replay buffer. These buffered samples feed the learning stage, ensuring a continual stream of experience for the model's training regime; prioritized replay then focuses training on the most informative samples and mitigates catastrophic forgetting.
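
One plausible way to realize the prioritized draw is weighted sampling over buffer entries. The weights below, which favor fresher policy-generated samples over seed data, are an assumption made for illustration and not the paper's actual scheme:

```python
import random

def sample_prioritized(episodes, priorities, k):
    """Draw k episodes (with replacement) proportionally to priority weight."""
    return random.choices(episodes, weights=priorities, k=k)

# Toy buffer: higher weight means the episode is replayed more often.
episodes = ["mutated-seed-program", "policy-sample-iter1", "policy-sample-iter2"]
priorities = [1.0, 2.0, 4.0]  # assumed weighting favoring fresh policy samples

print(sample_prioritized(episodes, priorities, k=2))
```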

Experimental Results

When CodeIt was deployed on the full ARC evaluation dataset, it solved 15% of the tasks, setting a new state of the art and surpassing the prior best neural and symbolic methods. A closer examination of the discovered programs revealed that they were more concise and diverse than baseline alternatives, suggesting not only efficiency but also a style of program generation closer to human reasoning.

Impact of CodeIt Components

A series of ablation studies dissected the influence of CodeIt's components on task performance. The results affirmed that each component (the ExIt mechanism, hindsight relabeling, and prior knowledge from pretrained models) plays a significant role in overall performance. Together, these components allow the system to keep refining its solutions, converging on more efficient programs over time.

Conclusion

The paper shows that CodeIt's self-improving loop, which collects and learns from its own execution experience while drawing on prior knowledge sources, enables LLMs to make tangible progress on benchmarks such as ARC. This result offers an optimistic outlook for the trajectory of neuro-symbolic AI systems toward, and potentially beyond, human-level reasoning. As AI makes strides in generalized intelligence, the CodeIt method stands as a testament to the fruitful union of experience-based learning and prior knowledge drawn from human-like reasoning patterns.