- The paper introduces a concept-level memory system that abstracts and modularizes LLM reasoning to enable persistent, reusable, and compositional knowledge.
- It implements structured (Program Synthesis, PS) and flexible (Open-Ended, OE) memory formats with explicit read/write operations to improve inference efficiency and generalization.
- Empirical evaluations on the ARC-AGI-1 benchmark show that ArcMemo-PS outperforms no-memory baselines and scales effectively with continual learning.
ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
Motivation and Problem Setting
LLMs have demonstrated strong performance on reasoning-intensive tasks, but their inference-time discoveries are ephemeral: once the context window is reset, any patterns or strategies uncovered are lost. This contrasts with human problem solving, where abstracted insights and modular concepts are retained and recombined for future tasks. Existing external memory approaches for LLMs have primarily focused on instance-level storage (e.g., query/response pairs or tightly coupled summaries), which limits generalization and scalability. ArcMemo addresses this by introducing a concept-level memory system that abstracts and modularizes reasoning patterns, enabling persistent, reusable, and compositional knowledge for continual test-time learning.
Figure 1: Instance-level memory stores tightly coupled rules, while abstract memory decomposes and modularizes concepts for flexible recombination across tasks.
ArcMemo Framework: Abstraction and Modularity
ArcMemo formalizes external memory as a collection of entries with explicit read and write operations (sketched in code after the list below). The core design axes are:
- Memory Format: What is stored—instance-level vs. abstract, modular concepts.
- Memory Write: How solution traces are abstracted and persisted.
- Memory Read: How relevant concepts are retrieved and integrated for new queries.
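A minimal sketch of this interface in Python; the class and field names and the lexical-overlap scoring are illustrative stand-ins, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ConceptEntry:
    """One unit of concept-level memory (illustrative fields)."""
    situation: str   # cue describing when the concept applies
    suggestion: str  # what to try when it applies

@dataclass
class ConceptMemory:
    """External memory: a collection of entries with read/write operations."""
    entries: list[ConceptEntry] = field(default_factory=list)

    def write(self, new_entries: list[ConceptEntry]) -> None:
        # Persist concepts abstracted from a solution trace.
        self.entries.extend(new_entries)

    def read(self, query: str, k: int = 5) -> list[ConceptEntry]:
        # Retrieve the k entries most relevant to the query.
        ranked = sorted(self.entries,
                        key=lambda e: self._score(e, query),
                        reverse=True)
        return ranked[:k]

    def _score(self, entry: ConceptEntry, query: str) -> float:
        # Naive lexical overlap, standing in for the embedding- or
        # reasoning-based relevance described later.
        q = set(query.lower().split())
        s = set(entry.situation.lower().split())
        return len(q & s) / (len(s) or 1)
```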
The framework supports two memory formats, both illustrated in the sketch after this list:
- Open-Ended (OE): Minimal structure, with each entry stored as a (situation, suggestion) pair. This format is flexible but risks overspecification, i.e., entries that stay tied to the problem they were abstracted from.
- Program Synthesis (PS): Structured, parameterized concepts inspired by software engineering, with type annotations and higher-order functions to promote abstraction and modularity.
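To make the contrast concrete, here is a hedged sketch of the two entry types; the tiling routine is an invented example of a PS-style concept, not one taken from the paper's memory:

```python
from typing import Callable

Grid = list[list[int]]

# Open-Ended (OE): minimal structure, a single (situation, suggestion) pair.
oe_entry = {
    "situation": "the output grid repeats the input in a tiled pattern",
    "suggestion": "try tiling the input and check per-tile reflections",
}

# Program Synthesis (PS): a typed, parameterized routine. Accepting a
# function argument makes the concept higher-order and recombinable.
def flip_h(g: Grid) -> Grid:
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in g]

def tile(grid: Grid, rows: int, cols: int,
         per_tile: Callable[[Grid, int, int], Grid]) -> Grid:
    """Tile `grid` rows x cols times, applying `per_tile` to each copy."""
    h, w = len(grid), len(grid[0])
    out = [[0] * (w * cols) for _ in range(h * rows)]
    for r in range(rows):
        for c in range(cols):
            t = per_tile(grid, r, c)
            for i in range(h):
                for j in range(w):
                    out[r * h + i][c * w + j] = t[i][j]
    return out

# Composition: tile once horizontally, mirroring the second copy.
mirrored = tile([[1, 2]], rows=1, cols=2,
                per_tile=lambda g, r, c: flip_h(g) if c % 2 else g)
assert mirrored == [[1, 2, 2, 1]]
```

The typed signature and the function-valued `per_tile` parameter are what make the PS concept modular: a selector can fill them recursively with other stored concepts.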
Figure 2: ArcMemo’s method diagram highlights abstraction and modularity, with parameterization and typed interfaces enabling compositional reasoning.
Figure 3: OE concepts defer abstraction to the model, while PS concepts enforce structure and parameterization, supporting higher-order behavior and modularity.
Memory Operations: Abstraction, Selection, and Continual Learning
Memory Write: Concept Abstraction
- OE Abstraction: Summarizes solution traces into reusable (situation, suggestion) pairs, using post-hoc derivations to disentangle context and generalize ideas.
- PS Abstraction: Converts solutions into parameterized routines, leveraging pseudocode to focus on high-level operations. The abstraction process is memory-aware, encouraging reuse and revision of existing concepts, and supports higher-order functions for further generalization; a sketch of this write path follows the list.
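A hedged sketch of the OE write path, reusing `ConceptEntry` from the interface sketch above and assuming a generic `llm(prompt) -> str` completion function; the prompt wording and the line format are invented for illustration:

```python
ABSTRACT_PROMPT = """\
You are given a solution trace for a solved puzzle.
Rewrite its key insights as general, reusable concepts, one per line,
in the form: SITUATION: ... || SUGGESTION: ...
Drop puzzle-specific details; generalize wherever possible.

Trace:
{trace}
"""

def abstract_trace(trace: str, llm) -> list[ConceptEntry]:
    """OE-style write: summarize a solution trace into (situation,
    suggestion) pairs via a post-hoc abstraction prompt."""
    raw = llm(ABSTRACT_PROMPT.format(trace=trace))
    entries = []
    for line in raw.splitlines():
        if "||" in line:
            situation, suggestion = line.split("||", 1)
            entries.append(ConceptEntry(
                situation=situation.replace("SITUATION:", "").strip(),
                suggestion=suggestion.replace("SUGGESTION:", "").strip(),
            ))
    return entries
```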
Memory Read: Concept Selection
- OE Selection: Uses a vision-language model (VLM) to caption the input, then matches against stored situations to select the top-k relevant concepts. This preprocessing step aligns abstract domain inputs with memory's textual representations; a read-path sketch follows this list.
- PS Selection: Employs reasoning-based (System 2) selection, where the model explores the problem space, identifies relevant concepts via cues and type annotations, and recursively fills in parameters, supporting compositional assembly of solutions.
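A sketch of the OE read path under the same assumptions; `caption` and `embed` are stand-in callables for the VLM captioner and an embedding model (neither signature is from the paper):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity with a zero-norm guard."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb or 1.0)

def select_oe(puzzle_image, memory, caption, embed, k: int = 5):
    """OE read path: caption the abstract input with a VLM, then rank
    stored situations against the caption by embedding similarity."""
    query = caption(puzzle_image)  # VLM captioning (stand-in callable)
    qv = embed(query)
    scored = [(cosine(qv, embed(e.situation)), e) for e in memory.entries]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [entry for _, entry in scored[:k]]
```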
Continual Concept Learning
ArcMemo supports continual updates at test time, ingesting new solution traces and abstracting additional concepts as evaluation proceeds. This enables self-improvement: memory grows and evolves as new problems are solved. Because earlier solutions shape the memory available for later ones, the system is sensitive to problem order, which introduces an accuracy-throughput trade-off in batched inference.
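A minimal sketch of the continual loop, under the assumption that `solve`, `verify`, and `abstract` wrap LLM-backed steps and that feedback gates what gets written; `problem.description` is a hypothetical field:

```python
def continual_loop(problems, memory, solve, verify, abstract):
    """Test-time continual learning: read relevant concepts before each
    attempt, write newly abstracted ones only after a verified solve."""
    for problem in problems:
        concepts = memory.read(problem.description, k=5)
        solution, trace = solve(problem, concepts)
        if verify(problem, solution):       # e.g., held-out examples or self-reflection
            memory.write(abstract(trace))   # persist only productive patterns
    return memory
```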
Experimental Evaluation
Benchmark and Setup
ArcMemo is evaluated on the ARC-AGI-1 benchmark, which requires inferring transformation rules for input/output pixel grids. The evaluation uses a subset of 100 puzzles from the public validation split, with memory initialized from 160 seed solutions. The main backbone is OpenAI’s o4-mini model, with auxiliary tasks handled by GPT-4.1.
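For reference, a single ARC-AGI task in the public JSON format looks roughly like this (toy grids shown here; colors are encoded as integers 0-9):

```python
# One ARC-AGI task: demonstration pairs plus a held-out test input.
task = {
    "train": [  # input/output pairs demonstrating the transformation rule
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [   # the solver sees only the input and must predict the output
        {"input": [[3, 0], [0, 3]]},
    ],
}
```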
Main Results
Continual Learning
Continual memory updates during evaluation yield further improvements, especially at higher retry depths. This supports the hypothesis that solving more problems and abstracting new patterns unlocks further solutions, a form of self-improvement.
Implementation Considerations
- Memory Format: PS format requires careful design of parameterization and type annotations. Higher-order functions should be supported to maximize abstraction.
- Abstraction Pipeline: For PS, preprocess solutions into pseudocode before abstraction. Use few-shot demonstrations and comprehensive instructions to scaffold memory writing.
- Selection Mechanism: For OE, leverage VLMs for input captioning and semantic matching. For PS, implement a recursive reasoning-based selection that can handle compositional assembly.
- Scalability: Selective retrieval is essential to prevent context window overflow and maintain efficiency as memory grows; a budgeted-read sketch follows this list.
- Feedback Integration: The system requires some form of test-time feedback (e.g., test cases, self-reflection) to ensure only productive patterns are persisted.
- Order Sensitivity: Continual updates introduce dependency on problem order; batching and curriculum strategies may be needed for robustness.
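One way to realize the selective-retrieval point is a token-budgeted read over the `ConceptMemory` sketched earlier; the budget size and the four-characters-per-token heuristic below are assumptions, not values from the paper:

```python
def read_within_budget(memory, query: str, max_tokens: int = 2000):
    """Greedy selective retrieval: take entries in relevance order until
    a rough token budget is exhausted, keeping the prompt bounded as
    memory grows."""
    selected, used = [], 0
    for entry in memory.read(query, k=len(memory.entries)):
        cost = (len(entry.situation) + len(entry.suggestion)) // 4  # ~4 chars/token
        if used + cost > max_tokens:
            break
        selected.append(entry)
        used += cost
    return selected
```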
Implications and Future Directions
ArcMemo demonstrates that abstract, modular memory enables LLMs to persist and reuse reasoning patterns, supporting continual learning and compositional generalization without weight updates. The approach is particularly effective in settings where inference compute is limited, and memory can substitute for redundant exploration. The findings suggest several avenues for future research:
- Hierarchical Memory Consolidation: Developing mechanisms for memory consolidation and hierarchical abstraction to further improve scalability and robustness.
- Order-Robust Update Strategies: Reducing sensitivity to problem order via curriculum learning or batch-aware updates.
- Broader Evaluation: Extending to more challenging benchmarks and real-world domains to quantify upper bounds of memory-augmented reasoning.
- Optimization of Representations: Exploring richer abstraction methods, retrieval strategies, and memory organization for improved performance.
Conclusion
ArcMemo introduces a principled framework for abstract, modular, and continually updatable memory in LLMs, enabling persistent compositional reasoning and test-time self-improvement. The empirical results on ARC-AGI-1 validate the efficacy of concept-level memory, with strong gains over instance-level and cheatsheet-style baselines. The work opens new directions for lifelong learning in LLMs, emphasizing abstraction, modularity, and efficient memory management as key ingredients for scalable, generalizable reasoning systems.