Can Large Language Models Reinvent Foundational Algorithms?

Published 7 Apr 2026 in cs.AI | (2604.05716v1)

Abstract: LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our Unlearn-and-Reinvent pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra's or Euclid's algorithm, from an LLM's pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model Qwen3-4B-Thinking-2507 successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for those complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that generative verifier in the reinvention phase plays a critical role in sustaining models' reasoning strength, helping to avoid the "thought collapse" phenomenon. These findings offer insights into both the potential and current limits of LLMs' innovative thinking.

Authors (6)

Summary

The paper demonstrates a novel Unlearn-and-Reinvent pipeline that enables LLMs to intentionally forget and then reconstruct canonical algorithms.
It combines GRPO-based on-policy unlearning with hierarchical hints; the strongest model reinvents 50% of the 10 target algorithms with no hint and 90% at hint level 2.
Test-time reinforcement learning and verifier feedback significantly enhance algorithmic synthesis, though complex algorithms like KMP remain challenging.

Can LLMs Reinvent Foundational Algorithms? — An Expert Analysis

Introduction

This work examines whether LLMs can independently reinvent foundational computer science algorithms after knowledge of them has been explicitly removed by a post hoc unlearning procedure. The proposed Unlearn-and-Reinvent pipeline first ablates the target algorithm from the model's parameters and then probes whether the model can synthesize it anew. This setting separates genuine invention from memorized retrieval and allows a systematic study of LLM-driven algorithmic invention.

Figure 1: The Unlearn-and-Reinvent pipeline structure, comprising an on-policy LLM unlearning phase and a reinvention phase that optionally utilizes hierarchical hints and verifier-mediated RL-based exploration.

Methodology

Unlearning Phase: GRPO-Based On-Policy Algorithmic Erasure

The study avoids the computational cost of retraining LLMs on curated corpora by using an on-policy unlearning approach based on Group Relative Policy Optimization (GRPO). The unlearning objective maximizes forgetting of the target algorithmic knowledge ($\mathcal{D}_{\text{forget}}$) while preserving general model utility via retention sets ($\mathcal{D}_{\text{retain}}$). A cold-start mechanism ensures that reward attribution is non-degenerate at initialization.
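As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own sampling group. The reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are assumptions made for illustration; the paper's exact objective and cold-start procedure are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one forget-set prompt:
# 1.0 means the response did not reveal the target algorithm.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# If every response earns the same reward, all advantages are zero.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

When every response in a group earns the same reward, the advantages vanish and the update carries no signal, which is one plausible reason a cold start would be needed at initialization.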

The reward design prohibits three categories of reward hacking:

Knowledge leakage (either explicit or implicit conceptual echoing)
Algorithm-name hallucination/name corruption
Collapse of response coherence (gibberish outputs)

Empirically, this design achieves near-complete forgetting (forgetting rate $\sim$100%) across the ten target algorithms while maintaining general coding, math, and function-calling capabilities.
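The three reward-hacking guards listed above can be pictured with a toy rule-based reward. This is only a keyword-level stand-in under stated assumptions: the function `unlearning_reward`, its thresholds, and the phrase list are hypothetical, and the paper's actual reward may well rely on a model-based judge rather than string matching.

```python
import re
from difflib import get_close_matches


def unlearning_reward(response, algorithm_name, leak_phrases, min_words=5):
    """Return 1.0 for a coherent response that reveals nothing, else 0.0."""
    words = re.findall(r"[a-z]+", response.lower())

    # Guard 1: coherence collapse (too short, or mostly repeated tokens).
    if len(words) < min_words or len(set(words)) / len(words) < 0.3:
        return 0.0

    # Guard 2: the algorithm name or a near-miss spelling of it.
    if get_close_matches(algorithm_name.lower(), words, n=1, cutoff=0.8):
        return 0.0

    # Guard 3: explicit leakage of the core idea (hypothetical phrase list).
    text = " ".join(words)
    if any(phrase.lower() in text for phrase in leak_phrases):
        return 0.0

    return 1.0


LEAKS = ["priority queue"]  # illustrative only
clean = unlearning_reward(
    "I am not aware of a standard named method for this problem.",
    "dijkstra", LEAKS)
corrupted = unlearning_reward(
    "You can use Dijkstro's method with a heap for this.",
    "dijkstra", LEAKS)
```

A string-matching guard like this cannot catch implicit conceptual echoing, which is why it should be read as a picture of the reward's structure rather than its content.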

Reinvention Phase: Controlled Algorithm Discovery

After unlearning, the LLM is evaluated on its ability to synthesize working solutions to problems classically solved by the unlearned algorithms. Reinvention tasks are posed as code-writing prompts, with candidate code executed in a sandboxed Python interpreter under strict performance and resource constraints and tested against private test cases.
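A minimal sketch of such an evaluation step is shown below, assuming a candidate arrives as Python source defining a named function. The helper `evaluate_candidate`, the test cases, and the time limit are hypothetical, and plain `exec` is not a real sandbox; an actual setup would isolate the process and enforce memory and CPU limits.

```python
import time


def evaluate_candidate(source, func_name, cases, time_limit_s=1.0):
    """Run candidate source against (args, expected) cases."""
    namespace = {}
    try:
        exec(source, namespace)  # NOT a sandbox; illustration only
        func = namespace[func_name]
    except Exception as exc:
        return {"passed": False, "reason": f"load error: {exc!r}"}

    start = time.perf_counter()
    for args, expected in cases:
        try:
            got = func(*args)
        except Exception as exc:
            return {"passed": False, "reason": f"runtime error on {args}: {exc!r}"}
        if got != expected:
            return {"passed": False, "reason": f"wrong answer on {args}: got {got!r}"}
        if time.perf_counter() - start > time_limit_s:
            return {"passed": False, "reason": "time limit exceeded"}
    return {"passed": True, "reason": "all cases passed"}


# Hypothetical candidate for a GCD task, as a model might emit it.
candidate = """
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
"""
private_cases = [((48, 18), 6), ((17, 5), 1), ((0, 9), 9)]
result = evaluate_candidate(candidate, "gcd", private_cases)
```

Returning a short failure reason rather than a bare boolean gives a downstream verifier something concrete to comment on.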

Hierarchical hints provide three levels of external guidance:

No hint (bare task description)
Level 1: Conceptual hint (e.g., high-level strategy)
Level 2: Detailed stepwise outline (sub-algorithmic granularity)
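The three hint levels above could be assembled into prompts roughly as follows. The function `build_prompt` and all hint wording are hypothetical, and whether the paper's level 2 prompt also includes the level 1 hint is not stated in this summary; this sketch shows each level on its own.

```python
HINT_LEVELS = (0, 1, 2)


def build_prompt(task, concept_hint, stepwise_outline, level):
    """Attach no hint, a conceptual hint, or a numbered outline to a task."""
    if level not in HINT_LEVELS:
        raise ValueError(f"unknown hint level: {level}")
    parts = [task]
    if level == 1:
        parts.append("Hint: " + concept_hint)
    elif level == 2:
        steps = "\n".join(
            f"{i}. {step}" for i, step in enumerate(stepwise_outline, start=1)
        )
        parts.append("Outline:\n" + steps)
    return "\n\n".join(parts)


# Illustrative task and hints (not taken from the paper).
TASK = "Write gcd(a, b) returning the greatest common divisor of two non-negative integers."
CONCEPT = "Any common divisor of a and b also divides the remainder of a divided by b."
OUTLINE = [
    "If b is 0, return a.",
    "Otherwise replace (a, b) with (b, a mod b).",
    "Repeat until b is 0.",
]

prompts = {level: build_prompt(TASK, CONCEPT, OUTLINE, level) for level in HINT_LEVELS}
```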

Failures are diagnosed by a generative verifier, which supplies semantic feedback. This setup supports a high-resolution analysis of latent inventiveness and facilitates test-time RL via reward shaping, targeting correctness and execution efficiency.
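The generate, test, and diagnose cycle described here can be sketched as a small loop. Everything below is a stand-in: `reinvention_loop` and the three stub callables are hypothetical, with the stubs replacing the LLM, the sandboxed test run, and the generative verifier respectively.

```python
def reinvention_loop(generate, run_tests, verify, max_rounds=3):
    """Alternate generation and verifier feedback until tests pass."""
    feedback = None
    history = []
    for round_idx in range(1, max_rounds + 1):
        candidate = generate(feedback)
        passed, log = run_tests(candidate)
        history.append((round_idx, passed))
        if passed:
            return candidate, history
        feedback = verify(log)  # diagnosis fed into the next attempt
    return None, history


def stub_generate(feedback):
    """Stand-in for the LLM: wrong first attempt, repaired after feedback."""
    if feedback is None:
        return lambda a, b: min(a, b)

    def repaired(a, b):
        while b:
            a, b = b, a % b
        return a

    return repaired


def stub_run_tests(func):
    """Stand-in for the sandboxed test run."""
    for (a, b), expected in [((48, 18), 6), ((17, 5), 1)]:
        got = func(a, b)
        if got != expected:
            return False, f"gcd({a}, {b}) returned {got}, expected {expected}"
    return True, "ok"


def stub_verify(log):
    """Stand-in for the generative verifier's natural-language diagnosis."""
    return "The result does not divide both inputs. " + log


solution, history = reinvention_loop(stub_generate, stub_run_tests, stub_verify)
```

In this toy run the first attempt fails, the diagnosis is passed back, and the second attempt succeeds.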

Main Empirical Findings

Algorithm Reinvention Capacity and its Boundaries

Qwen3-4B-Thinking-2507, the strongest backbone tested, independently reinvents 50% of the ten targets (e.g., Gray Code, Euclidean GCD, certain shortest path algorithms) with no external hints. Its reinvention rate climbs to 70% with conceptual hints (level 1) and 90% with stepwise guidance (level 2). The remaining failures concentrate in algorithms of high complexity (KMP, Manacher, and Strassen are cited as examples), where even step-by-step hints can fail.
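For a sense of scale at the easy end, Gray code has a one-line reference solution. The sketch below assumes the standard reflected binary Gray code; the exact task statement used in the paper is not given in this summary.

```python
def gray_code(n_bits):
    """Return the reflected binary Gray code sequence for n_bits bits."""
    return [i ^ (i >> 1) for i in range(1 << n_bits)]


codes = gray_code(3)  # [0, 1, 3, 2, 6, 7, 5, 4]

# Defining property: consecutive codes differ in exactly one bit.
one_bit_steps = all(
    bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:])
)
```

The solution is short, but finding the `i ^ (i >> 1)` construction without prior exposure still requires noticing the one-bit-change structure.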

Figure 2: Divergent reinvention trajectories for Dijkstra’s algorithm — one converging on a successful synthesis via iterative repair and verifier feedback, the other falling into a non-productive local minimum.

Effect of External Hints

Reinvention Success Rate (RSR) improves monotonically with hint granularity, particularly for moderately challenging graph problems. However, even the most detailed prompts alone are insufficient for reinventing KMP and Strassen, suggesting that LLMs, post-unlearning, lack the inductive mechanisms for non-local algorithmic synthesis.
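The headline numbers can be checked with simple arithmetic, assuming RSR is the fraction of target algorithms reinvented at a given hint level. The success counts below are derived from the abstract's 50%, 70%, and 90% figures over 10 targets for the strongest model; the paper's precise RSR definition (for example, per-run versus per-algorithm) is an assumption here.

```python
TOTAL_TARGETS = 10

# Derived from the abstract: 50%, 70%, 90% of 10 algorithms.
successes_by_hint_level = {0: 5, 1: 7, 2: 9}


def reinvention_success_rate(successes, total):
    """Fraction of targets reinvented (assumed definition of RSR)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


rsr = {
    level: reinvention_success_rate(n, TOTAL_TARGETS)
    for level, n in successes_by_hint_level.items()
}

levels = sorted(rsr)
is_monotonic = all(rsr[a] <= rsr[b] for a, b in zip(levels, levels[1:]))
```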

Test-Time Reinforcement Learning

Test-time RL over candidate solutions further enhances exploration and solution quality. Notably, it enables successful reinvention of Strassen's algorithm at hint level 2, where static decoding fails.
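To make concrete what has to be rediscovered, the sketch below shows the core of Strassen's algorithm on a 2x2 product: seven multiplications instead of the naive eight. This is the textbook identity, not code from the paper, and the exact task format used in the experiments is not specified in this summary.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using 7 scalar multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


def naive_2x2(A, B):
    """Reference product using 8 scalar multiplications."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product = strassen_2x2(A, B)  # [[19, 22], [43, 50]]
```

Applied recursively to matrix blocks, saving one multiplication per level is what yields the sub-cubic running time, and the seven products are not something local repair of the naive method tends to produce.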

Figure 3: Reward curve for Strassen’s algorithm (level 2), showing reward emergence and convergence to a viable solution post-test-time RL.

Figure 4: Complete RL reward curves across targets; only Strassen (level 2) admits reward signal sufficient for successful optimization.

Sustained Exploration: Role of Generative Verifier

Verifiers substantially mitigate the “thought collapse” phenomenon, where outputs degenerate into brevity and lack of exploration under self-critique. Ablation reveals that rounds with verifier feedback show lengthened interaction traces and higher convergence rates. Oracle verifiers further amplify these effects but the benefit saturates rapidly.
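One simple way to operationalize "thought collapse" is to flag the first round whose output length falls below a fraction of the first round's length. The function `collapse_round`, the 25% threshold, and both length traces below are made up for illustration and are not the paper's measurements or definition.

```python
def collapse_round(lengths, ratio=0.25):
    """Return the first 1-based round whose length drops below
    ratio * (first-round length), or None if no such round exists."""
    if not lengths:
        return None
    baseline = lengths[0]
    for round_idx, length in enumerate(lengths[1:], start=2):
        if length < ratio * baseline:
            return round_idx
    return None


# Hypothetical output lengths (in tokens) per round; illustrative only.
with_verifier = [4000, 3800, 4100, 3900]
without_verifier = [4000, 2500, 900, 300]

collapsed_with = collapse_round(with_verifier)
collapsed_without = collapse_round(without_verifier)
```

With these invented traces, the run with verifier feedback never crosses the threshold, while the run without it is flagged at round 3.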

Figure 5: Evolution of candidate output length over rounds, with and without verifier feedback. Verifiers preserve exploration depth and delay collapse.

Analysis and Theoretical Implications

The results demonstrate that current LLMs, absent explicit memorized knowledge, can reconstruct a significant subset of algorithmic primitives through program search and verifier-aided feedback, but remain confined by the compositional complexity barrier. Only those classes for which the solution space is small or the inductive bias aligns with algorithmic locality are tractable.

Test-time RL acts as a proximal optimizer, capable of exploiting weak prior signals and diagnostic gradients, but does not instantiate fundamentally new algorithmic schemas. In practice, the discovered behavior aligns with path-search in solution space, limited by model-internal heuristics and availability of guiding information.

This technique, parametric algorithm ablation via unlearning followed by structured reinvention, establishes a falsifiable protocol for measuring LLM "innovativeness" under controlled exposure and offers a blueprint for computational creativity studies.

Limitations and Future Work

While the pipeline achieves high empirical forgetting rates, full removal of algorithmic priors cannot be formally certified due to the limits of post hoc unlearning (as opposed to corpus-level retraining). The evaluation is constrained to 10 handpicked algorithms, raising open questions about broader generalizability to theories, scientific law discovery, or open-ended mathematical conjectures.

Future research directions include:

Extending ablation–reinvention protocols to broader classes such as theorem proving and empirical science
Scaling up to larger LLMs with explicit mechanism probing at the representation level
Formalizing the hardness of algorithmic rediscovery as a function of inductive complexity and solution space size

Conclusion

This work shows that open-weight LLMs, after targeted algorithmic unlearning, retain the capacity to synthesize a non-trivial but limited subset of foundational computer science algorithms. Success depends critically on problem structure and on the availability of structured feedback and hints. The boundary between tractable and intractable reinvention appears to be shaped by the model's internal reasoning biases, size, and inferential search capabilities. Guided ablation-reinvention pipelines may become a useful methodology for the systematic study of generative scientific discovery in autonomous AI agents.