
SE-Bench: Benchmarking Self-Evolution

Updated 9 February 2026
  • SE-Bench is a benchmarking framework that measures self-evolution in AI by testing lifelong learning and true knowledge internalization with obfuscated APIs.
  • It employs a closed-book paradigm on both single-call and multi-call tasks to rigorously assess knowledge retention and compositional generalization.
  • The framework reveals key optimization challenges in supervised fine-tuning, reinforcement learning, and self-play, offering actionable insights for robust AI development.

SE-Bench refers to a family of benchmarks centered on software engineering (SE) tasks, with a particular emphasis on the recently introduced "SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization" (Yuan et al., 4 Feb 2026). This work represents a shift from traditional static knowledge benchmarks to a focus on lifelong learning, the capacity to internalize knowledge, and self-evolving behavior in LLMs and AI coding agents. The SE-Bench ecosystem now encompasses both classical benchmarks for software issue resolution and diagnostic frameworks for evaluating lifelong learning in artificial agents.

1. Historical Context and Motivation

The SE-Bench initiative arises in a landscape dominated by benchmarks such as SWE-Bench and its derivatives (e.g., SWE-Sharp-Bench for C#, Multi-SWE-Bench for multilingual tasks), which have focused on repository-level issue resolution and patch generation. These benchmarks have spurred progress in AI-driven code synthesis and automated bug fixing, providing granular metrics on agent performance in real-world software repositories (Mhatre et al., 4 Nov 2025). However, previous efforts have not directly measured an agent's ability to internalize genuinely novel knowledge—i.e., learning new APIs or conceptual primitives de novo rather than through dataset contamination or context-window retrieval.

The core motivation for SE-Bench is to rigorously diagnose self-evolution: the ability of an autonomous agent to continually update its internal parameters with newly encountered knowledge in a setting where prior training provides no relevant information and inference-time context is unavailable. This problem is complicated by two confounds:

  • Prior knowledge entanglement: distinguishing true learning of novelty from recollection of pre-existing information.
  • Reasoning complexity entanglement: separating failures due to lack of knowledge from failures due to general cognitive or computational limits.

SE-Bench creates a needle-in-the-haystack diagnostic where success is impossible without true internalization and trivial once the mapping is learned, providing an unambiguous testbed for self-evolution (Yuan et al., 4 Feb 2026).

2. Formal Framework and Diagnostic Environment

SE-Bench defines self-evolution formally: an agent $A$ self-evolves if, after parameter updates, its responses to new tasks improve on instances where all external references are absent. Knowledge internalization is the property that a piece of knowledge $K$ is encoded in the weights such that closed-book performance (no context) equals performance with context available.

Obfuscation Protocol

  • Selects 268 NumPy functions, each obfuscated into a pseudo-novel Python package (zwc) with randomized API identifiers.
  • Prevents trivial escapes by encapsulating all arrays in a custom class (ZWCArray) with only the obfuscated interface exposed.
  • Zero-shot baseline performance is rigorously 0%; no agent can succeed without first internalizing the documentation mapping.
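The obfuscation protocol can be sketched as follows. The package name `zwc` and the `ZWCArray` wrapper come from the source above; the wrapper mechanics and the `obfuscate`/`random_name` helpers are illustrative assumptions, not the benchmark's actual implementation.

```python
import random
import string
import numpy as np

class ZWCArray:
    """Opaque wrapper: the underlying ndarray is hidden behind the
    obfuscated interface, so solutions cannot fall back to NumPy."""
    def __init__(self, data):
        self._data = np.asarray(data)

def obfuscate(fn):
    """Wrap a NumPy function so it consumes and produces ZWCArray only."""
    def wrapper(*args, **kwargs):
        raw = [a._data if isinstance(a, ZWCArray) else a for a in args]
        out = fn(*raw, **kwargs)
        return ZWCArray(out) if isinstance(out, np.ndarray) else out
    return wrapper

def random_name(rng, length=8):
    """Randomized API identifier, e.g. 'mynbiqpm'."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

# Build a tiny zwc-style namespace over a handful of NumPy functions;
# the real benchmark does this for 268 functions.
rng = random.Random(0)
zwc = {random_name(rng): obfuscate(fn) for fn in (np.cumsum, np.argsort, np.sort)}
```

Because the randomized names carry no semantic signal, a model with no access to the documentation mapping has no way to guess which obfuscated identifier implements which operation, which is what forces the 0% zero-shot baseline.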

Task Generation

  • Single-function tasks: LLMs generate isolated coding problems solvable by exactly one obfuscated API call, with at least 8 test cases per instance.
  • Multi-function tasks: Randomized composite tasks requiring compositions of at least three distinct zwc API calls.
  • Conservative consensus filtering ensures only tasks solvable by multiple expert models are retained.

Train/Test Partition

  • Training set: 718 single-call tasks (every function appears at least once).
  • Test set: 259 new single-call tasks (knowledge retention) and 440 compositional (multi-call) tasks (compositional generalization).

3. Learning Paradigms and Experimental Methodology

SE-Bench is explicitly designed to compare core agents and optimization regimes under controlled novelty.

Supervised Fine-Tuning (SFT)

  • Trajectory collection phase includes both problem text and relevant docstring.
  • Parameter update phase (closed-book): docstring is removed, enforcing compression of the mapping into model parameters.
  • The SFT loss is

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{t=1}^{T} \log \pi_\theta\bigl(y_t \mid x, y_{<t}\bigr).$$
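For one $(x, y)$ pair this loss is just the negative sum of the per-token log-probabilities the model assigns to the reference solution; in the closed-book variant the docstring is simply absent from $x$. A minimal numeric sketch (the `sft_loss` helper and the example probabilities are illustrative, not from the paper):

```python
import numpy as np

def sft_loss(token_logprobs):
    """SFT loss for a single (x, y) pair: the negative sum of
    log pi_theta(y_t | x, y_<t) over the T target tokens."""
    return -float(np.sum(token_logprobs))

# Per-token probabilities the model assigns to a 3-token reference solution.
probs = np.array([0.5, 0.8, 0.9])
loss = sft_loss(np.log(probs))  # lower is better; 0 iff every token has p = 1
```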

Reinforcement Learning (RL, PPO-based)

  • PPO objective with clipped probability ratios and normalized advantages:

$$\mathcal{L}_{\mathrm{RL}}(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\left[\frac{1}{N}\sum_{i=1}^{N}\rho_i(\theta)\,A_i\right]$$

where $\rho_i(\theta) = \min\bigl(r_i(\theta),\ \mathrm{clip}(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\bigr)$ and $r_i(\theta)$ is the probability ratio between the current and old policies.

  • Rewards: +1 for exact correctness, 0 otherwise.
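The per-sample clipped term can be sketched directly from the formula above (which factors the advantage out of the min; canonical PPO instead takes $\min(r_i A_i, \mathrm{clip}(r_i) A_i)$ per term). The `clipped_term` helper and the numeric example are illustrative assumptions:

```python
import numpy as np

def clipped_term(new_logprob, old_logprob, advantage, eps=0.2):
    """Per-sample contribution rho_i(theta) * A_i, with
    rho_i = min(r_i, clip(r_i, 1 - eps, 1 + eps)) and
    r_i = pi_new / pi_old the probability ratio."""
    ratio = float(np.exp(new_logprob - old_logprob))
    rho = min(ratio, float(np.clip(ratio, 1.0 - eps, 1.0 + eps)))
    return rho * advantage

# A rare, novel token whose probability triples still moves the objective
# only as if it had grown by 1 + eps = 1.2: the clipping bottleneck that
# the ablations in Section 4 point to.
capped = clipped_term(np.log(3.0), 0.0, advantage=1.0)
```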

Self-Play

  • Inspired by strategies like Absolute-Zero, alternates between curriculum self-generation and closed-SFT optimization.
  • Demonstrates that LLMs can bootstrap knowledge via self-generated data, but only closed-SFT injects the mapping into model weights.
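The alternation can be expressed schematically as a single round of self-play. The callables `generate_tasks` and `closed_sft_update` are hypothetical placeholders for the two phases described above; only the open-book/closed-book split is taken from the source:

```python
def self_play_round(model, generate_tasks, closed_sft_update, n_tasks=16):
    """One self-play iteration: the model proposes its own curriculum with
    documentation in context, then weights are updated closed-book
    (documentation stripped from every training prompt)."""
    tasks = generate_tasks(model, n_tasks)             # open-book generation
    closed = [(problem, solution) for (problem, _doc, solution) in tasks]
    return closed_sft_update(model, closed)            # closed-SFT step
```

The key design point is that documentation is available while *generating* the curriculum but withheld while *updating* parameters, mirroring the closed-book SFT protocol.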

Metrics

  • Rigor: Solutions must (i) pass all test cases, (ii) use only zwc.* calls as verified at the AST level, and (iii) make no NumPy access.
  • Pass@k: acceptance rate over up to 64 sampled solutions, reported with documentation in context as an upper bound; closed-book zero-shot accuracy is always 0%.
  • Retention vs. Generalization: Single-call evaluates recall, multi-call probes ability to compose internalized knowledge.
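An import-level version of the AST rigor check can be sketched with Python's standard `ast` module. This only inspects import statements; the benchmark's actual check is stated to validate zwc.*-only usage more broadly, so treat this as a simplified illustration:

```python
import ast

def uses_only_zwc(source):
    """AST-level rigor check: every import in the candidate solution must
    resolve to the zwc package, so NumPy (or any other escape hatch)
    is rejected before the solution is even executed."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        if any(root != "zwc" for root in roots):
            return False
    return True
```

Checking the parse tree rather than running a string match prevents trivial evasions such as `import numpy as zwc_helper` or aliased from-imports.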

4. Key Findings and Algorithmic Insights

SE-Bench experiments uncover several mechanisms and limitations underpinning self-evolving LLMs:

Open-Book Paradox

  • Supervised fine-tuning (SFT) with documentation present at update time leads to 0% knowledge retention: the agent learns to depend on context rather than encode mappings in weights.
  • Closed-book SFT (documentation removed during updates) achieves up to 39.6% single-call accuracy on Qwen3-8B and loses no performance when documentation is reintroduced at test time, demonstrating that explicit context removal is necessary for knowledge internalization.

The RL Gap

  • PPO-style RL fails entirely (0% closed-book), regardless of whether training is open- or closed-book.
  • Ablations highlight that
    • PPO's probability-ratio clipping impedes large weight updates needed for rare or entirely novel tokens.
    • Negative advantage gradients can erase partial associations during early learning phases.
  • Modifying PPO to remove these obstacles restores performance to SFT levels, suggesting that canonical RL objectives are poorly suited for acquisition of atomic, novel mappings.

Viability of Self-Play with SFT

  • RL-driven self-play yields 0% retention.
  • Self-play with closed-SFT achieves 22.5% on single-call and 8.7% on multi-call test tasks, establishing that LLMs can self-generate effective learning curricula if paired with the correct optimization protocol.

Error Dynamics

  • SFT-trained models primarily suffer from “hallucination” failures (inventing nonexistent API attributes).
  • Additional RL consolidation effectively collapses hallucinations by biasing generations toward robust, previously acquired mappings.

5. Implications for Lifelong Learning and Benchmarking

SE-Bench provides a unique, contamination-free, and highly diagnostic environment for assessing whether and how agents convert context or demonstration into parametrized skill. This fills a critical gap left by static benchmarks, which often conflate memory, context-retrieval, and true internal learning (Bhatia et al., 12 Jul 2025, Zhang et al., 29 May 2025).

Extensibility and Future Research

  • The obfuscation and closed-book paradigm can be systematically transferred to other libraries (Pandas, SciPy) or even to language semantics, enabling a scalable family of synthetic lifelong learning benchmarks.
  • Hybrid optimization strategies are suggested: generate data with rich context but enforce context removal during parameter updates ("Open-Book Paradox" inspired hybrid SFT).
  • RL may be reintegrated for late-stage consolidation once knowledge fragments have been internalized.
  • SE-Bench acts as a "unit test" for when an LLM transitions from context-reliance to direct parameter mastery.

6. Relation to Classical SE-Benchmarks

The nomenclature "SE-Bench" is overloaded. Beyond the self-evolution benchmark (Yuan et al., 4 Feb 2026), SE-Bench (shorthand for "Software Engineering Benchmarks") frequently denotes the family of repository-level benchmarks such as SWE-Bench, SWE-Sharp-Bench, and their successors (Mhatre et al., 4 Nov 2025, Zhang et al., 29 May 2025). These traditional SE-Bench tasks assess an agent's ability to resolve issues in real-world repositories—typically by producing patches that cause failing tests to pass under strict, reproducible environment constraints.

While these benchmarks are essential for tracking progress in practical code synthesis, they do not—by construction—disambiguate true learning of novelty from retrieval or prior exposure. Recent variants (e.g., SWE-bench-Live) have moved toward continuous, contamination-resistant evaluation but still principally target code-generation, bug-fixing, or feature-fulfillment under known APIs (Zhang et al., 29 May 2025). Thus, the self-evolution flavor of SE-Bench occupies a complementary position in the benchmarking ecosystem, specifically targeting intrinsic learning dynamics under strictly novel conditions.

| Benchmark class | Core Objective | Novelty Isolation | Example Reference |
| --- | --- | --- | --- |
| SWE-Bench | Issue/pull request resolution | No | (Zhang et al., 29 May 2025) |
| SWE-Sharp-Bench | Repository-level C# bug-fixing | No | (Mhatre et al., 4 Nov 2025) |
| SE-Bench (Self-Evol.) | Knowledge internalization | Yes (obfuscated) | (Yuan et al., 4 Feb 2026) |

This taxonomy underscores the methodological innovation of SE-Bench (Self-Evolution) in the broader context of software engineering evaluation frameworks.

7. Significance and Future Directions

SE-Bench's closed-book, obfuscated protocol constitutes a gold-standard diagnostic for tracking the transition from context dependence to true weight-based skill acquisition in LLMs. This is directly relevant for research communities aiming to build lifelong learning agents, develop safe and generalizable AI systems, and understand intrinsic model limitations. SE-Bench also exposes concrete optimization bottlenecks—particularly with canonical RL techniques—clarifying necessary adaptations for effective self-evolving AI.

Planned research extensions include:

  • Obfuscation of additional APIs or full language semantics to probe granularity of skill acquisition.
  • Construction of benchmarks with more complex compositionality, memory, or planning demands.
  • Algorithmic innovation in optimization routines for parameterizing genuinely novel functionality.
  • Integration into the broader SE-Bench family to better measure downstream impact of knowledge internalization on real-world software engineering tasks.

SE-Bench thus establishes a reproducible, scalable, and unambiguously interpretable framework at the intersection of lifelong machine learning, agentic reasoning, and software engineering evaluation (Yuan et al., 4 Feb 2026).
