
RepoReason: Repository-Level Code Reasoning

Updated 9 January 2026
  • RepoReason is a white-box diagnostic benchmark that evaluates repository-level code reasoning in LLMs via abductive assertion verification.
  • It employs dynamic program slicing and cognitive metrics (ESV, MCL, and DFI) to identify context overload, state tracking deficits, and aggregation bottlenecks.
  • The framework integrates reproducibility assessments with execution-driven mutations to provide actionable diagnostics for agentic software engineering.

RepoReason is a white-box diagnostic benchmark and analytic framework designed to evaluate and dissect the repository-level code reasoning abilities of autonomous agents, particularly LLMs. Unlike prior benchmarks that focus on isolated code snippets or black-box outputs, RepoReason centers on abductive assertion verification across deeply interdependent, real-world repositories, extracting granular, quantitative explanations of cognitive bottlenecks encountered by agentic LLMs (Li et al., 7 Jan 2026).

1. Motivation and Problem Setting

The evaluation of agent-level reasoning has shifted from small, local code fragments (“laboratory” settings) to entire software repositories that exhibit substantial inter-file dependencies, long call chains, and semantic intricacies characteristic of production-grade systems. The central task is to determine whether an autonomous agent can reconstruct the precise state of a multi-file Python codebase after complex interactions, rather than merely generating or editing isolated segments. This approach emphasizes semantic understanding: verifying that the agent maintains logical consistency throughout extensive, mutable environments. RepoReason addresses the need for white-box diagnostics that explain not just whether but why agentic reasoning fails, attributing failures to exceeded context capacity, state-tracking limitations, or integrative bottlenecks.

In real-world software engineering, agents must traverse cross-file relationships (imports, inheritance, metaprogramming) and ascertain that behavioral invariants encoded as assertions are upheld after intricate transformations. This focus aligns with agentic “software engineers” that prioritize behavioral verification (detecting regressions, optimizing pipelines, or triaging bugs) over surface-level code generation (Li et al., 7 Jan 2026).

2. Abductive Assertion Verification and Execution-Driven Mutation

RepoReason operationalizes repository-level semantic reasoning via abductive assertion verification. The process extracts authentic unit-test assertions that serve as semantic anchors; each assertion’s outcome is masked, compelling the model to infer the hidden value. For example:

assert len(cache) == <mask>

To solve for the mask, the model must reconstruct the full execution path across multi-module call graphs, imports, class hierarchies, and transactional state updates. This constitutes abductive reasoning in its strict sense: inferring the most plausible explanation (masked value) that renders the execution consistent.
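As a purely hypothetical illustration (the module layout, function names, and values below are invented, not drawn from the benchmark), inferring the masked value requires replaying state changes that originate elsewhere in the repository:

```python
# Self-contained sketch; in the benchmark the pieces below would span files.

# --- would live in store/cache.py (hypothetical module) ---
cache = {}

def put(key, value):
    cache[key] = value

def evict_oldest():
    # Remove the insertion-order-oldest entry, if any.
    if cache:
        cache.pop(next(iter(cache)))

# --- would live in tests/test_cache.py (hypothetical test) ---
def test_eviction():
    put("a", 1)
    put("b", 2)
    evict_oldest()
    # RepoReason would replace the literal 1 with <mask>; recovering it
    # requires replaying put/put/evict across both modules.
    assert len(cache) == 1

test_eviction()
```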

An execution-driven mutation framework mitigates contamination from memorized public tests. Each repository undergoes a teacher-LM-guided mutation (a minimal before/after sketch follows the list):

  1. Visual mutations (renaming variables, reordering imports, altering comments),
  2. Semantic mutations (modifying inputs, constants, fixtures),
  3. Invariant preservation of the API call sequence, maintaining logical complexity.
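The sketch below illustrates the difference between visual and semantic mutations on a toy test; the Cart class, test names, and constants are invented for illustration and are not taken from the benchmark repositories.

```python
# Illustrative only: Cart, the test names, and the constants are invented.
class Cart:
    def __init__(self, tax_rate):
        self.tax_rate = tax_rate
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return round(sum(p for _, p in self.items) * (1 + self.tax_rate), 2)

# Original public test fragment.
def test_totals():
    cart = Cart(tax_rate=0.08)
    cart.add("book", 12.50)
    assert cart.total() == 13.50

# Visual mutation: identifiers renamed and a comment added; behavior unchanged.
def test_totals_visual():
    basket = Cart(tax_rate=0.08)  # same API call sequence as the original
    basket.add("book", 12.50)
    assert basket.total() == 13.50

# Semantic mutation: constants changed, so a memorized answer (13.50) no longer
# holds, while the API call sequence (construct -> add -> total) is preserved.
def test_totals_semantic():
    basket = Cart(tax_rate=0.10)
    basket.add("book", 20.00)
    assert basket.total() == 22.00

for t in (test_totals, test_totals_visual, test_totals_semantic):
    t()
```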

Ground-truth target values for masked assertions are obtained by probe injection (instrumented print statements), re-execution in controlled environments, and style-preserving regeneration of assertions using deterministic value filtering. Only deterministic, single-valued assertions are retained, ensuring the semantic oracle provides stable targets (Li et al., 7 Jan 2026).
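A minimal sketch of this probe-and-filter idea appears below, assuming the masked expression is simply re-evaluated and printed during a controlled re-execution; the helper names, the PROBE marker, and the three-run stability check are illustrative assumptions rather than the benchmark's actual tooling.

```python
import subprocess
import sys
import textwrap

# Test body with an injected probe for the expression that will be masked.
probed_test = textwrap.dedent("""
    cache = {"a": 1, "b": 2}
    cache.pop("a")
    print("PROBE:", repr(len(cache)))   # injected probe for the masked expression
    assert len(cache) == 1
""")

def probe_once():
    # Re-execute the instrumented test in a fresh interpreter and read the probe.
    out = subprocess.run([sys.executable, "-c", probed_test],
                         capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        if line.startswith("PROBE:"):
            return line.split("PROBE:", 1)[1].strip()

# Deterministic-value filtering: keep the assertion only if repeated runs agree.
values = {probe_once() for _ in range(3)}
if len(values) == 1:
    ground_truth = values.pop()          # e.g. "1" -> regenerate `== 1`
    print("stable ground truth:", ground_truth)
else:
    print("non-deterministic assertion; discarded")
```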

3. Dynamic Program Slicing and Cognitive Metrics

RepoReason employs dynamic program slicing to isolate the minimal causal computational subgraph relevant to each assertion. This backward slice traverses all data and control dependencies influencing the asserted value, potentially spanning multiple files and call chains, but excluding extraneous code.
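The toy example below, with invented variable names, marks which statements would belong to the backward slice of a single assertion and which would be excluded:

```python
# Hypothetical backward slice: only statements the asserted value depends on
# (through data or control dependencies) belong to the slice.
limit = 3                      # in slice: controls the loop bound
label = "run"                  # NOT in slice: never affects `total`
total = 0                      # in slice: initializes the asserted value
log = []                       # NOT in slice: only feeds `log`

for i in range(limit):         # in slice: control dependency of the updates
    total += i                 # in slice: data dependency of `total`
    log.append(label)          # NOT in slice

assert total == 3              # slicing criterion: the asserted expression
```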

Three orthogonal metrics are computed for each benchmark instance:

  • ESV (Effective Sliced Volume): Measures reading load as the cumulative source code (in LoC) of all functions/methods present in the slice, normalized to a 600 LoC baseline.

\mathrm{ESV} = \sum_{f \in \text{slice}} \frac{\mathrm{LoC}(f)}{600}

  • MCL (Mutation Chain Length): Quantifies simulation depth as the sum of all executed statements in the slice, weighted by execution frequency and normalized to 100 steps.

\mathrm{MCL} = \sum_{s \in \text{slice}} \frac{\mathrm{freq}(s)}{100}

  • DFI (Dependency Fan-in): Captures integration width as the number of distinct upstream external inputs to the slice (e.g., constructor args, globals), normalized by 20.

\mathrm{DFI} = \frac{|\{\text{external inputs}\}|}{20}

High ESV reflects a risk of context overload (the model must hold large code contexts); high MCL exposes a state-tracking deficit (long chains of state transitions); high DFI signals an aggregation deficit (many independent constraints to synthesize) (Li et al., 7 Jan 2026).
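A minimal sketch of how the three normalized metrics could be computed is shown below, assuming the dynamic slice has already been reduced to per-function LoC counts, per-statement execution frequencies, and a set of external inputs; this representation is an assumption for illustration, not the benchmark's internal data model.

```python
def esv(loc_per_sliced_function, baseline=600):
    """Effective Sliced Volume: total LoC of sliced functions, normalized by 600."""
    return sum(loc_per_sliced_function.values()) / baseline

def mcl(exec_freq_per_statement, baseline=100):
    """Mutation Chain Length: frequency-weighted executed statements, normalized by 100."""
    return sum(exec_freq_per_statement.values()) / baseline

def dfi(external_inputs, baseline=20):
    """Dependency Fan-in: distinct upstream external inputs, normalized by 20."""
    return len(set(external_inputs)) / baseline

# Invented slice data for illustration.
slice_loc = {"cache.put": 14, "cache.evict_oldest": 9, "orders.total": 55}
stmt_freq = {"cache.py:12": 40, "cache.py:13": 40, "orders.py:88": 1}
inputs = ["tax_rate", "CACHE_LIMIT", "fixture_items"]

print(esv(slice_loc), mcl(stmt_freq), dfi(inputs))
```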

4. Experimental Findings and Cognitive Bottlenecks

RepoReason’s white-box evaluation of frontier LLM-based agents reveals the following results:

  • Overall Pass@1 accuracy:
    • Claude-4.5-Sonnet: 66.98%
    • DeepSeek-v3.1-Terminus: 60.96%
    • GPT-5.2: 56.86%
  • Aggregation deficit (high DFI) is the dominant bottleneck: as the dependency fan-in exceeds roughly 20 external inputs, accuracy falls below 40% for every model except Claude-4.5, with the steepest observed accuracy decline and the strongest negative correlation (Pearson ρ(accuracy, DFI) = −0.234 for GPT-5.2, versus ρ(accuracy, ESV) = −0.188 and ρ(accuracy, MCL) = −0.158).
  • Context and simulation “cliffs”: error rates rise sharply above ~600 LoC of sliced code (ESV), and logical consistency degrades beyond ~100 mutation steps (MCL).

A plausible implication is that while agents can tolerate significant context size and moderate state-update depth, their principal failure is in parallel integration of multiple independent facts—a synthesis bottleneck (Li et al., 7 Jan 2026).

5. Integration with Reproducibility Assessment and Repository Diagnostics

The RepoReason framework incorporates reproducibility assessment workflows based on automated README parsing, as established in prior work (Akdeniz et al., 2023). Using the Papers-with-Code reproducibility checklist, README sections are classified and scored either with similarity-based embedding methods (Sentence-BERT with cosine similarity) or with hierarchical attention transformers. The section-based classifier provides transparent, actionable diagnostics, highlighting missing or inadequate documentation per checklist item. Direct integration with RepoReason allows repositories to be flagged for both semantic code-reasoning deficits and reproducibility deficits, enabling full-loop, structured reporting and targeted recommendations.
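A minimal sketch of the similarity-based scoring path is shown below; the embedding model name, the checklist wording, and the 0.5 coverage threshold are illustrative choices rather than the exact configuration of the cited work.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative checklist items and README sections (not the paper's exact data).
checklist = [
    "Dependencies and environment setup instructions",
    "Training commands or scripts",
    "Evaluation commands and expected results",
    "Links to pre-trained models or datasets",
]
readme_sections = {
    "Installation": "Run pip install -r requirements.txt with Python 3.10.",
    "Usage": "python train.py --config configs/base.yaml trains the model.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
item_emb = model.encode(checklist, convert_to_tensor=True)
sec_emb = model.encode(list(readme_sections.values()), convert_to_tensor=True)

scores = util.cos_sim(item_emb, sec_emb)          # checklist items x sections
for i, item in enumerate(checklist):
    best = scores[i].max().item()
    status = "covered" if best > 0.5 else "missing/inadequate"
    print(f"{item}: {status} (best similarity {best:.2f})")
```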

6. Implications for Model Design and Agentic Software Engineering

RepoReason’s attribution of failures to context overload, state tracking deficit, or aggregation deficit enables targeted architectural interventions:

  • Memory augmentation (e.g., external memory or knowledge graphs) to mitigate context overload,
  • State summarization mechanisms (e.g., automated symbolic traces) to buffer long execution chains,
  • Constraint aggregation modules (e.g., structured attention, neural selectors) for synthesizing high-fan-in dependencies.

Agentic workflows are anticipated to shift toward verification-first processes—periodically executing mutated tests and using execution oracles to ground predictions—rather than mere code synthesis. This design philosophy is reinforced by the transparent, per-instance cognitive profiling established by RepoReason benchmarks (Li et al., 7 Jan 2026).

7. Theoretical Context and Relationship to Preference Optimization

In the context of learning from human preferences, ReLU-based Preference Optimization (RePO; “RepoReason” is an editor’s term here) provides a theoretical rationale for simplifying offline preference optimization algorithms for language agents (Wu et al., 10 Mar 2025). Its analysis of margin-based surrogates shows that eliminating unnecessary hyperparameters and optimizing a hard-margin, ReLU-based max-margin loss yields the convex envelope of the 0–1 preference loss. Although distinct in implementation from abductive code reasoning, this theoretical parallel underscores the analytical rigor favored in RepoReason’s design and evaluation methodology.
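For concreteness, the sketch below shows a ReLU-based max-margin preference loss of the kind described, assuming per-example log-probabilities of chosen and rejected responses are already available; the margin value and the absence of length normalization are simplifying assumptions, not RePO's exact formulation.

```python
import torch

def relu_margin_preference_loss(logp_chosen, logp_rejected, gamma=1.0):
    """Hinge-style (ReLU) max-margin preference loss.

    logp_chosen / logp_rejected: per-example log-probabilities of the preferred
    and dispreferred responses; gamma is the target margin. Both the margin
    value and the choice of (no) length normalization are illustrative.
    """
    margin = logp_chosen - logp_rejected
    return torch.relu(gamma - margin).mean()

# Example: three preference pairs with synthetic log-probabilities.
chosen = torch.tensor([-4.2, -3.1, -5.0])
rejected = torch.tensor([-5.0, -2.9, -7.5])
print(relu_margin_preference_loss(chosen, rejected))
```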


RepoReason defines a new paradigm for benchmarking and dissecting agentic code understanding at repository scale. By unifying abductive assertion verification, dynamic program slicing, and transparent cognitive metrics, and by crosslinking reproducibility analysis, RepoReason supplies both a practical benchmark and a roadmap for advancing agentic, full-stack software engineering with LLM-based agents (Li et al., 7 Jan 2026, Akdeniz et al., 2023, Wu et al., 10 Mar 2025).
