Automated Reasoning Critic (ARC)

Updated 25 February 2026
  • ARC is a framework that decouples intermediate reasoning generation from evaluation, using structured feedback to improve logical and programmatic outputs.
  • It leverages neural, symbolic, and hybrid critic architectures to diagnose errors and guide refinement in tasks such as question answering and program synthesis.
  • Empirical results show ARC systems enhance performance by providing precise step-level corrections, improving accuracy and convergence in iterative reasoning.

The Automated Reasoning Critic (ARC) is a class of system architectures and methodologies that act as automated evaluators and feedback providers for intermediate reasoning steps or solutions produced by humans or artificial agents, especially in complex, multi-step, or open-domain tasks requiring genuine logical, symbolic, or scientific reasoning. The ARC paradigm spans neural, symbolic, and neuro-symbolic approaches, including instantiations as transformer-based critics trained to provide structured feedback, logic-program-checkers for code generation, or adversarial self-play critics for step-wise reasoning verification in LLMs. ARC mechanisms are central to advancing state-of-the-art performance on demanding benchmarks in question answering, program synthesis, and formal logic reasoning, as they facilitate iterative improvement, robust error diagnosis, and alignment with semantic specifications (Paul et al., 2023, Kalyanpur et al., 2024, Chen et al., 27 Apr 2025, Rocha et al., 2024).

1. Core Principles and Formalizations

A central tenet of ARC systems is the explicit decoupling of reasoning step generation (“actor”) from reasoning assessment (“critic”), enabling targeted evaluation, error detection, and refinement beyond scalar reward feedback. Generally, the ARC is defined as a conditional model or program

$$\text{ARC}: (x, z) \mapsto f$$

where $x$ is the problem context (e.g., a science question, input grid, or premise set), $z$ is a candidate intermediate reasoning step or solution (natural language, equation, logic statement, or program), and $f$ is structured feedback: verdicts, error types, error localization, and potentially structured natural-language hints (Paul et al., 2023, Kalyanpur et al., 2024).

Instantiations include:

  • Sequence-to-sequence Transformer critics generating semi-structured error feedback based on (context, step) pairs (Paul et al., 2023).
  • Symbolic critics (e.g., ASP, ILP solvers) executing candidate logic programs against formal test suites, returning compilation errors, test failures, and proof-localized counterexamples (Kalyanpur et al., 2024, Rocha et al., 2024).
  • Adversarial critics in self-play setups, automatically evolving their discriminative power against model-generated “sneaky” errors, using RL signals to iteratively update critic parameters (Chen et al., 27 Apr 2025).

In each ARC system, the feedback signal is utilized by upstream reasoners or actors for iterative process refinement, step rejection/regeneration, or targeted code correction.

2. Critic Architectures and Automated Training Regimes

ARC implementations span from purely neural to deeply symbolic or hybrid. Neural critics are typically encoder-decoder Transformers (e.g., T5/UQA-Base) consuming context and step as concatenated input, and emitting a feedback string or error diagnosis, trained via maximum likelihood on synthetic feedback generated through perturbation schemes (operator swaps, number changes, etc.) or LLM prompt outputs (Paul et al., 2023). Symbolic critics, such as Answer Set Programming (ASP) solvers or Inductive Logic Programming (ILP) engines, receive declarative programs or DSL code and execute semantic test batteries, identifying unsatisfiable constraints or provenance of failures at the level of rules or clauses (Rocha et al., 2024, Kalyanpur et al., 2024).
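A minimal sketch of such a perturbation scheme for synthesizing (corrupted step, error label) training pairs; it assumes whitespace-tokenized arithmetic steps like "12 + 7 = 19", and the function and label names are illustrative:

```python
import random

OPERATORS = ["+", "-", "*", "/"]

def perturb_step(step: str, rng: random.Random) -> tuple[str, str]:
    """Synthesize a (corrupted_step, error_label) training pair via the
    perturbation schemes above: operator swaps and number changes."""
    tokens = step.split()
    op_idxs = [i for i, t in enumerate(tokens) if t in OPERATORS]
    num_idxs = [i for i, t in enumerate(tokens) if t.isdigit()]
    if op_idxs and rng.random() < 0.5:
        # Operator swap: replace one operator with a different one.
        i = rng.choice(op_idxs)
        tokens[i] = rng.choice([o for o in OPERATORS if o != tokens[i]])
        label = "incorrect-operator"
    elif num_idxs:
        # Number change: shift one operand or result by a small offset.
        i = rng.choice(num_idxs)
        tokens[i] = str(int(tokens[i]) + rng.randint(1, 9))
        label = "incorrect-number"
    else:
        label = "no-error"
    return " ".join(tokens), label
```

Applied at scale to gold reasoning chains, this yields supervision pairs for maximum-likelihood critic training without any manual step-level annotation.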

Automated training regimes obviate the need for expensive human annotation by using:

  • Rule-based or LLM-synthesized error traces and structured feedback, labeling thousands of instances without manual step-by-step supervision (Paul et al., 2023).
  • Self-play reinforcement learning, in which critic models are pitted against generator agents deliberately crafting hard-to-detect mistakes; the critic continually improves its discrimination capabilities through adversarial cycles (Chen et al., 27 Apr 2025).

This mechanistic design enables robust critic generalization in both neural and symbolic spaces while maintaining scalability to large datasets or highly structured domains.
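The adversarial self-play cycle can be caricatured in a few lines; in a real system the critic is a trained model whose parameters are updated by RL from the accuracy-style reward computed below, whereas here both sides are fixed stub rules:

```python
import random

def sneaky_generator(step: str) -> str:
    """Adversarial generator: a single operator swap yields a subtle error."""
    return step.replace("+", "-", 1)

def critic(step: str) -> bool:
    """Stub critic: True = step judged erroneous. A trained critic would be
    a parametric model updated from the reward signal below."""
    return "-" in step

# One adversarial round: corrupt roughly half the steps, then score the
# critic; its detection accuracy serves as the reward driving the next update.
rng = random.Random(0)
gold = ["2 + 3 = 5"] * 10
batch = [sneaky_generator(s) if rng.random() < 0.5 else s for s in gold]
labels = [s != g for s, g in zip(batch, gold)]  # True = corrupted
preds = [critic(s) for s in batch]
accuracy = sum(p == l for p, l in zip(preds, labels)) / len(gold)  # reward
```

In the full self-play setup the generator is likewise rewarded for errors the critic misses, so both agents escalate in capability over successive rounds.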

3. Actor-Critic and Generator-Critic Loops

ARC is often embedded in a closed-loop or iterative refinement protocol, where the generator or actor proposes candidate steps/programs and the critic returns feedback for error localization and correction. In the REFINER framework, the generator $G_\theta$ and critic $C_\phi$ alternate: the generator proposes intermediate steps, the critic issues semi-structured feedback, and the generator updates its subsequent hypothesis using this feedback, with early stopping if the critic emits a "no hint" or "no error" message (Paul et al., 2023).
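This alternation can be sketched as a small loop; the toy generator and critic below are illustrative stand-ins for the trained models, not REFINER's actual components:

```python
def refine(context, generator, critic, max_iters=4):
    """Generator-critic loop: propose a step, get feedback, regenerate
    conditioned on that feedback; stop early on a 'no error' verdict."""
    step, feedback = None, None
    for _ in range(max_iters):
        step = generator(context, feedback)
        feedback = critic(context, step)
        if feedback == "no error":
            break
    return step, feedback

# Toy instantiation: the generator first proposes a wrong operator and
# corrects it once the critic's hint arrives. (eval is safe here because
# the steps are fixed toy strings.)
def toy_generator(context, feedback):
    return "5 + 3 = 8" if feedback else "5 - 3 = 8"

def toy_critic(context, step):
    lhs, rhs = step.split("=")
    return "no error" if eval(lhs) == int(rhs) else \
        "The operator in step #1 is incorrect"
```

The early-stopping check is what lets ARC act as a cheap drop-in verifier at inference time: correct chains exit after one critic call.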

Table: Canonical ARC Loop Structures

| Approach | Generator Role | Critic Role |
| --- | --- | --- |
| REFINER/ARC (Paul et al., 2023) | Generate intermediate natural-language steps | Emit structured feedback string |
| LLM-ARC (Kalyanpur et al., 2024) | Generate logic program + test suite | Execute code/tests, report failures |
| ILP-ARC (Rocha et al., 2024) | Generate DSL-based symbolic rules | Test on examples, prune hypotheses |
| SPC (Chen et al., 27 Apr 2025) | Generate deliberately erroneous steps | Classify step as correct/incorrect |

In LLM-ARC, a neuro-symbolic actor generates a declarative logic program and semantic tests; the ARC critic executes the program in an ASP solver, runs the tests, and surfaces error locations, which are then used for actor-side iterative repair until all tests pass or a maximum number of cycles is reached (Kalyanpur et al., 2024).
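A language-agnostic sketch of this execute-and-repair loop follows; a real LLM-ARC critic compiles to ASP and invokes a solver, whereas here the "program" is a plain lookup table and all names are hypothetical:

```python
def run_test_suite(program: dict, tests: list) -> list:
    """Critic side: execute the candidate 'program' (a lookup table standing
    in for a compiled ASP program) against the tests; localize failures."""
    failures = []
    for name, query, expected in tests:
        got = program.get(query)
        if got != expected:
            failures.append(f"test {name!r}: {query} -> {got}, expected {expected}")
    return failures

def repair_loop(actor, tests, max_cycles=3):
    """Actor side: regenerate the program from surfaced failures until
    every test passes or the cycle budget is exhausted."""
    program, failures = {}, []
    for _ in range(max_cycles):
        program = actor(failures)
        failures = run_test_suite(program, tests)
        if not failures:
            break
    return program, failures

# Toy run: the actor omits a fact on its first try, then adds it once the
# critic's failure report arrives (converges on cycle 2).
tests = [("t1", "mortal(socrates)", True)]
actor = lambda failures: {"mortal(socrates)": True} if failures else {}
program, remaining = repair_loop(actor, tests)
```

The essential design choice is that failures are returned as localized, human-readable records rather than a scalar score, so the actor can condition its next generation on exactly what broke.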

4. Empirical Performance and Error Analysis

ARC-based systems yield substantial improvements on a range of benchmarks and error metrics by enabling precise, step-level supervision and verification. On the FOLIO logical reasoning task, LLM-ARC attains a new state-of-the-art accuracy of 88.32% using an actor-critic architecture where the neuro-symbolic critic iteratively surfaces semantic code failures and minimal-explanation proofs, driving the LLM-generated logic code to convergence (Kalyanpur et al., 2024). In neural critics, REFINER shows +3–13 points improvement on a range of intermediate and final answer metrics for math word problems, synthetic natural language reasoning, and norm inference, even when used as a drop-in module at inference time with off-the-shelf GPT-3.5 (Paul et al., 2023).

Typical ARC error analysis reveals:

  • Recurring failures in existential quantification and instance/type conflation due to ASP’s lack of native existential reasoning (Kalyanpur et al., 2024).
  • Under-parameterized rules involving multiple variables, leading to incomplete logic programs.
  • Occasional non-convergence: roughly 30% of failed cases showed no change between iterations, often because the LLM actor ignored critical feedback or constraints were insufficiently enforced (Kalyanpur et al., 2024).
  • For step-level neural critics, adversarially evolving error types make detection increasingly challenging, but also drive up discriminative and verification performance (Chen et al., 27 Apr 2025).

5. Critic Taxonomies, Feedback Formats, and Knowledge/Reasoning Classification

ARC systems benefit from explicit taxonomies of error types, knowledge categories, and reasoning labels to structure feedback and stratify failure analysis. For instance, ARC critics in REFINER are trained to recognize distinct error classes such as “Incorrect Numbers,” “Incorrect Operators,” “Missing Knowledge Link,” and output structured templates (“The operator in #2 is incorrect”) (Paul et al., 2023). Symbolic critics trace failed queries to specific logic rules or object-centric predicates, surfacing proof fragments or contexts for actor-side program repair (Kalyanpur et al., 2024, Rocha et al., 2024).
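Rendering feedback from such a taxonomy can be as simple as template lookup; the class names and template strings below are illustrative, patterned on the REFINER-style labels discussed above:

```python
TEMPLATES = {
    "incorrect-number":   "The number in #{pos} is incorrect",
    "incorrect-operator": "The operator in #{pos} is incorrect",
    "missing-link":       "A knowledge link is missing before #{pos}",
}

def render_feedback(error_type: str, pos: int) -> str:
    """Map an error class from the taxonomy to a semi-structured
    feedback string with the error localized to step #pos."""
    return TEMPLATES[error_type].format(pos=pos)
```

Keeping the error class and step position explicit, rather than free-form prose, is what makes the feedback both machine-consumable for the actor and stratifiable for failure analysis.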

Systematic annotation frameworks, as developed for science QA in the AI2 Reasoning Challenge, enable critics to categorize questions and failures by knowledge type (Definition, Basic Facts, Causes/Processes, Algebraic) and reasoning type (Multihop, Causal/Explanation, Hypothetical, Physical Model), providing fine-grained diagnosis of system weaknesses and research targets (Boratko et al., 2018).

6. Extensions, Limitations, and Future Research Directions

ARC frameworks encounter limitations in expressivity, generalization, and enforcement. For symbolic critics, existing ASP-based ARCs cannot represent existential quantification or higher-order combinators without extending the DSL or shifting to more expressive theorem provers (Kalyanpur et al., 2024, Rocha et al., 2024). Neural critics may plateau in adversarial self-play if feedback is ignored or insufficiently enforced by actor-side architectures (Kalyanpur et al., 2024, Chen et al., 27 Apr 2025).

Open research directions identified include:

  • Automatic expansion of critic DSLs to capture new object predicates and reasoning templates (Rocha et al., 2024).
  • Modular integration with multimodal or end-to-end differentiable models, bridging vision, natural language, and logic (Camposampiero et al., 2023).
  • Enhancing feedback coverage and interpretability by combining neuro-symbolic critics with additional LLMs for proof trace surfacing or function-calling API constraints (Kalyanpur et al., 2024).
  • Leveraging human-labeled taxonomies to stratify critic performance and drive targeted improvements in question or reasoning categories with high failure rates (Boratko et al., 2018).

In summary, Automated Reasoning Critic frameworks operationalize fine-grained, step-level verification and semantic feedback for reasoning systems, enabling robust iterative improvement in domains where correctness, explicitness, and logical validity are essential. The ARC paradigm constitutes a foundational element for explainable, trustworthy, and generalizable AI reasoning (Paul et al., 2023, Kalyanpur et al., 2024, Chen et al., 27 Apr 2025, Rocha et al., 2024).
