Dual-Model Reflexion: Architectures & Insights

Updated 30 June 2026

Dual-Model Reflexion is an AI paradigm that decouples reasoning and self-critique into specialized models, effectively mitigating biases and enhancing decision quality.
This modular architecture, exemplified by systems like MAR, DARS, and REFLEX, leverages independent critique and consensus synthesis for iterative refinement.
Reported gains reach 6.2 percentage points over single-agent reflexion on HumanEval, alongside improved error recovery and sample efficiency.

Dual-Model Reflexion denotes a spectrum of architectures and algorithms that address the limitations of pure single-agent introspective reflection in LLMs and vision-LLMs (VLMs). In such systems, self-critique and iterative answer refinement are decomposed—either structurally or functionally—across two (or more) specialized models, typically assuming distinct roles such as reasoning/acting and critique/reflection. Dual-model systems improve diversity, transparency, and effectiveness of reflection, mitigating well-documented issues such as degeneration of thought, confirmation bias, and uninformative self-assessment. Implementations of dual-model (or more generally, multi-agent) reflexion encompass both text-based and multimodal domains, target both policy improvement and perceptual refinement, and employ various interaction protocols between the constituent models.

1. Fundamental Architectures and Roles

Several instantiations of dual-model reflexion emerge in recent research, each formalizing the separation of reasoning and critiquing capacities.

In MAR (Multi-Agent Reflexion), the Actor generates an initial response and reasoning chain given the accumulated context of past reflections, while multiple debater models (with configurable personae, e.g., Verifier, Logician) independently critique this reasoning. A Judge module then synthesizes these critiques into a consensus reflection, which conditions the Actor’s next attempt (Ozer et al., 23 Dec 2025).
In DARS (Dual-Model Reasoner-Critic System), a Reasoner proposes and iteratively refines answers to a task, while a Critic model inspects each proposal, emitting either a targeted verbal critique or a termination token. The Critic’s interaction halts the reflection process once the proposal is assessed correct (Li et al., 26 Feb 2025).
Reflective Perception (RePer) in visual domains alternates between a policy model (perceptual prediction) and a critic model (which scores answers and issues textual feedback) to drive multi-turn perceptual refinement (Wei et al., 9 Apr 2025).
REFLEX decouples code diagnosis and repair: a vision-enabled Critic distills programmatic behavioral evidence into structured, auditable diagnoses, which the Actor uses (alongside a Skill Memory) to synthesize improved policies (Wang, 15 Jun 2026).
SRPO for multimodal reasoning leverages a frozen reflection generator (Critic) to analyze initial outputs and a separate policy model (Actor) to revise solutions in response (Wan et al., 2 Jun 2025).

The design rationale is to structurally separate error detection, critique/rationale generation, and corrective synthesis, instead of relying on a single model for all cognitive steps.
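The separation of roles described above can be sketched as a minimal two-model loop. This is an illustrative sketch only, not the implementation of any cited system: the `STOP` token, the `reflect` function, and the toy reasoner and critic are hypothetical stand-ins for real model calls.

```python
from typing import Callable, List, Tuple

STOP = "<stop>"  # hypothetical termination token; real systems define their own

Reasoner = Callable[[str, List[str]], str]  # (task, past critiques) -> answer
Critic = Callable[[str, str], str]          # (task, answer) -> critique or STOP


def reflect(task: str, reasoner: Reasoner, critic: Critic,
            max_turns: int = 4) -> Tuple[str, List[str]]:
    """Alternate Reasoner and Critic until the Critic emits STOP or the budget runs out."""
    critiques: List[str] = []
    answer = reasoner(task, critiques)
    for _ in range(max_turns):
        feedback = critic(task, answer)
        if feedback == STOP:
            break  # the Critic judged the current answer acceptable
        critiques.append(feedback)
        answer = reasoner(task, critiques)  # revise using all critiques so far
    return answer, critiques


# Toy stand-ins for the two models (no LLM calls are made here).
def toy_reasoner(task: str, critiques: List[str]) -> str:
    return "4" if critiques else "5"


def toy_critic(task: str, answer: str) -> str:
    return STOP if answer == "4" else "Recheck the arithmetic: 2 + 2 is not 5."


final_answer, history = reflect("What is 2 + 2?", toy_reasoner, toy_critic)
```

Keeping the Reasoner and Critic behind separate callables means either side can be swapped for a different model without touching the loop, which is the structural point of the decoupling.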

2. Mathematical Formalisms and Algorithmic Workflows

Dual-model reflexion systems are formalized through stepwise operator flows and probabilistic models:

In MAR, let $q$ be the task and $\mathcal{R}_t$ the set of reflections at step $t$. The Actor $M_a$ generates a reasoning chain and response $(\tau_t, y_t)$, conditioned on $\mathcal{R}_t$. Each Debater $D_j$ produces critiques $c_j^{(r)}$, possibly over multiple rounds with interleaved access to peer critiques. The Judge $J$ aggregates these into a single reflection $r_t$, forming a new context for the next Actor attempt. The process halts when the Evaluator accepts the current response (Ozer et al., 23 Dec 2025).
DARS inference comprises an initial rationale from the Reasoner, followed by a loop: the Critic inspects the current rationale and emits either a reflection or a termination token; if not stopped, the Reasoner produces a refined rationale conditioned on that reflection, and the loop iterates (Li et al., 26 Feb 2025).
RePer alternates between its two models: the policy model produces an initial answer and the critic returns a score with textual feedback; at each subsequent turn the policy revises its answer given the feedback and the critic re-evaluates it, exiting on convergence (Wei et al., 9 Apr 2025).
REFLEX's Critic is modeled as a conditional distribution over structured diagnoses given the observed behavioral evidence; the Actor samples a revised policy conditioned on the diagnosis, with Skill Memory retrieval and cross-run code transfer (Wang, 15 Jun 2026).
In SRPO, after a self-reflection generator emits a critique, the policy model performs an explicit self-correction phase. In reflection-aware RL, reward decomposes into task and reflection quality, promoting concise, cognitively-meaningful reflection steps (Wan et al., 2 Jun 2025).
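The MAR workflow in the list above (Actor, Debaters, Judge, Evaluator) can be sketched as a staged loop. This is a sketch under stated assumptions: the function signatures, the `MarState` container, and the toy agents are hypothetical, and the way peer critiques are shared between rounds is one plausible reading of "interleaved access", not the published protocol.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MarState:
    reflections: List[str] = field(default_factory=list)  # accumulated reflections
    attempts: List[str] = field(default_factory=list)     # Actor outputs so far


def mar_loop(task: str,
             actor: Callable[[str, List[str]], str],
             debaters: List[Callable[[str, str, List[str]], str]],
             judge: Callable[[str, str, List[str]], str],
             evaluator: Callable[[str, str], bool],
             max_attempts: int = 3,
             rounds: int = 2) -> Tuple[str, MarState]:
    state = MarState()
    for _ in range(max_attempts):
        answer = actor(task, state.reflections)
        state.attempts.append(answer)
        if evaluator(task, answer):
            return answer, state  # halt once the Evaluator accepts
        critiques: List[str] = []
        for _ in range(rounds):
            # Assumption: each debater sees the previous round's critiques.
            critiques = [d(task, answer, critiques) for d in debaters]
        # The Judge condenses all critiques into one reflection for the next attempt.
        state.reflections.append(judge(task, answer, critiques))
    return state.attempts[-1], state


# Toy agents (hypothetical; no LLM calls).
toy_actor = lambda task, refl: "fixed" if refl else "draft"
toy_debaters = [
    lambda task, ans, peers: f"verifier: '{ans}' fails the check",
    lambda task, ans, peers: f"logician: step 2 is unsupported ({len(peers)} peer notes seen)",
]
toy_judge = lambda task, ans, crits: " | ".join(crits)
toy_evaluator = lambda task, ans: ans == "fixed"

mar_answer, mar_state = mar_loop("demo task", toy_actor, toy_debaters,
                                 toy_judge, toy_evaluator)
```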

This architectural modularity is represented in both staged pseudocode (as in MAR, REFLEX) and in explicit loss decompositions for joint training or supervised fine-tuning.
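To make the idea of a reward that decomposes into task and reflection quality concrete, the following sketch combines a task term with a reflection term. The functional form, the weights, and the token budget are all assumptions chosen for illustration; the text does not give SRPO's actual reward, and this should not be read as its definition.

```python
def reflection_reward(correct_before: bool, correct_after: bool,
                      n_tokens: int, target_tokens: int = 64) -> float:
    """Hypothetical reflection-quality term: reward useful corrections, penalize verbosity."""
    if not correct_before and correct_after:
        effect = 1.0    # the reflection fixed a wrong answer
    elif correct_before and not correct_after:
        effect = -1.0   # the reflection broke a correct answer
    else:
        effect = 0.0    # no change in correctness
    overflow = max(0, n_tokens - target_tokens)
    brevity_penalty = min(1.0, overflow / target_tokens)  # capped at 1.0
    return effect - 0.5 * brevity_penalty


def total_reward(correct_before: bool, correct_after: bool,
                 n_tokens: int, weight: float = 0.5) -> float:
    """Task reward plus a weighted reflection-quality term (weight is an assumption)."""
    task = 1.0 if correct_after else 0.0
    return task + weight * reflection_reward(correct_before, correct_after, n_tokens)


helpful = total_reward(False, True, n_tokens=32)    # concise, effective correction
verbose = total_reward(True, True, n_tokens=128)    # correct but redundant reflection
harmful = total_reward(True, False, n_tokens=32)    # reflection degraded the answer
```

Under these assumed weights, a concise reflection that repairs a wrong answer scores highest, a redundant one on an already-correct answer is discounted, and one that breaks a correct answer is penalized.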

3. Experimental Results and Performance

Dual-model reflexion improves over single-model or naive reflection baselines in most reported settings, and in one case (REFLEX) matches the monolithic baseline at lower cost:

System	Task	Baseline	Single-Model Reflexion	Dual-Model/MA Reflexion	Gain
MAR (Ozer et al., 23 Dec 2025)	HotPotQA EM (%)	32.0	44.0	47.0	+3.0 pp (over single-model)
MAR	HumanEval (%)	67.1	76.4	82.6	+6.2 pp (over single-model)
REFLEX (Wang, 15 Jun 2026)	Lunar Lander NWS	1.098 (monolithic)	–	1.092	comparable (faster, fewer calls)
SRPO (Wan et al., 2 Jun 2025)	MathVista (%)	72.3 (GRPO)	–	75.8	+3.5 pp (over baseline)
RePer (Wei et al., 9 Apr 2025)	DetailCaps (CAPTURE)	51.03	–	52.89	+1.86 (over baseline)
DARS (Li et al., 26 Feb 2025)	ASAS ACC/F1/QWK	see text	SFT/DPO	+5/+11/+2	improves all metrics

These gains are attributed to increased error-diversity in critique, higher-quality reflections, and the avoidance of stalling/degeneration phenomena seen in introspective looping.
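As a quick arithmetic check, the gains in the table can be recomputed from the reported scores. The numbers below are copied from the table; the dictionary layout is only for illustration. Note that the MAR gains are measured against single-model reflexion, while the SRPO and RePer gains are measured against their respective baselines.

```python
# (reference score, dual-model score), copied from the results table.
rows = {
    "MAR / HotPotQA EM": (44.0, 47.0),     # vs. single-model reflexion
    "MAR / HumanEval": (76.4, 82.6),       # vs. single-model reflexion
    "SRPO / MathVista": (72.3, 75.8),      # vs. GRPO baseline
    "RePer / DetailCaps": (51.03, 52.89),  # vs. baseline
}

# Round to two decimals to absorb floating-point noise in the subtraction.
gains = {name: round(after - before, 2) for name, (before, after) in rows.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```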

4. Design Considerations and Ablation Insights

Principal design choices include the number and type of critic models (personae), stopping criteria, conversation rounds, and temperature settings:

MAR shows that increasing the number of debaters has diminishing returns but linearly increases cost. Two debate rounds suffice for >95% disagreement capture.
Critic persona temperatures strongly modulate critique quality: lower temperatures make critiques stricter, while higher temperatures enable more exploratory reasoning.
In DARS, the separation of Reasoner and Critic is essential: merging the two into a single model degrades performance, while scaling Critic size yields superlinear metric improvement.
RePer and SRPO emphasize controlled reflection granularity, balancing reflection step count for maximum gain; rewards in SRPO penalize redundancy and reward effective self-correction or brevity.
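The cost side of the debater trade-off can be illustrated with a simple call-count model. The accounting below (one Actor call, one call per debater per round, one Judge call) is an assumption made for illustration, not the count reported for MAR.

```python
def calls_per_attempt(n_debaters: int, rounds: int,
                      actor_calls: int = 1, judge_calls: int = 1) -> int:
    """Assumed LLM-call count for one reflection attempt in a MAR-style loop."""
    return actor_calls + n_debaters * rounds + judge_calls


# With the round count fixed, each extra debater adds a constant number of calls.
costs = [calls_per_attempt(n, rounds=2) for n in range(1, 6)]
increments = [b - a for a, b in zip(costs, costs[1:])]
print(costs)       # call counts for 1..5 debaters
print(increments)  # constant step, i.e. linear growth
```

If accuracy gains flatten as debaters are added while this count keeps growing by a fixed step, the marginal cost per point of accuracy rises, which is the diminishing-returns argument in the list above.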

The reported ablations support dual-path or multi-agent separation as important for error recovery, interpretable feedback, and sample efficiency.

5. Transparency, Interpretability, and Knowledge Transfer

A key virtue of dual-model reflexion is transparency: Critic outputs, such as structured feedback or explicit rationale, enable stepwise inspection and diagnosis of where reflection or correction failed. In REFLEX, the Critic’s diagnosis, pre-repair, makes each code mutation auditable; persistent Skill Memories facilitate knowledge transfer across independent runs and tasks (Wang, 15 Jun 2026). In educational/automated evaluation contexts (e.g., DARS), the Critic’s feedback localizes errors to specific rubric components, underpinning explainable AI recommendations (Li et al., 26 Feb 2025).

Skill Memory and cross-task transfer evidence in REFLEX and RePer further demonstrate that decoupled architectures promote reusable knowledge primitives, improving convergence in new or related domains.
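A structured diagnosis and a persistent memory of past repairs could look roughly like the sketch below. The `Diagnosis` fields, the `SkillMemory` class, and the word-overlap retrieval are hypothetical illustrations; the text does not specify REFLEX's actual schema or retrieval method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical structured Critic output, kept so each repair is auditable."""
    symptom: str
    suspected_cause: str
    suggested_fix: str


class SkillMemory:
    """Toy memory: stores diagnoses and retrieves them by word overlap on the symptom."""

    def __init__(self) -> None:
        self._entries: List[Diagnosis] = []

    def add(self, diagnosis: Diagnosis) -> None:
        self._entries.append(diagnosis)

    def retrieve(self, query: str, k: int = 1) -> List[Diagnosis]:
        words = set(query.lower().split())
        ranked = sorted(
            self._entries,
            key=lambda d: len(words & set(d.symptom.lower().split())),
            reverse=True,  # most overlapping symptom first
        )
        return ranked[:k]


memory = SkillMemory()
memory.add(Diagnosis("lander tilts before touchdown",
                     "angle term under-weighted", "increase angle damping"))
memory.add(Diagnosis("lander runs out of fuel",
                     "main engine fires too often", "add a firing threshold"))

# A later run with a similar symptom can reuse the earlier diagnosis.
best = memory.retrieve("lander tilts on final approach")[0]
```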

6. Application Domains and Generalizations

Dual-model reflexion is instantiated in:

Multi-hop text QA and program synthesis (MAR), with empirical gains over baseline and single-agent reflexion (Ozer et al., 23 Dec 2025).
Automated code synthesis from behavioral evidence, enabling diagnoses that generalize across mutation histories and tasks (REFLEX) (Wang, 15 Jun 2026).
Multimodal reasoning (SRPO) and perceptual refinement (RePer), where policy and critic models alternate over language-image pairs or code/image RL loops (Wan et al., 2 Jun 2025, Wei et al., 9 Apr 2025).
Automated student answer scoring, where iterative Reasoner-Critic dialog provides localized remediation and explanation (DARS) (Li et al., 26 Feb 2025).
Complex embodied or web agents (DuSAR), where holistic and local strategies are interleaved via reflective assessment (Zhang et al., 9 Dec 2025).

These paradigms extend naturally to visual question answering, robotics (policy+feasibility check), and interactive assistants, though actual generalization is evaluated per system and domain. In some systems (e.g., DuSAR), both strategies are internal to a single model but functionally distinct; in others, actual model separation is enforced.

7. Limitations and Open Challenges

Documented limitations include:

Increased inference cost, scaling linearly with the number of Critic/Actor calls or reflection turns (e.g., in MAR the LLM-call count grows with the number of debaters and debate rounds).
Quality bottlenecked by Critic accuracy; erroneous reflections can mislead refinement, with error rates observed (e.g., 36% of DARS reflections inaccurate) (Li et al., 26 Feb 2025).
Stopping criteria, credit assignment to individual reflection steps, and optimal hyperparameter settings are typically handcrafted.
Transfer to video, 3D, or persistent learning settings remains open.
For small models, external demonstrations or traces improve holistic reflection, but the default (zero-shot, demonstration-free) dual-strategy suffices for larger models (Zhang et al., 9 Dec 2025).
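To illustrate what a handcrafted stopping criterion typically looks like, the sketch below stops on a turn budget or when the Critic's score has stopped improving. The rule, the `should_stop` name, and all thresholds are assumptions for illustration and are not taken from any of the cited systems.

```python
from typing import List


def should_stop(scores: List[float], max_turns: int = 5,
                patience: int = 2, min_delta: float = 0.01) -> bool:
    """Handcrafted rule: stop on budget exhaustion or a plateau in Critic scores.

    scores holds one Critic score per completed reflection turn, oldest first.
    """
    if len(scores) >= max_turns:
        return True   # turn budget exhausted
    if len(scores) <= patience:
        return False  # not enough history to detect a plateau
    best_before = max(scores[:-patience])
    recent_best = max(scores[-patience:])
    # Stop if the last `patience` turns did not beat the earlier best by min_delta.
    return recent_best < best_before + min_delta


improving = should_stop([0.5, 0.6, 0.7])
plateaued = should_stop([0.7, 0.7, 0.7])
exhausted = should_stop([0.1, 0.2, 0.3, 0.4, 0.5])
```

Every constant here is a tuning knob set by hand, which is the limitation the list above points to: a learned stopping policy would replace these thresholds with a decision conditioned on the task and the reflection history.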

A plausible implication is that further advances in critic learning, adaptive debate/summarization, and meta-reflection (e.g., learning when and how to reflect) could increase both efficacy and efficiency in dual-model reflexion. Cross-domain evaluation, lightweight critics, and integration of uncertainty estimation for Critic modules are active areas of study.