Reflection-Centered Backbones (ReflAct)

Updated 30 March 2026

The paper presents the ReflAct backbone that integrates explicit reflection using fused state and goal representations to reliably guide LLM actions.
Reflection-centered Backbones are advanced reasoning frameworks that continuously align an LLM’s internal belief state with fixed goals for improved task performance.
Empirical evaluations on ALFWorld demonstrate that ReflAct outperforms prior methods, achieving a 93.3% success rate and mitigating error propagation.

Reflection-centered Backbones (ReflAct) are a class of reasoning backbones for LLM agents that enforce explicit, ongoing grounding of the agent’s internal belief state with respect to a fixed goal at every timestep. Designed as a direct response to the limitations of prior methods like ReAct, which interleave "thought" and "action" but frequently produce ungrounded or incoherent reasoning, ReflAct introduces a structured reflection mechanism that systematically fuses the current belief and task objective before every action. This architectural change has led to substantial empirical gains in complex interactive environments, notably surpassing established backbones on benchmarks such as ALFWorld.

1. Theoretical Foundations and Formalization

The agent–environment interaction in ReflAct is formulated as a partially observable Markov decision process (POMDP) $\mathcal{M} = \langle \mathcal{S},\mathcal{A},\mathcal{O},\mathcal{P},\mathcal{R}\rangle$, where $\mathcal{S}$ is the set of hidden environment states, $\mathcal{A}$ the action space, $o_t\in\mathcal{O}$ the observations, $\mathcal{P}$ the transition function, and $\mathcal{R}$ the reward function; $g\in\mathcal{U}$ is the fixed natural-language goal instruction. Because the true state is not directly observable, the agent maintains an internal belief state at time $t$, which this formulation also denotes $s_t$. It summarizes the full interaction history $h_t$ and is constructed as

$s_t = B(h_t)$

in which $B(\cdot)$ is a (potentially implicit) belief-state estimator realized by the LLM.

Central to ReflAct is the reflection function $R(s_t, g)$, which produces a reflection vector $r_t$ by combining state and goal encodings and their interactions:

$r_t = R(s_t, g) = \phi_s(s_t) + \phi_g(g) + \Psi(\phi_s(s_t), \phi_g(g)),$

where $\phi_s(\cdot)$ and $\phi_g(\cdot)$ are learned encoders for the state and goal, and $\Psi$ is a fusion network (such as cross-attention or an MLP) that captures their interactions. The vector $r_t$ forms the context for explicit natural-language reflection and subsequent action selection.
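The additive form of $r_t$ can be illustrated with a minimal sketch. It assumes the encoders have already produced fixed-size vectors, and it uses an element-wise product as a hypothetical stand-in for $\Psi$ (the text above suggests cross-attention or an MLP instead). The names `reflect_vector` and `elementwise_product` are illustrative and do not come from the paper.

```python
from typing import Callable, List

Vector = List[float]


def reflect_vector(
    state_emb: Vector,
    goal_emb: Vector,
    fuse: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Compute r_t = phi_s(s_t) + phi_g(g) + Psi(phi_s(s_t), phi_g(g))."""
    if len(state_emb) != len(goal_emb):
        raise ValueError("state and goal embeddings must share a dimension")
    fused = fuse(state_emb, goal_emb)
    if len(fused) != len(state_emb):
        raise ValueError("fusion output must match the embedding dimension")
    # Sum the three terms coordinate by coordinate.
    return [s + g + f for s, g, f in zip(state_emb, goal_emb, fused)]


def elementwise_product(a: Vector, b: Vector) -> Vector:
    """Hypothetical stand-in for the fusion network Psi (an assumption)."""
    return [x * y for x, y in zip(a, b)]


phi_s = [1.0, 0.0, 2.0]  # toy value for phi_s(s_t)
phi_g = [0.5, 1.0, 1.0]  # toy value for phi_g(g)
r_t = reflect_vector(phi_s, phi_g, elementwise_product)
print(r_t)  # [2.0, 1.0, 5.0]
```

Passing the fusion function as an argument keeps the additive structure fixed while leaving the choice of $\Psi$ open, which mirrors how the formula separates the two encoders from their interaction term.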

At each step, $r_t$ is provided to the LLM with a prompt to "reflect on your current state in relation to the task goal," yielding a reflection text $\kappa_t$ that grounds the agent's next move. The action-selection policy $\pi_\theta^{\mathrm{act}}$ is thus conditioned on both $r_t$ and $\kappa_t$, so that each action depends on both the goal and the current state (Kim et al., 21 May 2025).

2. Algorithmic Structure

The ReflAct backbone follows a cyclical sequence wherein grounding and reflection precede every decision point. The core workflow is:

initialize observation o_0, history h_0 ← {g, o_0}
infer initial belief s_0 ← B(h_0)
for t = 0 to T−1 do
  embed state φ_s ← φ_s(s_t)
  embed goal  φ_g ← φ_g(g)
  fuse        Ψ_out ← Ψ(φ_s, φ_g)
  reflection_ctxt r_t ← φ_s + φ_g + Ψ_out
  κ_t ← LLM_reflect(r_t)
  a_t ← LLM_act(r_t, κ_t)
  execute(a_t) → o_{t+1}
  h_{t+1} ← h_t ⧺ [κ_t, a_t, o_{t+1}]
  s_{t+1} ← B(h_{t+1})
end for

This structure ensures three things: belief and goal context are constructed jointly, the agent is forced to reflect on this context before any action, and both the history and the internal state are updated at every loop iteration. Each action is grounded not just in past actions and observations but in an explicit, context-driven summary tied to the goal. This mitigates the compounding of errors known to afflict ungrounded chain-of-thought and limits progressive drift from the agent's actual state (Kim et al., 21 May 2025).
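The loop above can be sketched in runnable form under some simplifying assumptions. Here the reflection context is passed as plain text (a belief summary plus the goal) rather than as the vector $r_t$, the belief estimator $B$ is a toy function that keeps recent history, and the LLM calls and environment are replaced by scripted stand-ins. All function names are hypothetical.

```python
from typing import Callable, List, Tuple


def belief_from_history(history: List[str], window: int = 4) -> str:
    """Toy B(h_t): keep the most recent entries as the belief summary."""
    return " | ".join(history[-window:])


def run_episode(
    goal: str,
    first_obs: str,
    reflect: Callable[[str, str], str],        # stand-in for LLM_reflect
    act: Callable[[str, str, str], str],       # stand-in for LLM_act
    env_step: Callable[[str], Tuple[str, bool]],
    max_steps: int = 10,
) -> List[str]:
    """Run the reflect-then-act loop and return the final history."""
    history = [f"obs: {first_obs}"]
    for _ in range(max_steps):
        belief = belief_from_history(history)    # s_t = B(h_t)
        reflection = reflect(belief, goal)       # kappa_t
        action = act(belief, goal, reflection)   # a_t
        obs, done = env_step(action)             # o_{t+1}
        history += [f"reflect: {reflection}", f"act: {action}", f"obs: {obs}"]
        if done:
            break
    return history


# Scripted stand-ins so the sketch runs without an LLM or ALFWorld.
def toy_reflect(belief: str, goal: str) -> str:
    holding = "you pick up the mug" in belief
    status = "holding the mug" if holding else "not holding the mug yet"
    return f"I am {status}; the goal is to {goal}."


def toy_act(belief: str, goal: str, reflection: str) -> str:
    return "put mug in sink" if "I am holding" in reflection else "take mug"


def toy_env_step(action: str) -> Tuple[str, bool]:
    if action == "take mug":
        return "you pick up the mug", False
    return "the mug is in the sink", True


trace = run_episode(
    "put the mug in the sink", "you see a mug on the table",
    toy_reflect, toy_act, toy_env_step,
)
print("\n".join(trace))
```

In this toy run the agent takes the mug on the first step and places it on the second. The point of the sketch is the ordering: the belief is recomputed from history, the reflection is generated from belief and goal, and only then is an action chosen.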

3. Comparison with ReAct

The ReAct backbone alternates between generating "thoughts" $\tau_t$ and actions $a_t$, each sampled as follows:

$\tau_t \sim \pi_\theta^{\mathrm{thought}}(\cdot\mid c_t), \qquad a_t \sim \pi_\theta^{\mathrm{act}}(\cdot\mid c_t\oplus\tau_t),$

where $c_t = (h_t, o_t)$. However, ReAct's thoughts $\tau_t$ often lack systematic reminders of the ultimate goal, and their implicit state summaries can drift or compound errors, leading to hallucinations and misalignment.

ReflAct replaces this with a structured reflection $\kappa_t$ derived from a fusion of the current belief and goal:

$\kappa_t \sim \pi_\theta^{\mathrm{reflect}}(\cdot\mid r_t), \qquad a_t \sim \pi_\theta^{\mathrm{act}}(\cdot\mid r_t, \kappa_t).$

Because every reflection $\kappa_t$ is re-anchored in the current internal belief state $s_t$ and the fixed goal $g$, the agent is less exposed to the error propagation and gradual semantic drift that ReAct can suffer from. Recomputing $R(s_t, g)$ at every step acts as a persistent anchor, guarding against reasoning chains that lose sight of the task objective (Kim et al., 21 May 2025).
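The difference in conditioning can be made concrete with a small prompt-building sketch. It is an illustration only: the prompt layouts, the function names, and in particular the truncated history window on the ReAct side are assumptions chosen to show how a goal can fall out of context, not a description of how either method is specified.

```python
from typing import List

REFLECT_INSTRUCTION = "Reflect on your current state in relation to the task goal."


def react_context(history: List[str], observation: str, window: int = 3) -> str:
    """ReAct-style c_t = (h_t, o_t): recent history plus the latest observation.

    The fixed window is a hypothetical truncation used for illustration.
    """
    lines = history[-window:] + [f"Observation: {observation}", "Thought:"]
    return "\n".join(lines)


def reflact_context(belief: str, goal: str) -> str:
    """ReflAct-style context: belief s_t and goal g are restated every step."""
    return "\n".join([
        f"Current state: {belief}",
        f"Task goal: {goal}",
        REFLECT_INSTRUCTION,
        "Reflection:",
    ])


goal = "put a clean mug in the cabinet"
history = [
    f"Task: {goal}",
    "Action: go to sink",
    "Observation: you see a mug",
    "Action: take mug",
    "Observation: you hold the mug",
]

react_prompt = react_context(history, "the mug is dirty")
reflact_prompt = reflact_context("holding a dirty mug at the sink", goal)

print(goal in react_prompt)    # False: the task line fell outside the window
print(goal in reflact_prompt)  # True: the goal is re-inserted at each step
```

The reflection prompt does not rely on the goal surviving in the history, because it is passed in explicitly at every step alongside the belief summary.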

4. Empirical Evaluation and Benchmarking

Empirical studies of ReflAct were conducted on the ALFWorld suite of 134 household tasks. The evaluation metric was binary success rate (SR) per task. Comparative results across baselines (using the GPT-4o model) are summarized as follows:

| Backbone | Success Rate (SR) | Change vs. ReAct |
|---|---|---|
| NoThinking | 76.1% | −9.0 pts |
| Plan-and-Act | 85.8% | +0.7 pts |
| ReAct | 85.1% | (baseline) |
| ReflAct | 93.3% | +8.2 pts |

ReflAct achieved a 93.3% SR in ALFWorld, outperforming ReAct by 8.2 percentage points. On these two figures the relative gain is about 9.6% (8.2 / 85.1); the +27.7% figure sometimes attached to this result does not follow from the ALFWorld numbers alone and more plausibly reflects an average improvement over a broader set of evaluations. Furthermore, ReflAct introduced no new failure cases relative to ReAct or NoThinking: every task it failed was one where a baseline also failed. Notably, ReflAct outperformed versions of ReAct augmented with enhancement modules, supporting the centrality of backbone structure in reliable reasoning (Kim et al., 21 May 2025).
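The comparison with ReAct can be checked with a few lines of arithmetic. The sketch below uses only the rounded success rates from the table and computes both the percentage-point difference and the relative change; the helper name `gain_over` is illustrative.

```python
from typing import Dict, Tuple

# Success rates (%) copied from the ALFWorld table above.
success_rate: Dict[str, float] = {
    "NoThinking": 76.1,
    "Plan-and-Act": 85.8,
    "ReAct": 85.1,
    "ReflAct": 93.3,
}


def gain_over(baseline: float, candidate: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in %), rounded to 0.1."""
    delta = candidate - baseline
    return round(delta, 1), round(100.0 * delta / baseline, 1)


react = success_rate["ReAct"]
for name, sr in success_rate.items():
    if name != "ReAct":
        points, relative = gain_over(react, sr)
        print(f"{name}: {points:+.1f} pts, {relative:+.1f}% relative")

# With 134 tasks, a 93.3% success rate corresponds to this many solved tasks.
solved = round(0.933 * 134)
print(solved)
```

On the rounded table values, ReflAct comes out at +8.2 points and roughly +9.6% relative to ReAct, and 93.3% of 134 tasks corresponds to 125 solved tasks.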

5. Mechanisms Mitigating Compounding Errors

The principal mechanism by which ReflAct enhances agent reliability is the enforcement of explicit, up-to-date goal-state alignment at every timestep. Three key factors contribute to mitigation of misalignment and error propagation:

The agent’s internal belief is made explicit in the LLM input and is updated and re-conditioned on the task goal at every step.
Continuous reflections ($\kappa_t$) act as persistent reminders of the intended objective, counteracting decisions that are locally attractive but globally suboptimal.
Systematic re-anchoring breaks the chains of hallucination that often result from ungrounded or drifting chain-of-thought, as seen in conventional backbones.

This reflection-centered paradigm limits the accrual of errors in belief tracking and helps maintain alignment in long-horizon, partially observed domains (Kim et al., 21 May 2025).

6. Implications and Extensions

The reflection-centered structure facilitates a series of downstream and cross-domain extensions, including:

Multi-agent systems: Agents may share individual reflection vectors $r_t$ to synchronize beliefs and objectives, enhancing coordination and collective decision making.
Robotics: Integration of visual state embeddings and textual goal encodings via the fusion network enables unified control strategies for embodied agents.
Formal domains: In code generation or theorem proving, reflecting on partial proof states and the final theorem provides a pathway for reliable next-step selection.

A plausible implication is that any agent architecture requiring persistent alignment between state inference and evolving or persistent objectives may benefit from reflection-centered reasoning backbones. The formalism and empirical data suggest that backbone-level changes induce more reliable behaviors than post-hoc or surface-level enhancement modules (Kim et al., 21 May 2025).

References (1)

Kim et al. (2025). ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection.