Dual-Mind World Model (DMWM)
- DMWM is a dual-process framework combining fast, RSSM-based inference with slow, logic-integrated reasoning for robust long-horizon planning.
- It employs an inter-system feedback mechanism to blend statistical modeling with symbolic logic, enhancing logical consistency and sample efficiency.
- Empirical evaluations demonstrate improved performance in reinforcement learning, robot control, and network scheduling with increased interpretability.
The Dual-Mind World Model (DMWM) is a framework for learning, imagination, and long-horizon planning in partially observable domains, directly inspired by dual-process theories in cognitive science. DMWM integrates a fast, intuitive dynamics model—System 1—based on Recurrent State-Space Model (RSSM) architectures, with a slow, logical reasoning engine—System 2—implemented as a Logic-Integrated Neural Network (LINN). Through an inter-system feedback mechanism, the model achieves long-term policy learning characterized by logical consistency, substantial sample efficiency, and robust generalization, particularly in reinforcement learning (RL) and model-based planning scenarios (Wang et al., 11 Feb 2025). Variants and extensions have been proposed for tasks ranging from automated scheduling in cyber-physical systems (Dutta et al., 4 Feb 2026), to physical control and visual imagination in robotics (Chi et al., 23 Jun 2025), and to wireless network scheduling with explicit logical constraints (Wang et al., 28 Oct 2025). This entry details the principles, technical methodology, optimization strategies, and empirical properties of DMWM grounded in the primary research literature.
1. Theoretical Foundations and Dual-Process Architecture
The DMWM is built on the dual-process cognitive paradigm, wherein human cognition is posited to alternate between a rapid, heuristics-based "System 1" and a deliberative, logically consistent "System 2." In DMWM, System 1 (RSSM-S1) is responsible for latent-state inference and efficient one-step transition modeling. It operates in latent space using variational inference with a recurrent neural core, closely following DreamerV3's methodology with a sequence ELBO that regularizes both reconstruction and latent dynamics:

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q}\left[\sum_{t} -\ln p(x_t \mid h_t, z_t) + \beta\,\mathrm{KL}\!\left(q(z_t \mid h_t, x_t)\,\|\,p(z_t \mid h_t)\right)\right]$$

where $h_t$ is the deterministic recurrent state, $z_t$ the stochastic latent, $x_t$ the observation, and $\beta$ weights the dynamics regularizer.
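A single RSSM-S1 transition can be sketched as follows. This is an illustrative, self-contained reconstruction, not the paper's implementation: the random matrices stand in for learned parameters, the `tanh` update stands in for the GRU core, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: deterministic state, stochastic latent, observation, action.
H, Z, X, A = 8, 4, 6, 2

# Randomly initialized weights stand in for learned parameters.
W_h = rng.normal(0, 0.1, (H, H + Z + A))     # recurrent core (a GRU in practice)
W_prior = rng.normal(0, 0.1, (2 * Z, H))     # prior head: p(z_t | h_t)
W_post = rng.normal(0, 0.1, (2 * Z, H + X))  # posterior head: q(z_t | h_t, x_t)

def rssm_step(h, z, a, x):
    """One System-1 transition: update h, then infer prior and posterior over z."""
    h_next = np.tanh(W_h @ np.concatenate([h, z, a]))
    prior_mu, prior_logvar = np.split(W_prior @ h_next, 2)
    post_mu, post_logvar = np.split(W_post @ np.concatenate([h_next, x]), 2)
    # Reparameterized sample from the posterior.
    z_next = post_mu + np.exp(0.5 * post_logvar) * rng.normal(size=Z)
    # Closed-form KL(q || p) between diagonal Gaussians: the ELBO's dynamics term.
    kl = 0.5 * np.sum(
        prior_logvar - post_logvar
        + (np.exp(post_logvar) + (post_mu - prior_mu) ** 2) / np.exp(prior_logvar)
        - 1.0
    )
    return h_next, z_next, kl

h, z = np.zeros(H), np.zeros(Z)
h, z, kl = rssm_step(h, z, a=np.zeros(A), x=rng.normal(size=X))
```

In training, the reconstruction term of the ELBO would be computed from a decoder over `(h, z)`, and the KL term above would be summed over the sequence.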
System 2 (LINN-S2) integrates symbolic logic into neural computation. It encodes state and action embeddings as logic vectors and applies differentiable analogs of logical operators—conjunction (AND), disjunction (OR), negation (NOT), and implication (IMPLY)—to impose hierarchical, temporal logical constraints over imagined trajectories. Logical inference is enforced both locally (single-step implications) and globally (implication chains spanning the imagined horizon) (Wang et al., 11 Feb 2025, Wang et al., 28 Oct 2025).
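The differentiable logical operators can be illustrated with a product t-norm relaxation over truth scores in [0, 1]. Note this is a minimal stand-in: the paper's LINN learns neural modules for these operators rather than using fixed algebraic forms.

```python
import numpy as np

# Fixed t-norm relaxations of the four operators (LINN learns neural analogs).
def l_and(a, b):   return a * b
def l_or(a, b):    return a + b - a * b
def l_not(a):      return 1.0 - a
def l_imply(a, b): return l_or(l_not(a), b)   # a -> b  ==  NOT(a) OR b

def chain_consistency_loss(scores):
    """Penalize violations of the step-wise implication chain over an imagined
    horizon: each step's logic score should imply its successor's."""
    steps = [l_imply(scores[t], scores[t + 1]) for t in range(len(scores) - 1)]
    return float(np.mean([1.0 - s for s in steps]))

perfect = chain_consistency_loss([1.0, 1.0, 1.0])   # fully consistent chain
broken = chain_consistency_loss([1.0, 0.0])         # hard violation
```

A fully consistent chain yields zero loss, while a transition whose antecedent holds but whose consequent fails is maximally penalized; intermediate truth scores give a graded, differentiable signal.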
2. Inter-System Feedback and Logical Consistency Enforcement
The hallmark of DMWM is a bidirectional feedback circuit between Systems 1 and 2. During standard training, transitions from real interaction are first modeled by RSSM-S1 and then passed to LINN-S2, which estimates the logical inference loss over the labeled implication chain and updates its internal rules (Eq. 12, (Wang et al., 11 Feb 2025)).
Conversely, during imagination rollouts, System 2's logical modules evaluate the logical consistency of candidate state transitions and inject a differentiable logic-consistency score back into the RSSM-S1 ELBO as an additional regularization term.
This integration constrains latent exploration and policy optimization to respect both learned statistical structure and explicit logical conditions, thereby reducing cumulative error over long-horizon imagination.
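Schematically, the feedback-augmented System-1 objective combines the variational and logic terms (the symbols $\lambda_{\text{logic}}$ and $\mathcal{L}_{\text{logic}}$ are illustrative notation, not the paper's):

```latex
\mathcal{L}_{\text{S1}} \;=\; \mathcal{L}_{\text{ELBO}} \;+\; \lambda_{\text{logic}}\,\mathcal{L}_{\text{logic}},
```

where $\mathcal{L}_{\text{logic}}$ aggregates the implication-chain consistency penalties produced by LINN-S2 over the imagined horizon, and $\lambda_{\text{logic}}$ controls how strongly logical structure constrains latent exploration.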
3. Detailed Methodology: Model Components and Training
The DMWM training objective is a compositional loss that integrates:
- RSSM-style variational loss with logic feedback
- LINN logical regularization and deep implication-chain penalties
- Standard RL or model-predictive control objectives governing the action-planning component
Training employs a logic curriculum: logic-regularization weights are initially set to zero (for warm-up) and ramped linearly to their target values, while the reasoning depth in the LINN-S2 module is progressively increased, supporting more advanced implication chains and greater temporal abstraction. Systems 1 and 2 are alternately updated alongside the policy head in a loop consisting of real-world data collection, imagination rollouts, and logic-informed policy improvement (Wang et al., 11 Feb 2025).
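The curriculum described above can be sketched as a pair of schedules. The linear-ramp shape follows the text; the specific step counts and the maximum depth are assumed for illustration.

```python
def logic_weight(step, warmup=10_000, ramp=40_000, target=1.0):
    """Logic-regularization weight: zero during warm-up, then a linear ramp
    to `target` over `ramp` steps (step counts are assumed, not the paper's)."""
    if step < warmup:
        return 0.0
    return min(target, target * (step - warmup) / ramp)

def reasoning_depth(step, base=1, max_depth=4, every=25_000):
    """Progressively deepen LINN-S2 implication chains (interval assumed)."""
    return min(max_depth, base + step // every)
```

Each outer training iteration would then collect real data, run imagination rollouts, and update Systems 1 and 2 and the policy head using `logic_weight(step)` and `reasoning_depth(step)` for the current step.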
4. Experimental Evaluation and Empirical Properties
DMWM exhibits consistent improvements in logical consistency, sample efficiency, and planning depth compared to state-of-the-art RSSM-based world models and hierarchical baselines:
| Metric | DMWM | DreamerV3 | Hieros | HRSSM |
|---|---|---|---|---|
| Logical consistency (H=30) | 0.723±0.039 | 0.643±0.131 | 0.689±0.113 | 0.695±0.087 |
| Trial efficiency (500 trials) | 5.5× baseline | – | – | – |
| Data efficiency (1e5 steps) | +32% | – | – | – |
| Long-term imagination (H>30) | +120% | – | – | – |
Removal of System 2 logic feedback induces an 8% drop in logical consistency and a 25% decrease in long-horizon average return. DMWM's feedback mechanism also improves interpretability by providing logic-regularized embeddings, making trajectory predictions traceable and enabling validation of logical correctness over planning horizons (Wang et al., 11 Feb 2025). Related scheduling and control applications apply the same dual-mind principle to outperform both deep RL and heuristic baselines while maintaining symbolic transparency (Dutta et al., 4 Feb 2026, Wang et al., 28 Oct 2025).
5. Extensions and Domain-Specific Implementations
Domain-adaptive forms of DMWM have been presented:
- Networked Systems: Scheduling with real-time digital twins, merging lightweight heuristics (Fast Mind) with symbolic planners (Slow Mind) and ICN filtering to satisfy scheduling constraints, yielding superior throughput and deadline compliance (Dutta et al., 4 Feb 2026).
- Wireless Networks: End-to-end differentiable planning over symbolically-constrained RSSM rollouts for age-of-information minimization, with strong results in urban mmWave V2X scenarios and significant gains in generalization and adaptation to unseen topologies (Wang et al., 28 Oct 2025).
- Robot Control: Coupled diffusion models (for visual imagination and high-frequency action planning) aligned via a cross-modal matcher, with the dual-mind paradigm enabling closed-loop manipulation and latent risk assessment (Chi et al., 23 Jun 2025).
DMWM also shares deep conceptual alignment with dual-process theory-of-mind (ToM) models (Manir et al., 10 Sep 2025), showing that combining a habitual inference engine with a meta-adaptive reasoner can reproduce human-like bias, context sensitivity, and generalization across combinatorial tasks.
6. Strengths, Limitations, and Future Directions
DMWM establishes a new standard for logical consistency over long-horizon latent imagination, substantially advances sample efficiency, and adds interpretability through explicit logic regularization. These gains come with added computational complexity (notably in the logic-reasoning chain) and possible tradeoffs in the depth of logical inference, with overly deep reasoning chains potentially introducing symbolic noise and increased computational cost (Wang et al., 28 Oct 2025). Future extensions may incorporate more flexible or scalable symbolic reasoning, end-to-end integration with success classifiers, or deeper integration with LLMs for narrative planning.
DMWM’s two-system architecture, feedback coupling, and logic-regularized training have established its empirical superiority and generality for complex RL and planning domains requiring both rapid statistical inference and explicit logical reasoning (Wang et al., 11 Feb 2025, Wang et al., 28 Oct 2025, Dutta et al., 4 Feb 2026).