Human Cognition in Machines: A Unified Perspective of World Models

Published 17 Apr 2026 in cs.RO, cs.AI, cs.CV, and cs.ET | (2604.16592v1)

Abstract: This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost "human-like" cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles in Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.

Authors (22)

Summary

The paper's main contribution is a taxonomy aligning World Models with cognitive functions, stressing underexplored intrinsic motivation and meta-cognition.
It introduces a unified modular framework that integrates multi-modal perception, memory, language, and reasoning through a systematic mapping to Cognitive Architecture Theory.
The study reveals significant gaps in current WM designs and suggests harnessing agentic, epistemic approaches for robust human-like machine cognition.

A Unified Perspective on World Models: Integrating Human and Machine Cognition

Introduction

"Human Cognition in Machines: A Unified Perspective of World Models" (2604.16592) presents a systematic analysis of World Models (WMs) by grounding their functions in Cognitive Architecture Theory (CAT). The authors establish a novel taxonomy of WMs based on the cognitive functions they emulate (memory, perception, language, reasoning, imagination, motivation, and meta-cognition), in contrast to prior taxonomies that focus on application or architecture. Special emphasis is placed on two under-researched cognitive pillars: motivation (especially intrinsic motivation) and meta-cognition. The report also introduces Epistemic World Models as a new class oriented toward agentic scientific discovery in structured knowledge environments.

Figure 1: The survey studies the convergence of human cognition, machine cognition, and World Models, delineating their interdependencies.

Unified Cognitive Architecture: Taxonomy of World Models

The authors ground their taxonomy in Newell’s Cognitive Architecture Theory, mapping WM innovations to core cognitive functions. This approach exposes research coverage and gaps across the field. The proposed taxonomy, visualized in the paper, maps each major WM advancement explicitly onto these cognitive functions.

Figure 2: The taxonomy of World Models aligns with component parts of cognitive architecture theory, emphasizing specific cognitive functions advanced by each WM design.

A critical assertion is that most state-of-the-art WMs cover only some cognitive functions and fail to integrate all the facets identified in CAT, particularly intrinsic motivation and meta-cognition. This gap appears in both latent video/embodied WMs and agentic frameworks.
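
The coverage analysis described above can be pictured as a simple lookup: each model is tagged with the cognitive functions it advances, and any function that no model covers is a gap. The following is a minimal sketch of that idea. The model names and their tags are hypothetical placeholders chosen for illustration, not entries from the paper's taxonomy.

```python
# The seven cognitive functions the paper associates with CAT.
CAT_FUNCTIONS = (
    "memory", "perception", "language", "reasoning",
    "imagination", "motivation", "meta-cognition",
)

# Hypothetical placeholder models and tags (NOT taken from the paper).
EXAMPLE_TAGS = {
    "video_wm_a": {"memory", "perception", "imagination"},
    "embodied_wm_b": {"memory", "perception", "reasoning", "imagination"},
    "epistemic_wm_c": {"memory", "language", "reasoning", "meta-cognition"},
}


def coverage_counts(tags):
    """Count how many models advance each cognitive function."""
    return {f: sum(f in fs for fs in tags.values()) for f in CAT_FUNCTIONS}


def uncovered(tags):
    """Return the functions that no model in `tags` advances."""
    counts = coverage_counts(tags)
    return [f for f in CAT_FUNCTIONS if counts[f] == 0]


print(coverage_counts(EXAMPLE_TAGS))
print("gaps:", uncovered(EXAMPLE_TAGS))
```

With these placeholder tags, the only function left uncovered is motivation, which mirrors the kind of gap the taxonomy is meant to surface.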

The Unified World Model Framework

The core proposal is a modular WM framework that operationalizes representation and generation across all CAT cognitive functions. This framework calls for:

Multi-modal input perception channels.
Latent state-memory for robust downstream reasoning and imagination.
Language interfaces for human-in-the-loop alignment and symbolic abstraction.
Imagination mechanisms supporting hypothetical reasoning and sim-to-real transfer.
Domain-specific reasoning modules for multi-scale planning.
Intrinsic and structured mechanisms for motivation, beyond hand-crafted extrinsic rewards.
Meta-cognitive scaffolds: self-monitoring, self-evaluation, and self-control.

Figure 3: The proposed Unified World Model structure assembles all CAT-based cognitive functions, serving as a conceptual research roadmap.
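
One way to read the component list above is as a software interface in which each cognitive function is a pluggable module. The sketch below is a hedged illustration only: the paper describes the framework conceptually, so the class name, field names, and call signatures are all assumptions, and the language and reasoning modules are omitted to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# All names below are hypothetical; the paper does not prescribe this interface.


@dataclass
class UnifiedWorldModel:
    perceive: Callable[[Dict[str, float]], List[float]]    # observation -> latent
    imagine: Callable[[List[float], int], List[float]]     # (latent, action) -> next latent
    motivate: Callable[[List[float]], float]               # latent -> intrinsic value
    confident: Callable[[List[float]], bool]               # meta-cognitive self-check
    memory: List[List[float]] = field(default_factory=list)

    def step(self, observation: Dict[str, float], actions: List[int]) -> Optional[int]:
        """Pick the action whose imagined outcome has the highest value."""
        latent = self.perceive(observation)
        self.memory.append(latent)            # latent state-memory
        if not self.confident(latent):        # self-control: abstain when unsure
            return None
        return max(actions, key=lambda a: self.motivate(self.imagine(latent, a)))


# Toy modules standing in for learned components (assumed for illustration).
wm = UnifiedWorldModel(
    perceive=lambda obs: [obs["position"]],
    imagine=lambda z, a: [z[0] + a],
    motivate=lambda z: -abs(z[0] - 3.0),      # prefers latents near 3.0
    confident=lambda z: abs(z[0]) < 10.0,     # refuses far-out-of-range inputs
)
print(wm.step({"position": 1.0}, actions=[-1, 0, 1]))
print(wm.step({"position": 50.0}, actions=[-1, 0, 1]))
```

The first call selects the action that moves the latent toward the preferred value; the second returns `None` because the self-check rejects the input, which is the simplest possible form of self-control.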

Video and Embodied World Models: Architectural Trends

Video World Models

State-of-the-art video WMs excel at spatiotemporal consistency, multi-scale perception, and memory. Architectures leverage joint-embedding predictive paradigms (e.g., JEPA, V-JEPA), autoregressive rollouts, and diffusion-based generation. Long-range temporal coherence is achieved via hierarchical context expansion and targeted engineering of memory bottlenecks, e.g., VideoWeave’s synthetic long-context splicing and Helios’s context compression.

Figure 4: Archetypal architectures for video World Models, including autoregressive, bidirectional, promptable/action-conditioned, and geometric frameworks.

Physics alignment is increasingly targeted through reward-regulated post-training and inference-time filtering. For example, Le et al. employ verifiable Newtonian rewards, and VJEPA-2 fuses physics critics to keep generative rollouts dynamically plausible.
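
The inference-time filtering idea can be illustrated with a toy best-of-N selection: several candidate trajectories are scored by how far they deviate from a known physical law, and the most plausible one is kept. This is a minimal sketch under stated assumptions, not the method of Le et al. or of VJEPA-2. The free-fall setting, the function names, and the two candidate trajectories are invented for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed toy setting)


def newtonian_residual(heights, dt):
    """Mean squared gap between finite-difference acceleration and -G."""
    accels = [
        (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt**2
        for i in range(1, len(heights) - 1)
    ]
    return sum((a + G) ** 2 for a in accels) / len(accels)


def filter_rollouts(candidates, dt):
    """Return the candidate trajectory with the smallest physics residual."""
    return min(candidates, key=lambda h: newtonian_residual(h, dt))


dt = 0.1
# Candidate 1: heights that follow h(t) = 10 - 0.5 * G * t^2 (free fall).
free_fall = [10.0 - 0.5 * G * (k * dt) ** 2 for k in range(6)]
# Candidate 2: an object drifting down at constant velocity (implausible here).
floating = [10.0 - 0.2 * k for k in range(6)]

best = filter_rollouts([floating, free_fall], dt)
print(best is free_fall)
```

The same residual could in principle serve as a verifiable reward during post-training; here it is used only to rank finished rollouts.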

Embodied World Models

Embodied WMs extend the paradigm to physical tasks, requiring detailed contact geometry encoding, persistent environment memory, and physically plausible reasoning. Applications span robotics, navigation, and autonomous driving. Approaches like BEVWorld leverage unified BEV latents for cross-modal forecasting, while geometric WMs (e.g., PointWorld, OccWorld) anchor 3D structure at the core.

Figure 5: Dominant architectures in embodied World Models, highlighting navigation, manipulation, and multi-modal integration.

In embodied domains, planning and imagination are often fused, with agents training policies purely in latent imagination space (e.g., Dreamer/Nav), then deploying in real or sim-to-real settings. Language-mediated embodied reasoning is operationalized via VLA models (e.g., LingBot-VLA) and closed-loop driving architectures (e.g., DrivingGPT).
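
Training a policy "purely in latent imagination" means the policy is evaluated against the model's predicted transitions and rewards, never against the real environment. The sketch below illustrates this with a one-dimensional latent and a one-parameter policy. It is a simplification under stated assumptions: the dynamics and reward functions are stand-ins for learned model heads, and a grid search over the policy gain replaces the gradient-based actor-critic learning used by Dreamer-style agents.

```python
def imagined_return(gain, z0, goal, horizon, dynamics, reward):
    """Roll a proportional policy forward inside the model and sum rewards."""
    z, total = z0, 0.0
    for _ in range(horizon):
        action = max(-1.0, min(1.0, gain * (goal - z)))  # bounded action
        z = dynamics(z, action)                          # imagined transition
        total += reward(z, goal)
    return total


def train_in_imagination(gains, z0, goal, horizon, dynamics, reward):
    """Pick the policy gain with the best imagined return (grid search)."""
    return max(
        gains,
        key=lambda g: imagined_return(g, z0, goal, horizon, dynamics, reward),
    )


# Stand-ins for a learned latent dynamics model and reward head (assumed).
def model_dynamics(z, action):
    return z + 0.5 * action


def model_reward(z, goal):
    return -abs(z - goal)


best_gain = train_in_imagination(
    gains=[0.0, 0.5, 1.0, 2.0], z0=0.0, goal=2.0, horizon=10,
    dynamics=model_dynamics, reward=model_reward,
)
print(best_gain)
```

Only after this selection step would the chosen policy be deployed in a real or simulated environment, where any mismatch between the model and reality becomes the sim-to-real gap.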

Epistemic World Models and Global Workspaces

The paper advances the field by formalizing Epistemic WMs, where the environment is a structured knowledge space. Here, agents manipulate a dynamic state via reasoning, tool-use, and human-in-the-loop feedback, instantiated as explicit global workspaces. Major projects (Gemini Co-scientist, OpenAI Prism, OmniScientist) implement these principles, supporting scientific discovery and verification tasks through multi-agent deliberation, persistent external memory, and procedural workflows.

Figure 6: Canonical architectures and global workspace integrations seen in agentic/epistemic World Models oriented to scientific discovery.

This domain also offers a pathway for integrating meta-cognition and intrinsic motivation—agents critique, evaluate, and route their own cognitive processes, and motivation can be more flexibly defined (e.g., via active inference and self-assigned objectives).
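
A global workspace of the kind described above can be sketched as a loop in which agents compete to write to a shared state and the winning entry is broadcast to everyone. The example below is a hypothetical minimal version. The class, the agent functions, and the salience scores are assumptions for illustration and do not describe the architecture of any of the named projects.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Proposal = Tuple[float, str]  # (salience score, content)


@dataclass
class GlobalWorkspace:
    """Hypothetical minimal workspace: compete, select, broadcast, remember."""
    agents: List[Callable[[List[str]], Proposal]]
    history: List[str] = field(default_factory=list)  # persistent external memory

    def cycle(self) -> str:
        # Every agent reads the shared history and submits one proposal.
        proposals = [agent(self.history) for agent in self.agents]
        # The most salient proposal wins and is broadcast by appending it
        # to the history that all agents read on the next cycle.
        _, winner = max(proposals, key=lambda p: p[0])
        self.history.append(winner)
        return winner


def hypothesizer(history):
    return (0.6, "hypothesis: X causes Y")


def critic(history):
    # Meta-cognitive role: speaks up only when an unreviewed hypothesis exists.
    if history and history[-1].startswith("hypothesis"):
        return (0.9, "critique: test X against a control")
    return (0.1, "critique: nothing to review")


ws = GlobalWorkspace(agents=[hypothesizer, critic])
print(ws.cycle())
print(ws.cycle())
```

The critic agent is where meta-cognition enters in this sketch: its salience depends on the state of the workspace, so evaluation is triggered by the system's own prior output rather than by an external signal.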

Research Gaps: Motivation and Meta-Cognition

Motivation: The authors claim that almost all practical WMs rely on extrinsic, hand-tuned reward functions. There is minimal adoption of intrinsic motivation principles such as empowerment or active inference. Empirical utilization of active inference has been demonstrated in only isolated cases, despite its strong alignment with human motivation as formulated in theoretical neuroscience.
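
To make the contrast with hand-tuned rewards concrete, the sketch below scores actions by expected information gain, the epistemic quantity commonly associated with active inference. It is a minimal illustration under stated assumptions: it covers only the information-seeking term, not a full expected-free-energy computation, and the two-state example and action names are invented rather than drawn from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """Mutual information between hidden state and observation.

    prior[s] is the belief over hidden states; likelihood[s][o] is the
    probability of observation o in state s under the action considered.
    """
    n_states, n_obs = len(prior), len(likelihood[0])
    marginal = [
        sum(prior[s] * likelihood[s][o] for s in range(n_states))
        for o in range(n_obs)
    ]
    conditional = sum(prior[s] * entropy(likelihood[s]) for s in range(n_states))
    return entropy(marginal) - conditional


prior = [0.5, 0.5]                   # two equally likely hypotheses
look = [[1.0, 0.0], [0.0, 1.0]]      # observation reveals the state
wait = [[0.5, 0.5], [0.5, 0.5]]      # observation is uninformative
gains = {
    "look": expected_information_gain(prior, look),
    "wait": expected_information_gain(prior, wait),
}
print(max(gains, key=gains.get))
```

No external reward appears anywhere in this computation: the agent prefers "look" solely because that action is expected to reduce its own uncertainty.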

Meta-Cognition: Latent WMs are largely devoid of self-monitoring or self-control mechanisms. Only agentic/epistemic WMs, via explicit global workspaces and multi-agent protocol layers, begin to instantiate forms of meta-cognitive control and reflection, which are central to human cognition and learning.

Implications and Future Trajectories

The explicit mapping of WM progress and missing functionality has several far-reaching implications:

Full parity with human cognition (even functionally) in AI agents will require integrating all CAT-outlined cognitive components in an operational, synergistic way.
Progress in video and embodied domains should increasingly leverage agentic, workspace-based coordination to drive advances in intrinsic motivation, meta-cognition, and cross-domain transfer.
Epistemic World Models show early promise for generalizable autonomous agents, especially in scientific and collaborative workflows. The authors predict further convergence between latent and epistemic WM design paradigms.
The field is likely to see cross-pollination between neuro-inspired concepts (active inference, global workspace theory) and practical architectures (MoE, world-aware routing, persistent external memory), with the aim of producing robust meta-cognitive and self-motivating AI.

Conclusion

This report proposes a theoretically unified, CAT-informed WM taxonomy. By grounding claims in cognitive science, it reveals that intrinsic motivation and meta-cognition remain significantly under-explored. Distinguishing between video, embodied, and epistemic WMs, it positions integrated global workspaces and active inference as promising research targets. The Unified World Model it outlines provides both a structured research agenda and a vocabulary for evaluating anthropomorphic claims in AI, emphasizing that human-level cognitive emulation in machines cannot be achieved without explicitly addressing all major cognitive functions, especially those currently under-researched.