Cognitive Imagination: Theory & Applications

Updated 4 July 2026

Cognitive imagination is the ability to internally generate and manipulate sensory-like, semantic, and causal representations for reasoning and planning.
Researchers use multimodal techniques such as fMRI decoding, VAE models, and dynamic simulations to analyze its neural and computational bases.
Applications span AI reasoning, robotics, and brain-computer interfaces, highlighting its role in decision making and creative problem solving.

Cognitive imagination denotes a family of internally generated representational processes that support reasoning, prediction, planning, and scenario construction. In recent AI work, it is defined both as the internal generation and manipulation of sensory-like representations (chiefly visual sketches) based on stored knowledge and on-the-fly hypotheses, and as “a faculty to mentally visualize coherent and holistic systems of concepts and causal links that serve as semantic contexts for reasoning, decision making and prediction” (Larabi, 16 Jul 2025, Vityaev et al., 8 Aug 2025). These formulations distinguish cognitive imagination from a narrowly pictorial notion of imagery: some systems treat it as multimodal sketch generation or latent visual simulation, while others treat it as the construction of causally organized internal contexts that guide inference, semantic verification, and action selection (Qi et al., 2019, Vityaev et al., 8 Aug 2025).

1. Conceptual scope

A recurrent distinction in the literature separates cognitive imagination from a mere “picture-in-the-head” model. “Don’t Forget Imagination!” explicitly argues that cognitive imagination is not static visual imagery, but a coherent internal context composed of concepts and causal links, used for reasoning, decision making, and prediction (Vityaev et al., 8 Aug 2025). In the machine-thinking framework of “Can Mental Imagery Improve the Thinking Capabilities of AI Systems?”, mental imagery is instead treated as a first-class reasoning component: a Cognitive Thinking Unit dispatches a high-level stimulus such as “Imagine a cup falling off a table” to a Mental Imagery Unit, receives generated images or sketches, and then inspects those images through sub-questions to refine its inferences (Larabi, 16 Jul 2025).

A second distinction concerns the relation between imagination and perception. “Predicting the imagined contents using brain activation” defines mental imagery as percept-like experiences in the absence of sensory input, and the associated fMRI results were interpreted as showing common, modality-specific, neural correlates for imagery and perception (Miyapuram et al., 2021). “Mind-to-Image” extends this line by separating imagination “from memory” and “from pure imagination,” thereby treating imagination as a decodable neural process rather than only a subjective report (Caselles-Dupré et al., 2024).

In AI, the term also appears in models that couple language, vision, memory, and planning. The LGI network presents cognitive imagination as a brain-like cycle in which language instructions trigger internal generation of visual scenarios and imagined or perceived visuals give rise to language descriptions (Qi et al., 2019). More recent work on spatial reasoning frames imagination as the internal simulation of a state $S_t$ under hypothetical transformations, formalized as $S_{t+1}=f(S_t,a_t)$, with $S_t$ encoding object geometries, positions, orientations, and inter-object relations (Lian et al., 16 Nov 2025).
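As a minimal sketch of the transition view $S_{t+1}=f(S_t,a_t)$, the Python below rolls a toy scene forward under hypothetical actions. The state layout (a dictionary of object positions), the two action kinds, and all function names are illustrative assumptions; the cited work uses a learned transition function and a richer state, not this hand-written one.

```python
import math

def rotate_z(point, angle_deg):
    """Rotate a 3D point about the z-axis by angle_deg degrees."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def transition(state, action):
    """f(S_t, a_t): apply one hypothetical transformation to every object."""
    kind, value = action
    if kind == "rotate_z":
        return {name: rotate_z(p, value) for name, p in state.items()}
    if kind == "translate":
        dx, dy, dz = value
        return {name: (x + dx, y + dy, z + dz)
                for name, (x, y, z) in state.items()}
    raise ValueError(f"unknown action kind: {kind}")

def imagine(state, actions):
    """Roll the transition forward and return every intermediate state."""
    trajectory = [state]
    for action in actions:
        state = transition(state, action)
        trajectory.append(state)
    return trajectory

# Hypothetical scene and plan (assumptions for illustration only).
scene = {"cup": (1.0, 0.0, 0.5), "table": (0.0, 0.0, 0.0)}
plan = [("rotate_z", 90.0), ("translate", (0.0, 0.0, 1.0))]
trajectory = imagine(scene, plan)
final = trajectory[-1]
print(final)
```

The original scene dictionary is never mutated, so the rollout can be discarded or compared against an alternative plan without affecting the perceived state.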

2. Neural basis and empirical decoding

The empirical study of cognitive imagination has increasingly relied on fMRI decoding. In the reward-imagery study by Miyapuram, Schultz, and Tobler, 12 participants learned arbitrary conditioned stimuli associated either with a photographic picture of a £20 note or with a scrambled picture. A bilateral ventral midbrain cluster centered on the substantia nigra was activated in both perception and imagination, and a linear SVM trained on perception-elicited midbrain $\beta$-patterns predicted imagination trials with 75% accuracy and an AUC of 0.78 (Miyapuram et al., 2021). The study therefore provided a proof of principle that internal, imagined content could be classified from a relatively small and domain-specific brain region.
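The cross-decoding logic (train on perception trials, test on imagination trials) can be sketched as follows. This is a toy illustration under stated assumptions: a nearest-centroid linear decoder stands in for the linear SVM used in the study, and the three-voxel patterns are made up, so the numbers it produces say nothing about the published results.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_linear(X, y):
    """Nearest-centroid linear decoder: w = mu1 - mu0, boundary at the midpoint."""
    mu1 = centroid([x for x, t in zip(X, y) if t == 1])
    mu0 = centroid([x for x, t in zip(X, y) if t == 0])
    w = [a - b for a, b in zip(mu1, mu0)]
    mid = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def decision(w, b, x):
    """Signed distance-like score; positive means class 1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def accuracy(scores, y):
    return sum((s > 0) == (t == 1) for s, t in zip(scores, y)) / len(y)

def auc(scores, y):
    """Probability that a random positive outscores a random negative
    (ties count as 0.5)."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical patterns: label 1 = reward picture, 0 = scrambled picture.
X_perception = [[1.0, 0.2, 0.1], [0.9, 0.1, 0.3],
                [0.1, 0.8, 0.2], [0.2, 0.9, 0.1]]
y_perception = [1, 1, 0, 0]
# Imagination trials modeled as weaker versions of the same patterns.
X_imagination = [[0.6, 0.4, 0.2], [0.4, 0.5, 0.1],
                 [0.3, 0.6, 0.2], [0.4, 0.5, 0.3]]
y_imagination = [1, 1, 0, 0]

w, b = fit_linear(X_perception, y_perception)   # fit on perception only
scores = [decision(w, b, x) for x in X_imagination]
print(accuracy(scores, y_imagination), auc(scores, y_imagination))
```

The key design point is that the decoder never sees imagination data during fitting, so above-chance performance on imagination trials indicates shared pattern structure between the two conditions.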

“Mind-to-Image” moves from classification to reconstruction. The paper reports, for the first time, a substantial visual-imagery dataset of about 6 h of scans, collected from $N=1$ healthy volunteer as a preliminary proof of concept, with weak-imagination and strong-imagination protocols (Caselles-Dupré et al., 2024). In the weak-imagination setting, 1,200 distinct surrealist images were shown for 3 s each, followed by a flash cue and 5 s of pure recall; the protocol yielded approximately 1,125 usable trials, with 75 held out for validation and about 1,050 for training. In the strong-imagination setting, verbal prompts such as “Imagine a landscape evoking fear” were presented across 10 basic emotions and 2 modalities, generating about 200 imagery trials without external ground-truth images (Caselles-Dupré et al., 2024).

The reconstruction pipeline uses GLM deconvolution to extract trialwise $\beta$-weights, a custom mask spanning visual cortex, fusiform gyrus, medial temporal loci, and prefrontal loci, and a flattened input vector of about 12K voxels. A semantic branch maps voxels to a CLIP-ViT L/14 embedding, and a perceptual branch maps voxels to a Stable Diffusion VAE latent; the loss is $\mathcal{L}=\lambda_1\mathcal{L}_{\text{contrastive}}+\lambda_2\mathcal{L}_{\text{MSE}}$ (Caselles-Dupré et al., 2024). On the weak-imagination validation set of 75 trials, the reported scores were PixCorr $=0.165$, SSIM $=0.052$, and AlexNet 2-way (layer 2) $=65.1\%$; scores were also reported for AlexNet 2-way (layer 5), Inception-v3 2-way, CLIP 2-way, and a portrait-vs.-landscape classifier (Caselles-Dupré et al., 2024). For strong imagination, no PixCorr or SSIM was available because no ground-truth images existed, but portrait/landscape classification reached 88%, and in about 30% of trials key elements from oral descriptions reappeared in reconstructions (Caselles-Dupré et al., 2024).
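The 2-way scores above belong to a family of pairwise identification metrics. The sketch below shows one common formulation, assuming feature vectors have already been extracted from reconstructions and ground-truth images by some network; the function names and the toy vectors are hypothetical, and the exact procedure used in the paper may differ.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def two_way_identification(recon_feats, true_feats):
    """Fraction of (trial, distractor) pairs in which a reconstruction
    correlates more with its own target than with the distractor.
    Chance level is 0.5."""
    wins, total = 0, 0
    for i, r in enumerate(recon_feats):
        own = pearson(r, true_feats[i])
        for j, other in enumerate(true_feats):
            if j == i:
                continue
            wins += own > pearson(r, other)
            total += 1
    return wins / total

# Hypothetical 4-dimensional features for three trials.
true_feats = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
recon_feats = [[0.9, 0.1, 0.2, 0.8],   # close to its target
               [0.1, 0.7, 0.9, 0.2],   # close to its target
               [0.2, 0.3, 0.9, 0.8]]   # a failed reconstruction
score = two_way_identification(recon_feats, true_feats)
print(score)
```

Because the metric only asks which of two candidates is closer, it can sit well above chance even when pixel-level scores such as SSIM are low, which is consistent with the pattern of results reported above.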

These findings were interpreted as supporting a dissociation between semantic and perceptual components of imagery. The same paper states that weak imagination reactivates fusiform and associative visual areas, whereas strong imagination recruits PFC strategies with partial fusiform overlap; it further suggests compatibility with predictive-coding accounts in which top-down signals sculpt early visual areas (Caselles-Dupré et al., 2024).

3. Formal models of internal imagination

The literature formalizes cognitive imagination in several distinct but related ways.

Formulation	Core representation	Function
Semantic model	Factual model + probabilistic causal model	Coherent semantic context (Vityaev et al., 8 Aug 2025)
Internal world model	Weighted graph $G=(V,E,W)$	Read-out of imagined scenarios (Ranjan et al., 5 Oct 2025)
Spatial world model	State $S_t$ with transition $S_{t+1}=f(S_t,a_t)$	Mental simulation of transformations (Lian et al., 16 Nov 2025)
Vision–language loop	Concatenated language and vision inputs into PFC LSTM	Closed thinking loop (Qi et al., 2019)

In semantic-model work, the imagination substrate is explicitly symbolic and causal. The factual tier stores deterministic facts about entities and their links, whereas the causal tier stores probabilistic causal relations written in the same language. A generic rule has the form

$A_1 \wedge \dots \wedge A_k \Rightarrow A_0$

with conditional probability

$P(A_0 \mid A_1 \wedge \dots \wedge A_k)$

The paper’s central claim is that “maximally specific causal relations” resolve statistical ambiguity, so that contradictory rules do not both apply with high probability, and that this gives imagination a glass-box, manipulable, logically consistent structure (Vityaev et al., 8 Aug 2025).
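How specificity removes conflicts between rules can be illustrated with a small sketch. It makes a strong simplifying assumption: specificity is approximated by set inclusion of premises, whereas the paper defines maximal specificity probabilistically. The `Rule` class, the example rules, and their probabilities are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    premise: frozenset   # conjunction of atoms in the rule's premise
    conclusion: str      # predicted atom
    prob: float          # assumed estimate of P(conclusion | premise)

def applicable(rule, facts):
    """A rule fires only if every atom of its premise is a known fact."""
    return rule.premise <= facts

def most_specific(rules, facts):
    """Keep applicable rules whose premise is not strictly contained in the
    premise of another applicable rule (a simplified specificity test)."""
    live = [r for r in rules if applicable(r, facts)]
    return [r for r in live
            if not any(r.premise < other.premise for other in live)]

# Hypothetical causal tier with two rules that would otherwise conflict.
rules = [
    Rule(frozenset({"bird"}), "flies", 0.90),
    Rule(frozenset({"bird", "penguin"}), "not flies", 0.98),
]

for facts in (frozenset({"bird"}), frozenset({"bird", "penguin"})):
    chosen = most_specific(rules, facts)
    print(sorted(facts), [(r.conclusion, r.prob) for r in chosen])
```

With only "bird" known, the general rule applies; once "penguin" is also known, the more specific rule displaces it, so the two contradictory conclusions are never both asserted for the same facts.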

A network-science formulation appears in “Internal World Models as Imagination Networks in Cognitive Agents.” There, imagination is modeled as access to an internal world model represented by a weighted graph $G=(V,E,W)$, where nodes are imagined scenarios and edge weights are partial correlations in vividness between scenarios (Ranjan et al., 5 Oct 2025). The network is estimated through EBICglasso, and centrality measures such as strength, expected influence, closeness, and betweenness characterize which scenarios are structurally central. Human networks derived from VVIQ-2 and PSIQ showed strong cross-sample correlations for strength, expected influence, and closeness, together with clear community structure, whereas LLM-generated networks often lacked clustering and showed weak or inconsistent alignment with human topologies (Ranjan et al., 5 Oct 2025).
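Two of the quantities named above have simple closed forms, sketched below. The code assumes a precision (inverse covariance) matrix has already been estimated, for example by a regularized method such as the graphical lasso; that estimation step and the EBIC model selection are not reproduced here, and the 3-node matrix is a made-up example.

```python
import math

def partial_correlations(P):
    """Edge weights from a precision matrix P:
    rho_ij = -P_ij / sqrt(P_ii * P_jj), with a zero diagonal."""
    n = len(P)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                R[i][j] = -P[i][j] / math.sqrt(P[i][i] * P[j][j])
    return R

def strength(W, i):
    """Sum of absolute edge weights attached to node i."""
    return sum(abs(w) for j, w in enumerate(W[i]) if j != i)

def expected_influence(W, i):
    """Sum of signed edge weights attached to node i."""
    return sum(w for j, w in enumerate(W[i]) if j != i)

# Hypothetical precision matrix for three imagined scenarios.
P = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
W = partial_correlations(P)
for node in range(3):
    print(node, strength(W, node), expected_influence(W, node))
```

Strength and expected influence coincide when all edges are positive and diverge when a node has negative partial correlations, which is why both are usually reported.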

A third family of models is world-model-based and dynamical. “Imagine in Space” defines imagination as iterative application of a learned transition function to a state vector $S_t$ encoding geometry and spatial relations, and argues that powerful spatial imagination must support egocentric–allocentric conversion, arbitrary-axis rotations, and projection prediction (Lian et al., 16 Nov 2025). “Human-like machine thinking: Language guided imagination” operationalizes a related idea in a multimodal LSTM architecture: the PFC subsystem receives the current language and visual representations, updates a working-memory LSTM, and predicts the next language output and imagined visual representation, thereby closing a loop in which language cues produce imagined visual transformations and imagined visuals feed back into language (Qi et al., 2019).

4. Generative, reconstructive, and multimodal implementations

A foundational generative account is “Generative Models of Visually Grounded Imagination,” which defines visually grounded imagination as the ability to create images of novel semantic concepts. The model modifies a VAE with joint, image-only, and attribute-only encoders, introduces the Triple-ELBO objective, and uses a product-of-experts inference network for partially specified concepts:

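$q(z \mid y_{\mathcal{O}}) \propto p(z) \prod_{k \in \mathcal{O}} q(z \mid y_k)$

where $\mathcal{O}$ indexes the observed attributes, $p(z)$ is the prior, and each $q(z \mid y_k)$ is a per-attribute expert.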

The paper evaluates imagination through the “3 C’s”: correctness, coverage, and compositionality, and reports on MNIST-with-attributes and CelebA that the model can synthesize plausible held-out combinations such as “bald + female” despite those combinations being absent from training (Vedantam et al., 2017).
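When the prior and every expert are Gaussian, a product-of-experts posterior of the kind used for partially specified concepts has a closed form: precisions add, and the mean is the precision-weighted average. The one-dimensional sketch below illustrates this; the function name and the example expert parameters are assumptions for illustration, and the paper's networks output such parameters per latent dimension rather than taking them as constants.

```python
def product_of_gaussians(experts, prior=(0.0, 1.0)):
    """Combine 1-D Gaussian experts, each given as (mean, variance),
    with a Gaussian prior. Omitted attributes contribute no expert."""
    terms = [prior] + list(experts)
    precision = sum(1.0 / var for _, var in terms)
    mean = sum(mu / var for mu, var in terms) / precision
    return mean, 1.0 / precision

# Hypothetical experts for two attributes.
expert_a = (2.0, 1.0)
expert_b = (4.0, 0.5)

print(product_of_gaussians([]))                    # no attribute specified
print(product_of_gaussians([expert_a]))            # one attribute specified
print(product_of_gaussians([expert_a, expert_b]))  # both specified
```

Each additional attribute can only shrink the posterior variance, which matches the intuition that a more fully specified concept corresponds to a narrower region of latent space.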

A more constrained model appears in “Seeking the Building Blocks of Visual Imagery and Creativity in a Cognitively Inspired Neural Network.” Its disentangled-feature VAE splits an 8-dimensional latent into 4-dimensional shape and 4-dimensional color components, with symbolic one-hot inputs for digit class and color prototype (Hedayati et al., 2021). The study shows that the model can “perfectly re-imagine any trained combination,” but cannot generate genuinely novel shape–color conjunctions that were never experienced during training. The authors therefore argue that memory-based replay and recombination, rather than feedforward interpolation alone, are needed for creativity-like behavior (Hedayati et al., 2021).

Several later systems make imagination an explicit intermediate reasoning modality. “Visualize Before You Write” introduces iNLG, in which a StableDiffusion v1-1 image generated from a text context is encoded into a visual prefix for an LLM, trained with cross-entropy plus an InfoNCE alignment loss (Zhu et al., 2022). On few-shot ROCStories, the reported rep-4 decreases from approximately 14.41% to 3.42%, diversity rises from 52.10 to 81.36, MAUVE rises from 9.10 to 35.94, and BERTScore rises from 21.23 to 23.03 (Zhu et al., 2022). The framework therefore treats imagination as a blueprint for text generation rather than only as a downstream visualization.
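The rep-4 figure is a repetition rate over 4-grams. A minimal sketch of the usual definition, 100 × (1 − unique n-grams / total n-grams), is given below; it assumes whitespace tokenization and per-text computation, which may not match the paper's exact evaluation setup.

```python
def rep_n(tokens, n=4):
    """Percentage of repeated n-grams: 100 * (1 - unique / total).
    Returns 0.0 when the text is shorter than n tokens."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 100.0 * (1.0 - len(set(grams)) / len(grams))

# Hypothetical generations: one loops, one does not.
looping = "the dog ran home the dog ran home".split()
varied = "the dog ran home and then slept all day".split()
print(rep_n(looping), rep_n(varied))
```

Lower values indicate less degenerate repetition, so the drop reported above corresponds to far fewer recycled 4-grams in the generated stories.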

“Thinking with Generated Images” goes further by allowing a unified autoregressive multimodal transformer to generate intermediate visual subgoals and self-critiqued visual hypotheses within its own chain of thought (Chern et al., 28 May 2025). The reported GenEval “Two Obj.” score rises from 0.38 for Anole-7b to 0.57 for the subgoal model and 0.59 for the self-critique final output; the paper summarizes this as up to 50% relative improvement in complex multi-object scenarios (Chern et al., 28 May 2025). “Do multimodal models imagine electric sheep?” studies a different mechanism: a Qwen3.5 VLM trained only to predict open-loop action sequences on twelve spatial puzzles develops hidden states that encode visually decodable intermediate world states, and explicitly integrating 16 visual tokens per step increases average solve rate from 83% to 89%, with especially strong gains on jigsaw and 3D mental rotation (Ramakrishnan et al., 10 May 2026).

5. Robotics, agents, and machine-thinking systems

In robotics, cognitive imagination is often operationalized as latent scene completion or future-state simulation. “Spatial Imagination With Semantic Cognition for Mobile Robots” defines semantic cognition as learning a category-specific prior over the top-down shape and layout of objects, and spatial imagination as filling in occluded or unobserved regions of a 2D top-down semantic map from current RGB-D observations (Shen et al., 2021). The architecture combines a ResNet-18 encoder with a U-Net decoder, optimized by a weighted binary cross-entropy that emphasizes occupied cells and plausible context regions. On Matterport3D scenes with sparse viewpoints, the reported IoU improvements are chairs: 0.155 to 0.197, tables: 0.189 to 0.202, and beds: 0.379 to 0.386, with chair imagination adding about 1.5 M extra correct pixels over HRNet (Shen et al., 2021).
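The training objective and the evaluation metric described above can be sketched on a toy top-down grid. This is an illustrative version under assumptions: the per-cell weighting scheme (occupied cells weighted 2.0, all others 1.0), the grid values, and the function names are hypothetical, and the paper's weighting of context regions is not reproduced.

```python
import math

def weighted_bce(pred, target, weight, eps=1e-7):
    """Weighted binary cross-entropy over a 2D grid, normalized by total weight."""
    total, norm = 0.0, 0.0
    for p_row, t_row, w_row in zip(pred, target, weight):
        for p, t, w in zip(p_row, t_row, w_row):
            p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
            total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            norm += w
    return total / norm

def iou(pred, target, threshold=0.5):
    """Intersection over union between thresholded predictions and a binary map."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            occupied = p >= threshold
            inter += occupied and t == 1
            union += occupied or t == 1
    return inter / union if union else 1.0

# Hypothetical 2x3 semantic map for one category (1 = occupied cell).
target = [[1, 1, 0],
          [0, 1, 0]]
pred = [[0.9, 0.4, 0.2],
        [0.6, 0.8, 0.1]]
weight = [[2.0 if t else 1.0 for t in row] for row in target]

loss = weighted_bce(pred, target, weight)
score = iou(pred, target)
print(loss, score)
```

Up-weighting occupied cells counteracts the class imbalance of sparse top-down maps, where most cells are empty and an unweighted loss would reward predicting nothing.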

The machine-thinking framework of “Can Mental Imagery Improve the Thinking Capabilities of AI Systems?” couples an Input Data Unit, Needs Unit, Mental Imagery Unit, and Cognitive Thinking Unit (CTU) (Larabi, 16 Jul 2025). Raw sensory inputs are captioned into sentences, internal goals are stored as sentences, and the Mental Imagery Unit generates images from CTU prompts. The CTU then computes an embedding for each of these three sources, updates its internal state from those embeddings, and produces actions or further queries (Larabi, 16 Jul 2025). In the reported desk-scene validation, the best context–need match by cosine similarity was “a laptop computer with a bunch of keys on it”, and generated actions such as “Pick up the keys and open the door…” achieved 92% n-gram overlap with ground truth, while a hydration example reached 88% overlap (Larabi, 16 Jul 2025).
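The two measurements in this validation, context–need matching by cosine similarity and n-gram overlap of generated actions, can be sketched as follows. Everything concrete here is an assumption: the three-dimensional embeddings are invented stand-ins for real sentence embeddings, the second caption and the need sentence are made up, and the overlap function is one plausible definition rather than the paper's.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match(context_embs, need_emb):
    """Return the caption whose embedding is most similar to the need."""
    scored = {cap: cosine(e, need_emb) for cap, e in context_embs.items()}
    caption = max(scored, key=scored.get)
    return caption, scored[caption]

def ngram_overlap(candidate, reference, n=2):
    """Fraction of reference n-grams that also occur in the candidate."""
    def grams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ref = grams(reference)
    return len(ref & grams(candidate)) / len(ref) if ref else 0.0

# Hypothetical embeddings for two scene captions and one need sentence.
context_embs = {
    "a laptop computer with a bunch of keys on it": [0.9, 0.1, 0.3],
    "a cup of coffee on a desk": [0.1, 0.9, 0.2],
}
need_emb = [0.8, 0.2, 0.4]   # e.g. "I need to unlock the door"

caption, similarity = best_match(context_embs, need_emb)
overlap = ngram_overlap("pick up the keys and open the door",
                        "pick up the keys and unlock the door")
print(caption, similarity, overlap)
```

In a full system the selected caption would seed the prompt sent to the imagery component, and the overlap score would only be used offline to compare generated actions with reference actions.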

Long-horizon action models make imagination explicitly temporal. “DMWM: Dual-Mind World Model with Long-Term Imagination” combines an RSSM-based System 1 with a logic-integrated neural System 2, using inter-system feedback so that imagination follows logical rules of the environment (Wang et al., 11 Feb 2025). The paper reports up to 5.5× higher test returns under identical trial budgets, approximately 32% higher final returns in complex tasks, and up to 120% higher returns than Dreamer or GD-MPC at long horizons (Wang et al., 11 Feb 2025). “MemoryVLA++” extends VLA models with working memory, a Perceptual-Cognitive Memory Bank, and a latent imagination module based on a video diffusion world model; on real robots it reports +9%, +26%, and +28% gains on general, memory-dependent, and imagination-dependent tasks, respectively (Shi et al., 8 Jun 2026).

Applications proposed across these systems include assistive brain–computer interfaces for locked-in patients, witness imagery verification with consent, a “visual diary” of designers’ mental sketches, semantic mapping and collision avoidance in mobile robots, and temporally consistent robotic manipulation (Caselles-Dupré et al., 2024, Shen et al., 2021, Shi et al., 8 Jun 2026).

6. Constraints, misconceptions, and research directions

A common misconception is that cognitive imagination is exhausted by vivid visual imagery. Several papers explicitly reject that reduction. “Don’t Forget Imagination!” treats semantic context, causal coherence, and context switching as central, and argues that reasoning continually returns to imagined context for semantic verification (Vityaev et al., 8 Aug 2025). The internal-world-model network analysis likewise shifts attention from the vividness of individual images to the topology of relations among imagined scenarios, showing that human imagination networks have recurring cluster structure while current LLM networks often collapse into a single monolithic cluster (Ranjan et al., 5 Oct 2025).

A second misconception is that present-day multimodal models already reason through perception-grounded imagination whenever they solve spatial tasks. “Imagine in Space” reports the opposite pattern for many advanced VLMs: they predominantly rely on linguistic representations, perform poorly on visual-centric tasks such as mental rotation and projection prediction, and show token usage that grows rapidly with transformation complexity (Lian et al., 16 Nov 2025). On visual-centric mental rotation, Gemini 2.5 Pro peaks at 20.5% accuracy while humans score nearly 100%; open-source VLMs remain at or near chance on every task except the simplest Cube Rolling (Lian et al., 16 Nov 2025). The paper’s Imagery Driven Framework improves Qwen-2.5-VL-7B from 12.5% to 42.3% on Cube Rolling, from 20.1% to 44.7% on Rubik’s Cube, and from 4.3% to 7.5% on Mental Rotation, but the gains remain partial (Lian et al., 16 Nov 2025).

Empirical and ethical constraints are equally prominent. The strongest neural reconstruction work still relies on very small subject counts, including the $N=1$ proof of concept in “Mind-to-Image,” where low SSIM and subjective evaluation for pure imagination remain explicit limitations (Caselles-Dupré et al., 2024). Several papers call for multi-subject datasets, richer ground-truth protocols for pure imagination, long-term memory modules, unified multimodal latent spaces, multisensory imagination, and better integration of causal or predictive world models (Caselles-Dupré et al., 2024, Larabi, 16 Jul 2025, Shi et al., 8 Jun 2026). The same literature also raises consent frameworks, “mental privacy” legislation, and the risks attached to technologies that can reconstruct or classify internal states (Caselles-Dupré et al., 2024).

Taken together, the field presents cognitive imagination as an internal representational capacity that can be sensory-like, semantic, causal, or dynamical. Across neuroscience, generative modeling, spatial reasoning, and robotics, the central research problem is not simply how to render images, but how to build internally consistent world models that can be queried, manipulated, and checked against goals, memory, and evidence.