
Shared World Hypothesis Overview

Updated 6 December 2025
  • The Shared World Hypothesis is a theoretical framework proposing that independent agents develop similar internal representations when exposed to data grounded in consistent physical, semantic, or social structure.
  • Empirical studies using single-life learning paradigms and geometric alignment demonstrate that models trained on separate individual data streams achieve downstream performance near-equivalent to Internet-trained baselines.
  • The hypothesis extends to decentralized multi-agent systems and cognitive frameworks, showing that predictive coding and emergent communication lead to shared, functionally aligned representations.

The Shared World Hypothesis posits that multiple independent agents, be they humans, artificial systems, or models, can converge on highly similar internal representations—or shared conceptual worlds—when exposed to data or experiences grounded in the same underlying physical, semantic, or social environment. This hypothesis spans vision, social cognition, communication, language, dynamics modeling, and representational emergence, and is supported by empirical, theoretical, and modeling evidence across cognitive science and machine learning. Central to its modern formulation is the finding that the structural and statistical regularities inherent to the world act as a strong inductive signal, such that independently learned models—given sufficient data or interaction—reliably encode concordant, interoperable abstractions.

1. Formalization and Motivating Evidence

The hypothesis states that if a set of agents, each having access only to partial, idiosyncratic, or entirely non-overlapping views of the world, attempt to independently learn internal models—whether for perception, prediction, communication, or action—the resulting representations should display strong functional alignment provided these views reflect the same environmental regularities. In "Unique Lives, Shared World: Learning from Single-Life Videos," the claim is made precise for egocentric video data: all personal video streams, irrespective of the individual or the specific environments traversed, are grounded in invariant physical laws (e.g., Euclidean geometry, object permanence), and thus independently trained self-supervised visual models should recover functionally isomorphic geometric encodings (Han et al., 3 Dec 2025).

This assertion contrasts with approaches that require massive, diverse Internet-scale datasets to obtain generalization. Instead, it is hypothesized that inductive bias provided by the world's structural consistency suffices—even within a "single life"—to drive models toward shared internal representations.
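
One schematic way to write this claim (the notation here is illustrative, not drawn from the cited papers): for agents $i$ and $j$ learning encoders $f_i$ and $f_j$ from independent views of the same world $W$, the hypothesis predicts that some simple alignment map $h$ relates the two codes up to a small error,

$$\mathcal{D}_i, \mathcal{D}_j \sim W, \qquad \exists\, h \in \mathcal{H}:\; \mathbb{E}_{x \sim W}\big[\, d\big(h(f_i(x)),\, f_j(x)\big)\big] \le \varepsilon,$$

where $d$ is a representational dissimilarity measure and $\mathcal{H}$ is a restricted class of maps (e.g., linear or permutation maps), so that the learned representations are functionally interchangeable rather than literally identical.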

2. Single-Life Learning and Geometric Alignment

A notable operationalization is the "single-life" learning paradigm: define a "life" as the egocentric video $\mathcal{D}_i$ recorded by an individual $i$, and train a model $f_{\theta_i}$ using only $\mathcal{D}_i$, yielding parameters

$$\theta_i^* = \arg\min_{\theta_i} \mathcal{L}(f_{\theta_i}, \mathcal{D}_i).$$

No data or gradients are ever shared across these per-life models. Nonetheless, empirical results show that when learning architectures capitalize on natural cross-view redundancy (e.g., cross-view masked autoencoding using a ViT-based encoder and a cross-attention decoder), models trained on separate lives exhibit highly aligned patch-to-patch geometric correspondences.
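
A minimal sketch of this per-life training protocol is given below, assuming a generic PyTorch encoder-decoder and a masked-reconstruction loss; the actual architecture in Han et al. is a ViT encoder with a cross-attention decoder, and the model, masking ratio, and loss here are simplified placeholders.

```python
import torch
from torch import nn, optim

def train_single_life(model: nn.Module, life_loader, epochs: int = 1) -> nn.Module:
    """Train one model on one person's video stream only; no cross-life sharing."""
    opt = optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for view_a, view_b in life_loader:  # two views of the same scene from one life
            # Keep ~25% of the target view's values (mask ~75%) before reconstruction.
            masked_a = view_a * (torch.rand_like(view_a) > 0.75)
            recon = model(masked_a, view_b)  # reconstruct view_a conditioned on view_b
            loss = nn.functional.mse_loss(recon, view_a)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Each "life" gets its own independently initialized model; parameters are never shared:
# models = [train_single_life(make_model(), loader_i) for loader_i in per_life_loaders]
```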

Alignment is quantitatively assessed by the Correspondence Alignment Score (CAS), based on the overlap of cross-attention maps across independently trained models. CAS values approach those of Internet-trained baselines within hours of training, and quantitative evaluation on monocular depth, zero-shot correspondence, and label-propagation benchmarks confirms equivalent or near-equivalent downstream generalization (Han et al., 3 Dec 2025). Statistical significance (error bars reported; $p < 0.01$) separates models trained on true single-life data from non-life controls (static or synthetic videos), supporting the core hypothesis.
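
A simplified reading of such an alignment score is sketched below; the exact CAS definition is given in Han et al., and here it is approximated as the fraction of query patches whose top-1 cross-attention match agrees between two independently trained models evaluated on the same image pair.

```python
import torch

def correspondence_agreement(attn_a: torch.Tensor, attn_b: torch.Tensor) -> float:
    """attn_a, attn_b: [num_query_patches, num_key_patches] cross-attention maps
    produced by two independently trained models on the same image pair.
    Returns the fraction of query patches whose argmax correspondence agrees."""
    match_a = attn_a.argmax(dim=-1)  # best-matching key patch per query, model A
    match_b = attn_b.argmax(dim=-1)  # best-matching key patch per query, model B
    return (match_a == match_b).float().mean().item()
```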

3. Collective and Decentralized Representation Emergence

The Shared World Hypothesis is further generalized in multi-agent, decentralized contexts. In "Decentralized Collective World Model for Emergent Communication and Coordination," decentralized agents, each with only partial and noisy observations, interact solely through lossy, sample-based message exchange. Despite never accessing each other's internal states, and operating under strict bandwidth/information constraints, agents develop a mutually aligned latent state via predictive coding and contrastive alignment (InfoNCE loss), operationalized as convergence in message-space geometry (Nomura et al., 4 Apr 2025).
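
The contrastive alignment term can be sketched with a standard InfoNCE loss over paired latents; this is the generic formulation, and the paper's exact parameterization and negative-sampling scheme may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z_self: torch.Tensor, z_other: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z_self, z_other: [batch, dim] latents from two agents observing the same
    underlying state. Rows with matching indices are positive pairs; all other
    rows in the batch act as negatives."""
    z_self = F.normalize(z_self, dim=-1)
    z_other = F.normalize(z_other, dim=-1)
    logits = z_self @ z_other.t() / temperature  # [batch, batch] similarity matrix
    targets = torch.arange(z_self.size(0), device=z_self.device)
    return F.cross_entropy(logits, targets)
```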

Empirically, the decentralized emergent-communication models outperform non-communicative baselines in coordinated tasks, especially under perceptual impoverishment. Representation similarity analysis confirms that the agents' latent symbols coalesce to encode the global state, supporting the emergence of a shared world representation that is both functionally useful (coordination-enhancing) and symbolically meaningful.
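
Representation similarity analysis of this kind can be sketched as a correlation between the agents' pairwise latent-distance structures; this is a generic RSA recipe, not the paper's exact analysis pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(latents_a: np.ndarray, latents_b: np.ndarray) -> float:
    """latents_a, latents_b: [num_states, dim] latent codes from two agents for the
    same sequence of global states. Correlates their pairwise-distance structure:
    a high score indicates the agents carve up the state space similarly."""
    rdm_a = pdist(latents_a, metric="cosine")  # condensed representational dissimilarity matrix
    rdm_b = pdist(latents_b, metric="cosine")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return float(rho)
```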

4. Theoretical and Cognitive Science Foundations

The hypothesis is also formalized within hierarchical generative modeling and active inference. In "A World unto Itself: Human Communication as Active Inference," shared mental-state alignment is cast as a species-typical adaptive prior $p(n_A \approx n_B)$ over agents' hidden states. Here, communication is a policy for epistemic foraging—specifically, a means of gathering sensory evidence to confirm the assumed alignment of internal world models (Vasil et al., 2019).

Minimization of variational and expected free-energy under this prior yields both a normative explanation for cooperative communication and precise neurocomputational predictions (e.g., interbrain synchrony, hierarchical predictive coding across cortical regions). Ontogenetically and culturally, the hypothesis connects to the evolution of communicative constructions, which serve as stable deontic cues, shaping both joint attention in infancy and the formation of stable grammatical or semantic conventions across populations.
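
In standard active-inference notation (this is the generic decomposition; the generative model in Vasil et al. is richer and includes the alignment prior explicitly), the variational free energy each agent minimizes over hidden states $s$ and observations $o$ can be written as

$$F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] = D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o),$$

so that once the prior $p(n_A \approx n_B)$ is built into the generative model $p(o, s)$, communicative acts that gather evidence for mutual alignment directly reduce free energy and its expected counterpart over policies.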

5. Information-Theoretic and Machine Consciousness Models

Extending beyond classical supervised or agent-based settings, the Shared World Hypothesis is explored in substrate-agnostic, distributed predictive systems. "Testing the Machine Consciousness Hypothesis" articulates a framework where local predictive agents, embedded in a cellular automaton "base reality," interact via representational dialogue—message-passing determined by information bottleneck constraints and mutual information maximization. Alignment and emergence of a global shared world are detected via topological markers (disappearance of holes in synergy complexes), and the fixed point of distributed mutual prediction is identified as a collective self-model (Fitz, 30 Nov 2025). Here, the shared world is not merely an environment to be modeled, but the emergent product of codebook alignment among communicating observers.
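
The information-bottleneck constraint on inter-observer messages can be written in its standard Lagrangian form (generic notation, not the paper's specific parameterization): each observer compresses its local observation $X$ into a message $Z$ while keeping $Z$ predictive of a target variable $Y$, such as another observer's state,

$$\min_{p(z \mid x)} \; I(X; Z) - \beta\, I(Z; Y),$$

with $\beta$ trading off compression against predictive value; distributed mutual prediction under this constraint is what the framework identifies as driving codebooks toward a common fixed point.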

6. Computational and Dynamical Mechanisms in Social Perception

On the level of individual and social perception, predictive coding neural models have demonstrated that the apparent "closeness" of shared world perceptions in social contexts can be captured by modulating scalar precision parameters. In experiments modeling human spatial reproduction tasks, both prior and sensory precision are adjusted to interpolate smoothly between social, mechanical, and individual task settings (Tsfasman et al., 2022). The social context reduces inter-individual prior variance while enhancing sensory-driven divergence, suggesting that social interaction synchronizes prior beliefs while simultaneously amplifying the impact of shared sensory signals. The underlying mechanism is thus computationally and dynamically instantiated as a continuum, not a binary shift.
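
The precision modulation can be illustrated with a one-dimensional Gaussian update, where the estimate interpolates between the prior and the sensory sample according to their precisions; this is a textbook reduction of the predictive-coding model, not the authors' full implementation.

```python
def precision_weighted_estimate(prior_mean: float, prior_precision: float,
                                observation: float, sensory_precision: float) -> float:
    """Posterior mean of a Gaussian prior combined with a Gaussian likelihood.
    Raising sensory_precision pulls the estimate toward the shared stimulus;
    raising prior_precision pulls it toward the individually held prior."""
    total = prior_precision + sensory_precision
    return (prior_precision * prior_mean + sensory_precision * observation) / total

# Example: a "social" context could be modeled by giving individuals similar prior
# means (reduced prior variance across people) and a higher sensory precision.
print(precision_weighted_estimate(prior_mean=0.0, prior_precision=1.0,
                                  observation=1.0, sensory_precision=4.0))  # -> 0.8
```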

7. Generalization Across Language, Modality, and Culture

The Semantic Hub Hypothesis extends the Shared World Hypothesis to LLMs, which are shown to develop a single, modality-agnostic representation space ("hub") in their intermediate layers. Data confirm that equivalent meanings—regardless of surface form (e.g., across languages, code/text, numbers, multimodal fragments)—are mapped to proximate points in this hub (Wu et al., 7 Nov 2024). Perturbations in one modality causally and systematically alter outputs in another, demonstrating that the hub is not only structurally but functionally shared.
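
A minimal probe of such a hub is sketched below; `encode_hidden` is a hypothetical placeholder for extracting a model's intermediate-layer hidden state, not an API from the cited work.

```python
import torch
import torch.nn.functional as F

def hub_similarity(encode_hidden, input_a: str, input_b: str, layer: int) -> float:
    """encode_hidden(text, layer) -> [dim] pooled hidden state at the given layer
    (placeholder function supplied by the user). Returns cosine similarity; under
    the hub hypothesis, translations, code/text pairs, or other cross-form
    paraphrases of the same meaning score high at intermediate layers."""
    h_a = encode_hidden(input_a, layer)
    h_b = encode_hidden(input_b, layer)
    return F.cosine_similarity(h_a.unsqueeze(0), h_b.unsqueeze(0)).item()

# e.g. hub_similarity(encode_hidden, "three plus four", "3 + 4", layer=16)
```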

Similarly, in multi-task dynamics learning, meta-world models successfully learn a shared latent transition model across environments with radically different observations but identical underlying physics, highlighting the robustness of this inductive property to changes in low-level sensory statistics (Wu et al., 2018). In the domain of cumulative culture, models of hunter-gatherer social evolution establish that the onset of "shared intentions" (the capacity for joint, recursive intention formation) requires sufficiently large group-level benefit to trigger group-wide alignment and the formation of a shared cognitive world (Angus et al., 2015).
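
The shared latent transition model described above can be sketched as environment-specific encoders feeding a single dynamics network; this is an illustrative PyTorch layout, not the architecture of Wu et al. (2018).

```python
import torch
from torch import nn

class SharedDynamicsModel(nn.Module):
    """Per-environment encoders map dissimilar observations into one latent space;
    a single transition model captures the physics common to all environments."""
    def __init__(self, obs_dims: dict, latent_dim: int, action_dim: int):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Linear(dim, latent_dim) for name, dim in obs_dims.items()
        })
        self.transition = nn.Sequential(  # shared across all environments
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, env: str, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        z = self.encoders[env](obs)                       # environment-specific encoding
        return self.transition(torch.cat([z, action], dim=-1))  # shared next-state prediction

# model = SharedDynamicsModel({"pixels": 3072, "lidar": 64}, latent_dim=32, action_dim=4)
```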


In sum, the Shared World Hypothesis has been rigorously formalized and empirically validated across self-supervised visual learning, multi-agent coordination, cognitive neuroscience, language modeling, and social-cultural evolution. Alignment and emergence of shared representation arise as a direct consequence of structural regularities in the environment, mutual predictive interaction, and appropriate inductive bias, often requiring no explicit parameter sharing or supervised correspondence. Open problems remain in scaling these principles to high-dimensional, multimodal, and longer-horizon domains, and in fully characterizing the limits of generalization for non-geometric, semantic, or culturally variant domains.
