Emergent Compositional Communication for Latent World Properties

Published 18 Mar 2026 in cs.MA and cs.LG | (2604.03266v1)

Abstract: Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm multi-agent structure -- not bandwidth or temporal coverage -- drives this effect. Causal intervention shows surgical property disruption (~15% drop on targeted property, <3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially-visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass-comparison accuracy on unseen objects, temporal dynamics contributing +11.2% beyond static appearance, agent-scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).

Authors (1)

Tomek Kaszyński

Summary

The paper demonstrates emergent compositional communication by enabling agents to encode unobservable physical properties through a discrete Gumbel-Softmax bottleneck.
It compares frozen DINOv2 and V-JEPA 2 backbones and finds that V-JEPA 2 does better on dynamics-only collision physics (87.4% vs 77.7%), a gap the controls attribute to video-native pretraining.
The work validates its approach on Physics 101 real camera footage, reporting 85.6% mass-comparison accuracy on unseen objects along with interpretable communication protocols.

Abstract and Introduction

The paper "Emergent Compositional Communication for Latent World Properties" (2604.03266) investigates whether artificial agents can develop structured, discrete representations of unobservable physical properties under multi-agent communication pressure. Communicating through a Gumbel-Softmax bottleneck, agents encode attributes such as elasticity, friction, and mass ratio from frozen, pretrained video features, drawing on temporal dynamics that are invisible in any single observation.
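
As a rough illustration of the bottleneck mentioned above, the sketch below draws a Gumbel-Softmax sample per message position using only the Python standard library. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the temperature `tau`, the two-position message, and the three-symbol vocabulary are all hypothetical choices made for the example.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample over vocabulary symbols for one message position."""
    scaled = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = max(rng.random(), 1e-12)  # clamp away from 0 so log() stays finite
        gumbel = -math.log(-math.log(u))
        scaled.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def encode_message(position_logits, tau=1.0, rng=random):
    """Return (soft, hard): relaxed distributions and argmax symbol ids per position."""
    soft = [gumbel_softmax(logits, tau, rng) for logits in position_logits]
    hard = [max(range(len(p)), key=p.__getitem__) for p in soft]
    return soft, hard


rng = random.Random(0)
# Toy logits (assumed): two message positions, three symbols each.
# The 50-point gap exceeds the range of the clamped Gumbel noise,
# so the argmax symbol is fixed regardless of the seed.
soft, hard = encode_message([[50.0, 0.0, 0.0], [0.0, 0.0, 50.0]], tau=0.5, rng=rng)
print(hard)  # [0, 2]
```

Lower values of `tau` push each relaxed distribution toward a one-hot vector, which is what makes the channel behave as a discrete symbol while remaining differentiable during training.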

Using controlled setups with DINOv2 and V-JEPA 2 backbones, the work examines how video-native pretraining and communication constraints together give rise to compositional, language-like protocols. Such emergent abstractions have implications for robotics, interpretable world modeling, and multi-agent coordination in settings that require an understanding of latent physical properties.

Experimental Setup

The study compares DINOv2 and V-JEPA 2 backbones through factorial comparisons across physics simulations. DINOv2 does better on spatially visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 does better on dynamics-only collision physics (87.4% vs 77.7%). The authors attribute the collision gap to video-native pretraining.

Agents observe controlled physics simulations of balls with hidden physical attributes. To solve pairwise comparison tasks, they compress these observations into discrete messages sent through the Gumbel-Softmax bottleneck. The experiments then test whether the structure of a message mirrors the latent properties of the world.

Results

Compositionality and Generalization: Under favorable setup conditions (a compatible perception backbone, iterated learning, and positionally structured messages), two-agent populations develop compositional protocols in over 54% of cases, and four-agent populations do so in 100% of 80 seeds (PosDis=0.999, holdout 98.3%). The emergent protocols show clear diagonal specialization, in which each message position aligns with a single property (Figure 1).

Figure 1: Mutual information between message positions and physical properties. Compositional agents (left) show clean diagonal specialization; holistic agents (right) encode both properties in a single undifferentiated symbol.
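
The diagonal pattern in Figure 1 is the kind of structure a positional disentanglement score (PosDis) summarizes. The sketch below assumes the commonly used definition: for each message position, take the gap between its two largest mutual-information values across properties, normalize by the position's entropy, and average over positions, skipping constant positions. The summary does not spell out the paper's exact formula, so this definition, the helper names, and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of an empirical distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def posdis(messages, properties):
    """messages, properties: equal-length lists of tuples (at least 2 properties)."""
    scores = []
    for j in range(len(messages[0])):
        symbols = [m[j] for m in messages]
        h = entropy(symbols)
        if h == 0:
            continue  # a constant position carries no information; skipped (assumed convention)
        mis = sorted(
            (mutual_information(symbols, [p[k] for p in properties])
             for k in range(len(properties[0]))),
            reverse=True,
        )
        scores.append((mis[0] - mis[1]) / h)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: two discretized properties on a full 3 x 3 grid.
props = [(a, b) for a in range(3) for b in range(3)]
compositional = [(a, b) for a, b in props]      # one position per property
holistic = [(3 * a + b, 0) for a, b in props]   # both properties fused in one symbol

print(round(posdis(compositional, props), 6))  # 1.0
print(round(posdis(holistic, props), 6))       # 0.0
```

In this toy case the compositional code scores 1 because each position is informative about exactly one property, while the holistic code scores 0 because its single informative position is equally informative about both.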

Backbone-Dependent Advantage: V-JEPA 2 outperforms DINOv2 on motion-dependent collision dynamics (87.4% vs 77.7%, d=2.74). Neither the scale-matched (d=3.37) nor the frame-matched (d=6.53) DINOv2 control closes the gap, which points to video-native pretraining rather than network capacity or input size as the cause (Figure 2).
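
The abstract reports these backbone gaps as effect sizes (d=2.74, d=3.37, d=6.53). Assuming d denotes Cohen's d with a pooled standard deviation, which this summary does not state explicitly, the calculation can be sketched as follows. The per-seed accuracies in the example are made-up toy numbers, not values from the paper.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled


# Toy values only: means 4 and 3, both sample variances 4, pooled sd 2.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d)  # 0.5
```

Because d is expressed in units of seed-to-seed spread, a small accuracy gap can still yield a large d when runs are very consistent across seeds.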

Figure 2: Causal intervention on message positions during cross-property reasoning. The receiver selectively reads elasticity from position 0 of message A (-14.7%) and friction from position 1 of message B (-15.2%), while irrelevant positions cause negligible drops.
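
The intervention logic behind Figure 2 can be sketched on toy data: scramble one message position across a batch and measure how much accuracy each property loses. Everything here is a hypothetical stand-in. The `toy_receiver`, the shuffle-based ablation, and the perfectly compositional toy messages are assumptions for illustration and do not reproduce the paper's trained receiver or its exact intervention procedure.

```python
import random


def toy_receiver(msg_a, msg_b, prop):
    # Hypothetical receiver: reads property `prop` from message position `prop`
    # and answers "does scene A have the larger value?".
    return msg_a[prop] > msg_b[prop]


def accuracy(msgs_a, msgs_b, labels, prop):
    hits = sum(toy_receiver(a, b, prop) == y[prop]
               for a, b, y in zip(msgs_a, msgs_b, labels))
    return hits / len(labels)


def shuffle_position(messages, position, rng):
    """Shuffle one message position across the batch, leaving the others intact."""
    column = [m[position] for m in messages]
    rng.shuffle(column)
    return [m[:position] + (s,) + m[position + 1:] for m, s in zip(messages, column)]


def intervention_drops(msgs_a, msgs_b, labels, rng):
    """drops[position][prop]: accuracy lost on `prop` when `position` of message A is shuffled."""
    n = len(msgs_a[0])  # toy assumption: one message position per property
    base = [accuracy(msgs_a, msgs_b, labels, k) for k in range(n)]
    drops = []
    for pos in range(n):
        ablated = shuffle_position(msgs_a, pos, rng)
        drops.append([base[k] - accuracy(ablated, msgs_b, labels, k) for k in range(n)])
    return drops


rng = random.Random(0)
# Toy, perfectly compositional messages: position k carries property k (values 0..4).
msgs_a = [(rng.randrange(5), rng.randrange(5)) for _ in range(200)]
msgs_b = [(rng.randrange(5), rng.randrange(5)) for _ in range(200)]
labels = [(a[0] > b[0], a[1] > b[1]) for a, b in zip(msgs_a, msgs_b)]

drops = intervention_drops(msgs_a, msgs_b, labels, rng)
for pos, row in enumerate(drops):
    print(pos, [round(x, 3) for x in row])
```

With a compositional code, the off-diagonal entries of `drops` are exactly zero in this toy setup, mirroring the "surgical" pattern the caption describes, where disrupting one position affects only the property it carries.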

Transfer and Real-World Validation: Results on Physics 101 real camera footage support the method's real-world applicability. Agents reach 85.6% mass-comparison accuracy on unseen objects, with temporal dynamics contributing +11.2% beyond static appearance, which indicates that the learned compositional structure generalizes beyond controlled simulations.

Figure 3: Agents allocate communication bandwidth proportional to property extractability across both vision (6 properties) and physics (3 properties) domains. Correlations are strong but based on small n and should be interpreted as suggestive of information-allocation principles.

Conclusion

This study shows how multiple independent agents can align representations of unobservable physical properties through structured protocols, with the observed differences driven by pretraining rather than by model scale or simple baseline adjustments. The results point toward interpretable, communicative AI models and are presented as a step toward autonomous machine intelligence with physical understanding, in line with the JEPA view of cognitive architecture.

Discrete, structured communication interfaces could also make decision-making in AI planning and task execution easier to inspect and act on, supporting the integration of intelligent systems into dynamic real-world environments.
