Emergent Gestalt Organization in Self-Supervised Vision Models: An Analytical Overview
The paper "From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models" examines whether self-supervised vision models exhibit perceptual behaviors akin to Gestalt principles, and which training conditions foster them. It starts from the observation that human visual perception is inherently holistic, governed by Gestalt principles such as closure, proximity, continuity, and figure-ground assignment, which are essential for understanding global spatial structure. The central question is whether modern vision models, specifically Vision Transformers (ViTs) trained with Masked Autoencoding (MAE), demonstrate a similar sensitivity to global spatial relationships.
Methodological Insights into Gestalt Organization
The authors establish a methodological framework that leverages ViTs trained with MAE, demonstrating that such models exhibit internal activation patterns consistent with Gestalt principles. The paper introduces the Distorted Spatial Relationship Testbench (DiSRT), a benchmark designed to assess a model's sensitivity to perturbations in global spatial structure while retaining local textures. Using DiSRT, the authors reveal that self-supervised models (MAE, CLIP) outperform conventional supervised baselines, sometimes even surpassing human performance. Notably, ConvNeXt models trained with MAE exhibit Gestalt-like representations, suggesting that architectural biases are not a prerequisite for such global structure sensitivity.
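To make the idea of a DiSRT-style perturbation concrete, here is a minimal sketch of one way to disrupt global spatial structure while retaining local textures: shuffle non-overlapping patches of an image. This is an illustration, not the benchmark's actual procedure; `shuffle_patches` and its patch-grid scheme are assumptions for exposition.

```python
import numpy as np

def shuffle_patches(image, patch_size, rng=None):
    """Globally shuffle non-overlapping patches of an image.

    Local texture inside each patch is preserved, but the global
    spatial arrangement is destroyed -- the kind of perturbation a
    global-structure benchmark like DiSRT probes sensitivity to.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    ph = pw = patch_size
    assert h % ph == 0 and w % pw == 0, "image must tile evenly into patches"
    # Split into a grid of patches, shuffle the grid, reassemble.
    rows = [image[i:i + ph] for i in range(0, h, ph)]
    patches = [r[:, j:j + pw] for r in rows for j in range(0, w, pw)]
    order = rng.permutation(len(patches))
    grid_w = w // pw
    out = np.empty_like(image)
    for idx, src in enumerate(order):
        gy, gx = divmod(idx, grid_w)
        out[gy * ph:(gy + 1) * ph, gx * pw:(gx + 1) * pw] = patches[src]
    return out
```

A model sensitive only to local cues scores similarly on the original and shuffled versions; a model with Gestalt-like global sensitivity treats them very differently.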
Key Results and Interpretations
The study highlights several core findings:
- Emergent Sensitivity in Self-Supervised Models: Models trained with self-supervised objectives such as MAE and CLIP score markedly higher on DiSRT than their supervised counterparts. This gap suggests that training objectives focused on reconstruction rather than categorization contribute significantly to the emergence of Gestalt-compatible perception.
- Fragility of Global Sensitivity: DiSRT performance degrades when self-supervised models undergo conventional supervised finetuning, indicating that classification-centric training objectives can suppress global perceptual organization.
- Reviving Perceptual Sensitivity: A Top-K activation sparsity mechanism restores global sensitivity in models whose sensitivity was eroded by supervised finetuning. This biologically inspired intervention underscores the potential to bridge neural computation with cognitive principles.
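The Top-K idea above can be illustrated with a minimal sketch, assuming the general form of such mechanisms rather than the paper's exact implementation: per sample, keep only the k largest-magnitude activations and zero out the rest.

```python
import numpy as np

def topk_sparsify(activations, k):
    """Keep the k largest-magnitude activations per sample; zero the rest.

    A minimal, assumed form of Top-K activation sparsity: winner-take-most
    competition among units, loosely analogous to sparse coding in biology.
    """
    acts = np.asarray(activations, dtype=float)
    # Indices of the k largest |values| along the feature axis.
    idx = np.argpartition(-np.abs(acts), k - 1, axis=-1)[..., :k]
    mask = np.zeros_like(acts, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, acts, 0.0)
```

For example, `topk_sparsify([[3.0, -5.0, 1.0, 2.0]], 2)` keeps only the two largest-magnitude entries, yielding `[[3.0, -5.0, 0.0, 0.0]]`.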
Practical and Theoretical Implications
The findings have substantial implications for both practical applications and theoretical understanding of artificial perception. Practically, incorporating Gestalt principles into self-supervised learning frameworks could enhance the robustness and interpretability of vision models, particularly in applications requiring nuanced visual understanding outside standard classification tasks. Theoretically, this research bridges classic theories of human perception with contemporary machine learning paradigms, offering insights into the computational analogs of perceptual phenomena.
Future Directions
The study paves the way for exploring alternative training objectives and architectural designs that inherently encourage global dependency modeling. Additionally, refining benchmarks like DiSRT could further elucidate the conditions necessary for developing structured representations in deep neural networks, potentially informing developments in related areas such as neuromorphic computing and biologically-inspired AI systems.
In conclusion, this paper contributes significantly to the discourse on perceptual organization in artificial intelligence, suggesting that self-supervised learning strategies hold promise for fostering Gestalt-compatible behaviors, provided that global context modeling is prioritized in training methodologies.