Attractor-Based Frame-Level Modeling
- Attractor-based frame-level modeling is a framework that binds spatiotemporal features into stable Gestalt states using recurrent neural dynamics.
- The approach employs LSTM architectures with retrospective inference and mutually-exclusive softmax to dynamically bind features and predict canonical motions.
- Experimental demonstrations, such as the silhouette illusion, validate its ability to resolve ambiguous sensory inputs by switching attractor states.
Attractor-based frame-level modeling is a computational framework for dynamic perceptual inference that combines recurrent neural architectures, feature-binding mechanisms, and latent bias adaptation to resolve ambiguous sensory inputs into stable, interpretable Gestalt states. This approach, exemplified by the model described in "Binding Dancers Into Attractors" (Kaltenberger et al., 2022), implements perception as an online process that binds spatiotemporal features into canonical entities and infers observer-centric viewpoints through attractor dynamics in recurrent networks.
1. LSTM-based Gestalt Encoding
The core of attractor-based modeling in this paradigm is a long short-term memory (LSTM) architecture that predicts canonical 3D motion dynamics from input sequences of feature markers. At each time step $t$, the network receives $m$ body-marker features, each comprising a 3D position $p_i(t) \in \mathbb{R}^3$ and velocity $v_i(t) \in \mathbb{R}^3$. For $m$ markers, the input vector is $x_t = \big(p_1(t), v_1(t), \ldots, p_m(t), v_m(t)\big) \in \mathbb{R}^{6m}$.
The LSTM cell is instantiated with hidden state $h_t$ and cell state $c_t$ (silhouette task: $300$ units). The read-out layer $W_{\mathrm{out}}$, with bias $b_{\mathrm{out}}$, predicts the next frame, $\hat{x}_{t+1} = W_{\mathrm{out}} h_t + b_{\mathrm{out}}$; the LSTM update itself follows the standard gate equations. In closed-loop operation, the LSTM's state converges to a periodic orbit, a limit-cycle attractor, that corresponds to a learned Gestalt motion pattern (walking, spinning, etc.).
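As a concrete sketch, a minimal NumPy LSTM cell with a linear read-out can illustrate the closed-loop operation described above. The weights, sizes, and function names here are hypothetical placeholders; the paper's trained model and exact parameterization are not reproduced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMCell:
    """Minimal LSTM cell with a linear read-out predicting the next frame."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the input, forget, cell, output gates.
        self.W = rng.normal(0, 0.1, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.W_out = rng.normal(0, 0.1, (n_in, n_hidden))  # read-out layer
        self.b_out = np.zeros(n_in)
        self.n_hidden = n_hidden

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)                 # standard gate equations
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        x_pred = self.W_out @ h + self.b_out        # next-frame prediction
        return x_pred, h, c

def closed_loop(cell, x0, n_frames):
    """Feed the cell's own predictions back as input (no teacher forcing)."""
    h = np.zeros(cell.n_hidden)
    c = np.zeros(cell.n_hidden)
    x, traj = x0, []
    for _ in range(n_frames):
        x, h, c = cell.step(x, h, c)
        traj.append(x)
    return np.array(traj)
```

With trained weights, `closed_loop` is where the state would settle into a limit-cycle attractor; with the random weights above it merely demonstrates the feedback wiring.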
2. Retrospective Inference of Binding and Perspective
Online, the model infers latent bias variables at each frame: the binding activities $B$, the translational bias $b \in \mathbb{R}^3$, and the rotational bias $q$ (a unit quaternion). Retrospective inference minimizes the prediction error over a temporal window of length $T$ using the smooth-L1 (Huber) loss, $\mathcal{L}_{\mathrm{RI}} = \sum_{\tau = t-T}^{t} \ell_{\delta}(x_\tau - \hat{x}_\tau)$. Gradients of $\mathcal{L}_{\mathrm{RI}}$ with respect to each latent $\theta$ are computed via backpropagation-through-time and applied with momentum: $\Delta\theta \leftarrow \mu\,\Delta\theta - \eta\,\nabla_{\theta}\mathcal{L}_{\mathrm{RI}}$, then $\theta \leftarrow \theta + \Delta\theta$. The quaternion is normalized after each update to enforce $\|q\| = 1$. Sign-damping is optionally applied to stabilize gradients.
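The update loop can be sketched on a toy problem: inferring a rotational and translational bias that aligns observed points with targets. This is a simplified stand-in, with hypothetical names and finite-difference gradients in place of the paper's backpropagation-through-time; only the smooth-L1 loss, the momentum update, and the quaternion re-normalization mirror the description above.

```python
import numpy as np

def smooth_l1(err, delta=1.0):
    """Elementwise Huber / smooth-L1 loss, summed."""
    a = np.abs(err)
    return float(np.sum(np.where(a < delta, 0.5 * a**2 / delta, a - 0.5 * delta)))

def rotate(q, p):
    """Rotate point p by unit quaternion q = (w, x, y, z)."""
    w, v = q[0], q[1:]
    return p + 2.0 * np.cross(v, np.cross(v, p) + w * p)

def window_loss(q, b, points, targets):
    """Smooth-L1 prediction error accumulated over the temporal window."""
    return sum(smooth_l1(rotate(q, p) + b - t) for p, t in zip(points, targets))

def ri_step(q, b, vq, vb, points, targets, lr=0.01, mom=0.8, eps=1e-5):
    """One retrospective-inference step: finite-difference gradients of the
    windowed loss w.r.t. the latent biases (a stand-in for BPTT), a momentum
    update, then quaternion re-normalization."""
    gq, gb = np.zeros(4), np.zeros(3)
    for i in range(4):
        d = np.zeros(4); d[i] = eps
        gq[i] = (window_loss(q + d, b, points, targets)
                 - window_loss(q - d, b, points, targets)) / (2 * eps)
    for i in range(3):
        d = np.zeros(3); d[i] = eps
        gb[i] = (window_loss(q, b + d, points, targets)
                 - window_loss(q, b - d, points, targets)) / (2 * eps)
    vq = mom * vq - lr * gq          # momentum updates
    vb = mom * vb - lr * gb
    q, b = q + vq, b + vb
    q = q / np.linalg.norm(q)        # enforce ||q|| = 1
    return q, b, vq, vb
```

Iterating `ri_step` drives the latent rotation and translation toward values that explain the observations, the same error-minimization logic the model applies per frame.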
3. Mutual-Exclusive Softmax Feature Binding
The model binds observed features to canonical entities using a mutually-exclusive softmax scheme, enforcing near one-to-one assignments. Given raw binding logits $Z$, a row-wise softmax yields selection probabilities, $S^{\mathrm{row}}_{ij} = \exp(Z_{ij}/\tau) / \sum_k \exp(Z_{ik}/\tau)$, while a column-wise softmax discourages duplicate assignments, $S^{\mathrm{col}}_{ij} = \exp(Z_{ij}/\tau) / \sum_k \exp(Z_{kj}/\tau)$. The final binding assignment combines both: $B_{ij} \propto S^{\mathrm{row}}_{ij}\, S^{\mathrm{col}}_{ij}$. The temperature $\tau$ is annealed over time to sharpen assignments, and outcast features are channeled to a "reject" row for asymmetric binding.
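A minimal sketch of this binding scheme, assuming the row/column softmaxes are combined multiplicatively and row-renormalized (function names are illustrative, not from the paper):

```python
import numpy as np

def softmax(z, axis, temp):
    z = z / temp
    z = z - z.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mutual_exclusive_binding(logits, temp=1.0):
    """Combine row-wise selection with column-wise exclusion: each observed
    feature (row) prefers one canonical slot, while each slot (column)
    discourages being claimed by several features."""
    s_row = softmax(logits, axis=1, temp=temp)     # selection probabilities
    s_col = softmax(logits, axis=0, temp=temp)     # exclusion probabilities
    b = s_row * s_col                              # combined assignment
    return b / b.sum(axis=1, keepdims=True)        # renormalize rows
```

Annealing `temp` toward small values sharpens the soft assignment matrix toward a near-permutation, as described above.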
4. Attractor States and Dynamic Stability
Each trained motion pattern yields a distinct attractor, a limit cycle in the LSTM state-space $(h_t, c_t)$. In closed-loop mode, after teacher-forcing on the initial frames, the network's state converges to a periodic orbit, $(h_{t+P}, c_{t+P}) \approx (h_t, c_t)$ for some period $P$. Perturbations decay exponentially, $\|\delta_t\| \lesssim e^{-\lambda t}\,\|\delta_0\|$ with $\lambda > 0$, implying asymptotic convergence into the basin of a Gestalt attractor. Prediction error is minimized as the system settles onto the most plausible canonical interpretation.
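The stability notion can be demonstrated on a toy planar system whose unit circle is an attracting limit cycle (radial dynamics $\dot r = r(1 - r^2)$, constant angular speed). This is a generic illustration of limit-cycle convergence, not the LSTM's actual state dynamics.

```python
import numpy as np

def step(x, y, dt=0.01, omega=2.0):
    """Euler step of a planar system with a stable limit cycle on the
    unit circle: dr/dt = r(1 - r^2), dtheta/dt = omega."""
    r = np.hypot(x, y)
    th = np.arctan2(y, x)
    r = r + dt * r * (1.0 - r**2)
    th = th + dt * omega
    return r * np.cos(th), r * np.sin(th)

def radius_after(x, y, n):
    """Iterate the dynamics and report the final radius."""
    for _ in range(n):
        x, y = step(x, y)
    return np.hypot(x, y)
```

Starting either outside or inside the cycle, the radius converges to 1: perturbations away from the orbit decay, exactly the basin behavior attributed to the Gestalt attractors above.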
5. Training Paradigms and Hyper-parameters
Training proceeds offline with a mean-squared-error loss for LSTM fitting, $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{T}\sum_{t} \|x_{t+1} - \hat{x}_{t+1}\|^2$. The Adam optimizer is used (lr $= 0.01$), with Gaussian noise injected on the inputs (separate noise magnitudes for the walker and dancer tasks). Batches comprise 10 (walker) or 20 (dancer) consecutive frames; epochs: 2000 (walker), 500 (dancer); hidden units: 100 (walker), 300 (dancer).
Online retrospective inference optimizes the smooth-L1 loss over a horizon of $T$ frames, with learning rates and momentum terms taken from Table I in (Kaltenberger et al., 2022). The number of tuning cycles per inference step is task-specific (walker vs. silhouette), with sign-damping factors up to $0.95$.
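The offline training paradigm (MSE loss, Adam, input-noise injection) can be sketched as follows. To stay short and self-contained, the LSTM is swapped for a linear next-frame predictor and Adam is hand-rolled; names, sizes, and the noise level are illustrative assumptions.

```python
import numpy as np

def train_offline(frames, lr=0.01, sigma=0.05, epochs=500, seed=0):
    """Fit a linear next-frame predictor x_{t+1} ~ W x_t with MSE loss and a
    hand-rolled Adam update; Gaussian noise is injected on the inputs.
    (The paper's LSTM is replaced by a linear map for brevity.)"""
    rng = np.random.default_rng(seed)
    n = frames.shape[1]
    W = rng.normal(0, 0.1, (n, n))
    m = np.zeros_like(W); v = np.zeros_like(W)      # Adam moment estimates
    b1, b2, eps = 0.9, 0.999, 1e-8
    X, Y = frames[:-1], frames[1:]
    for t in range(1, epochs + 1):
        Xn = X + rng.normal(0, sigma, X.shape)      # input noise injection
        err = Xn @ W.T - Y                          # prediction residual
        g = 2.0 * err.T @ Xn / len(X)               # MSE gradient w.r.t. W
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        mh = m / (1 - b1**t); vh = v / (1 - b2**t)  # bias correction
        W -= lr * mh / (np.sqrt(vh) + eps)          # Adam step
    return W
```

Trained on a sequence generated by a fixed 2D rotation, the learned map recovers the underlying next-frame dynamics; the noise injection acts as a regularizer, as in the paper's training setup.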
6. Silhouette Illusion: Experimental Demonstration
To probe perceptual bistability, the system is trained on four Gestalt attractors, combining the un-mirrored dancer D and the mirrored dancer E with CCW and CW rotation directions. During inference, observation is partial: only the two image-plane coordinates are observed, while depth is treated as a latent variable:
- Initial 80 frames: the binding matrix $B$ is clamped and the perspective is fixed; the network settles into an attractor (e.g., D).
- After frame 80: $B$ is unclamped and retrospective inference proceeds; the attractor persists.
- At frame 200, a true depth cue for one feature (the left hand) is injected from the opposite Gestalt (E); the binding temperature $\tau$ is reset.
- Within 50 frames, the hidden state flips to the E attractor and the binding matrix reconfigures to the mirrored structure.
Read-outs monitor the feature-binding error (the deviation of $B$ from the correct assignment) and the prediction MSE, which spikes at the attractor switch and then stabilizes. Reconstruction of depth and rotation direction confirms ambiguity resolution and the CW $\leftrightarrow$ CCW flip.
7. Claims on Universality and Broader Applicability
The model’s mechanisms—temporal Gestalt encoding via RNN attractors, retrospective inference, and mutually-exclusive softmax binding—are proposed as general solutions for perceptual interpretation, conceptual event binding, and language (binding words to semantic roles). Connections are drawn to predictive coding and free-energy models: adaptation of latent biases approximates inferring causes by prediction-error minimization.
A plausible implication is extension to multi-object scenes, audio streams, and static bistable phenomena (e.g., Necker cube) via hybrid dynamic/static Gestalt modules. This suggests that attractor-based frame-level modeling can serve as a universal schema for real-time perceptual inference across domains (Kaltenberger et al., 2022).