
Attractor-Based Frame-Level Modeling

Updated 30 January 2026
  • Attractor-based frame-level modeling is a framework that binds spatiotemporal features into stable Gestalt states using recurrent neural dynamics.
  • The approach employs LSTM architectures with retrospective inference and mutually-exclusive softmax to dynamically bind features and predict canonical motions.
  • Experimental demonstrations, such as the silhouette illusion, validate its ability to resolve ambiguous sensory inputs by switching attractor states.

Attractor-based frame-level modeling is a computational framework for dynamic perceptual inference that combines recurrent neural architectures, feature-binding mechanisms, and latent bias adaptation to resolve ambiguous sensory inputs into stable, interpretable Gestalt states. This approach, exemplified by the model described in "Binding Dancers Into Attractors" (Kaltenberger et al., 2022), implements perception as an online process that binds spatiotemporal features into canonical entities and infers observer-centric viewpoints through attractor dynamics in recurrent networks.

1. LSTM-based Gestalt Encoding

The core of attractor-based modeling in this paradigm is a long short-term memory (LSTM) architecture that predicts canonical 3D motion dynamics from input sequences of feature markers. At each time step $t$, the network receives $N$ body-marker features, each comprising a 3D position $p^t_j \in \mathbb{R}^3$ and velocity $v^t_j \in \mathbb{R}^3$. For $N = 15$, the input vector is $x^t \in \mathbb{R}^{90}$.

The LSTM cell is instantiated with hidden state $h^t \in \mathbb{R}^{100}$ and cell state $c^t_{\text{cell}} \in \mathbb{R}^{100}$ (300 units for the silhouette task). A read-out layer $W^{\text{out}} \in \mathbb{R}^{90 \times 100}$ with bias $b^{\text{out}} \in \mathbb{R}^{90}$ predicts the next frame:

$$\hat{x}^{t+1} = W^{\text{out}} h^t + b^{\text{out}}$$

The LSTM update follows the standard equations. In closed-loop operation, the LSTM's state converges to a periodic orbit, a limit-cycle attractor, that corresponds to a learned Gestalt motion pattern (walking, spinning, etc.).
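As a minimal sketch of this architecture (random, untrained weights in plain numpy rather than a deep-learning framework; the gate layout and initialization here are illustrative assumptions, not the paper's trained parameters), the closed-loop prediction step can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: N = 15 markers, position + velocity -> 90 inputs,
# 100 hidden units (walker variant). All weights are random stand-ins.
D_IN, D_HID = 90, 100

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Standard LSTM gate weights, stacked as [input, forget, cell, output].
W = rng.normal(0, 0.1, (4 * D_HID, D_IN + D_HID))
b = np.zeros(4 * D_HID)

# Linear read-out \hat{x}^{t+1} = W_out h^t + b_out.
W_out = rng.normal(0, 0.1, (D_IN, D_HID))
b_out = np.zeros(D_IN)

def lstm_step(x, h, c):
    """One standard LSTM update followed by the next-frame read-out."""
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    x_pred = W_out @ h_new + b_out
    return x_pred, h_new, c_new

# Closed-loop operation: the prediction is fed back as the next input.
x = rng.normal(0, 1, D_IN)
h, c = np.zeros(D_HID), np.zeros(D_HID)
for _ in range(50):
    x, h, c = lstm_step(x, h, c)
```

With trained weights, iterating `lstm_step` on its own output is what lets the state settle onto a limit-cycle attractor.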

2. Retrospective Inference of Binding and Perspective

Online, the model infers latent bias variables at each frame: binding activities $A \in \mathbb{R}^{N \times N}$, a translational bias $c \in \mathbb{R}^3$, and a rotational bias $q$ (a unit quaternion). Retrospective inference minimizes the prediction error over a temporal window of length $H \geq 10$ using the smooth-L1 (Huber) loss:

$$\mathcal{L}_t(A, c, q) = \sum_{s = t-H+1}^{t} \ell\bigl(x^s, \hat{x}^s(A, c, q)\bigr)$$

Gradients of $\mathcal{L}_t$ with respect to each latent are computed via backpropagation through time and applied with momentum:

$$A \leftarrow A - \eta^b \frac{\partial \mathcal{L}}{\partial A}, \quad c \leftarrow c - \eta^t \frac{\partial \mathcal{L}}{\partial c}, \quad q \leftarrow q - \eta^r \frac{\partial \mathcal{L}}{\partial q}$$

The quaternion $q$ is normalized after each update to enforce $|q| = 1$. Sign-damping is optionally applied to stabilize gradients.
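A toy sketch of this update scheme, under heavy simplifications: the windowed smooth-L1 loss is stood in for by a Huber loss on the translational bias alone, the BPTT gradient is replaced by a finite-difference gradient, and the target value is hypothetical. Only the momentum update and quaternion renormalization mirror the description above.

```python
import numpy as np

def huber(r, delta=1.0):
    """Smooth-L1 (Huber) loss, summed over components."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta)).sum()

rng = np.random.default_rng(1)
target_c = np.array([0.5, -0.2, 1.0])   # hypothetical "true" translation

def loss(c):
    return huber(c - target_c)

def grad(f, x, eps=1e-5):
    """Central finite-difference gradient, a stand-in for BPTT."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x); e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Momentum update of the translational bias c.
c, v = np.zeros(3), np.zeros(3)
eta_t, beta = 0.1, 0.9
for _ in range(200):
    v = beta * v + grad(loss, c)
    c -= eta_t * v

# Rotational bias: renormalize after every update to keep |q| = 1.
q = rng.normal(size=4)
q /= np.linalg.norm(q)
```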

3. Mutually-Exclusive Softmax Feature Binding

The model binds observed features to canonical entities using a mutually-exclusive softmax scheme, enforcing near one-to-one assignments. Given raw binding logits $A = (a_{ij}) \in \mathbb{R}^{N \times N}$, a row-wise softmax yields selection probabilities

$$b^{rw}_{ij} = \frac{\exp(a_{ij}/\tau^{rw})}{\sum_{j'} \exp(a_{ij'}/\tau^{rw})},$$

while a column-wise softmax excludes duplicate assignments:

$$b^{cw}_{ij} = \frac{\exp(a_{ij}/\tau^{cw})}{\sum_{i'} \exp(a_{i'j}/\tau^{cw})}$$

The final binding assignment is their geometric mean:

$$b_{ij} = \sqrt{b^{rw}_{ij} \cdot b^{cw}_{ij}}$$

The temperature $\tau$ is annealed over time to sharpen assignments, with outcast features channeled to a "reject" row for asymmetric binding.
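The two-sided softmax combination can be sketched directly (a minimal numpy version of the formulas above; the example logits and temperature values are illustrative, not from the paper, and the "reject" row is omitted):

```python
import numpy as np

def mutual_exclusive_softmax(A, tau_rw=1.0, tau_cw=1.0):
    """Row-wise and column-wise softmax combined by a geometric mean."""
    def softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)
    b_rw = softmax(A / tau_rw, axis=1)   # each row selects one column
    b_cw = softmax(A / tau_cw, axis=0)   # each column admits one row
    return np.sqrt(b_rw * b_cw)

# Logits favouring the identity assignment; annealing tau sharpens it.
A = np.eye(4) * 3.0
B_soft = mutual_exclusive_softmax(A, tau_rw=1.0, tau_cw=1.0)
B_hard = mutual_exclusive_softmax(A, tau_rw=0.1, tau_cw=0.1)
```

Lowering the temperatures drives `B_hard` toward a hard one-to-one (here identity) assignment, which is the intended effect of the annealing schedule.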

4. Attractor States and Dynamic Stability

Each trained motion pattern yields a distinct attractor, a limit cycle in the LSTM state space $(h^t, c^t_{\text{cell}})$. In closed-loop mode, after teacher forcing on the initial $H_{\text{init}}$ frames, the network's state converges to a periodic orbit

$$h^{t+T} \approx h^t, \quad \hat{x}^{t+T} \approx \hat{x}^t$$

for some period $T$. Perturbations $\Delta h^0$ decay exponentially,

$$h^t - h^{t-1} \to 0,$$

implying asymptotic convergence into the basin of a Gestalt attractor. The prediction error decreases as the system settles onto the most plausible canonical interpretation.
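The exponential decay of a perturbation can be illustrated with a toy stand-in for the recurrent dynamics (a contracted 2D rotation converging to a fixed point rather than a limit cycle; the contraction factor and angle are arbitrary choices, not model parameters):

```python
import numpy as np

# Contracted rotation: perturbations shrink by a factor rho per step.
theta, rho = 0.3, 0.9
R = rho * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

h = np.array([1.0, 0.0])          # perturbed initial state Delta h^0
diffs = []                        # ||h^t - h^{t-1}|| over time
for _ in range(100):
    h_next = R @ h
    diffs.append(np.linalg.norm(h_next - h))
    h = h_next
```

The step-to-step difference `diffs` shrinks geometrically, mirroring the $h^t - h^{t-1} \to 0$ convergence criterion above.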

5. Training Paradigms and Hyper-parameters

Offline LSTM fitting uses a mean-squared-error loss:

$$\mathcal{L}_{\mathrm{train}} = \frac{1}{T} \sum_{t=1}^{T} \|\hat{x}^t - x^t\|^2$$

The Adam optimizer is used (learning rate 0.01), with uniform noise $U(-\delta, \delta)$ injected on the inputs ($\delta \approx 2 \cdot 10^{-5}$ for the walker, $10^{-4}$ for the dancer). Batches comprise 10 (walker) or 20 (dancer) consecutive frames; training runs for 2000 epochs (walker) or 500 (dancer), with 100 (walker) or 300 (dancer) hidden units.
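The quoted hyper-parameters and the two training-time operations (MSE loss, input-noise injection) can be collected in a small sketch; the config-dict layout and function names are my own, only the numbers come from the text:

```python
import numpy as np

# Hyper-parameters quoted in the text (walker / dancer variants).
CONFIG = {
    "walker": dict(hidden=100, batch_frames=10, epochs=2000, noise=2e-5),
    "dancer": dict(hidden=300, batch_frames=20, epochs=500,  noise=1e-4),
}

def train_loss(x_pred, x_true):
    """L_train: squared error summed per frame, averaged over T frames."""
    return np.mean(np.sum((x_pred - x_true) ** 2, axis=-1))

def add_input_noise(x, delta, rng):
    """Uniform noise injection U(-delta, delta) on the inputs."""
    return x + rng.uniform(-delta, delta, size=x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 90))                    # one walker-sized batch
x_noisy = add_input_noise(x, CONFIG["walker"]["noise"], rng)
loss = train_loss(x_noisy, x)                    # tiny, ~delta^2 scale
```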

Online retrospective inference optimizes the smooth-L1 loss over a horizon of $H = 10$, with learning rates and momentum values taken from Table I in (Kaltenberger et al., 2022). Tuning cycles per step are $cyc = 1$ (walker) and $cyc = 4$ (silhouette), with sign-damping $\alpha = 0.9$–$0.95$.

6. Silhouette Illusion: Experimental Demonstration

To probe perceptual bistability, the system is trained on four Gestalt attractors: $D^+$ (un-mirrored CCW rotation), $D^-$ (CW), $E^+$ (mirrored CCW), and $E^-$ (mirrored CW). During inference, only partial observations are available ($x$ and $y$ coordinates; depth $z$ is latent):

  • Initial 80 frames: the binding matrix $A$ is clamped and the perspective fixed, so the network settles into an attractor (e.g., $D^+$).
  • After 80 frames: $A$ is unfixed, retrospective inference proceeds, and the attractor persists.
  • At frame 200, a true depth cue ($z$) for one feature (the left hand) is injected from the opposite Gestalt ($E^-$), and the temperature $\tau$ is reset.
  • Within roughly 50 frames, the hidden state flips to the $E^-$ attractor and the binding matrix $A$ reconfigures to the mirrored structure.
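The protocol above is essentially an event schedule keyed on frame numbers; a sketch of such a schedule (frame numbers from the text, phase descriptions and the lookup helper hypothetical):

```python
# Timeline of the silhouette-illusion protocol; frame numbers are taken
# from the text, the phase strings and helper are illustrative only.
SCHEDULE = [
    (0,   "clamp binding matrix A and perspective; settle into D+"),
    (80,  "release A; retrospective inference runs, attractor persists"),
    (200, "inject true depth z for the left hand from E-; reset tau"),
    (250, "state has flipped to the E- attractor (~50 frames later)"),
]

def phase_at(frame):
    """Return the description of the latest phase active at a frame."""
    return max((f, desc) for f, desc in SCHEDULE if f <= frame)[1]
```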

Read-outs monitor the feature-binding error

$$\mathrm{FBE}(k) = \sum_j \sqrt{\sum_i \bigl(b_{ij}^k - b_{ij}^{\mathrm{opt}}\bigr)^2}$$

and the prediction MSE, which spikes at the attractor switch and then stabilizes. Reconstruction of depth and rotation direction confirms that the ambiguity is resolved and that the percept flips between CW and CCW rotation.
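The FBE read-out is a column-wise Euclidean distance between the current and the optimal binding matrix; a direct numpy transcription (the toy matrices below are illustrative):

```python
import numpy as np

def feature_binding_error(b, b_opt):
    """FBE = sum over columns j of sqrt(sum over rows i of (b_ij - b_ij^opt)^2)."""
    return np.sqrt(((b - b_opt) ** 2).sum(axis=0)).sum()

b_opt = np.eye(3)                    # optimal one-to-one binding
fbe_perfect = feature_binding_error(np.eye(3), b_opt)      # 0: correct binding
fbe_swapped = feature_binding_error(np.eye(3)[[1, 0, 2]], b_opt)  # two features swapped
```

A correct binding gives FBE = 0, while any mis-assignment (such as the two swapped features) produces a strictly positive error, which is what makes the read-out useful for detecting the attractor switch.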

7. Claims on Universality and Broader Applicability

The model’s mechanisms—temporal Gestalt encoding via RNN attractors, retrospective inference, and mutually-exclusive softmax binding—are proposed as general solutions for perceptual interpretation, conceptual event binding, and language (binding words to semantic roles). Connections are drawn to predictive coding and free-energy models: adaptation of latent biases approximates inferring causes by prediction-error minimization.

A plausible implication is extension to multi-object scenes, audio streams, and static bistable phenomena (e.g., Necker cube) via hybrid dynamic/static Gestalt modules. This suggests that attractor-based frame-level modeling can serve as a universal schema for real-time perceptual inference across domains (Kaltenberger et al., 2022).

References

  • Kaltenberger et al. (2022). "Binding Dancers Into Attractors."
