
RigidFormer: Object-Centric Transformer for Dynamics

Updated 4 July 2026
  • RigidFormer is an object-centric Transformer that learns multi-object rigid dynamics from mesh-free point clouds with controllable integration step size.
  • It employs anchor-based geometric aggregation and differentiable Kabsch alignment to enforce rigidity and reduce error accumulation.
  • The architecture scales with the number of objects and generalizes across point resolutions, ensuring efficient long-horizon predictions.

RigidFormer is an object-centric Transformer architecture for learning multi-object rigid-body dynamics directly from mesh-free, object-segmented point clouds, with controllable integration step size $\Delta t$ and explicit rigidity enforcement through projection onto the rigid-body manifold (Dou et al., 9 May 2026). It is designed for regimes in which contact is discontinuous, long-horizon autoregressive rollout accumulates error, and mesh-based vertex/edge/facet message passing is either unavailable or computationally costly. The method advances each object through a compact anchor representation rather than dense vertex-level propagation, combines object-level attention with anchor-local geometric aggregation, and applies differentiable Kabsch alignment to guarantee rigid motion by construction. Reported properties include permutation-equivariant processing over objects, invariance to anchor reindexing, generalization to unseen point resolutions and across datasets, support for variable step sizes within a single model, and scalability to scenes with 200+ objects (Dou et al., 9 May 2026).

1. Problem setting and design objective

RigidFormer targets learned rigid-body simulation in settings where inputs are mesh-free point clouds rather than connected meshes (Dou et al., 9 May 2026). The motivating difficulty is twofold. First, many learned simulators assume mesh connectivity and operate through vertex-level message passing, which is problematic when only point clouds are available because connectivity is absent and visibility may be variable or partial. Second, as point resolution increases, dense vertex-level interaction becomes expensive, while contact discontinuities and long-horizon autoregressive rollout amplify error accumulation (Dou et al., 9 May 2026).

The model is formulated for $M$ rigid objects. Object $i$ at time $t$ is represented by positions $X_t^{(i)} \in \mathbb{R}^{N_i \times 3}$ together with per-vertex features $h_t^{(i)} \in \mathbb{R}^{N_i \times 12}$ formed by concatenating the nearest-neighbor displacement to other objects or the ground, the per-step position increment $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$, the reference offset $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$, and the per-object physics parameters $\phi^{(i)} = [m, \mu, \epsilon]$ broadcast to vertices (Dou et al., 9 May 2026). A hierarchical PointNet-style encoder maps these per-vertex features to an object token $o_t^{(i)} \in \mathbb{R}^{D}$, and the sequence of object tokens becomes the substrate for interaction modeling.

A central design objective is to make computational cost scale with the number of objects rather than the number of points. This objective informs the object-level Transformer, sparse anchor state, and anchor-local pooling scheme. A plausible implication is that RigidFormer is intended not merely as a point-cloud replacement for mesh-based simulators, but as a reallocation of computation from dense geometric neighborhoods to low-dimensional rigid motion carriers.

2. Architectural organization

RigidFormer comprises an object-level Transformer, compact per-object anchors, Anchor-Vertex Pooling (AVP), Anchor-based rotary positional embedding (ARoPE), and rigid projection through differentiable Kabsch alignment (Dou et al., 9 May 2026). The architecture reasons at the object level and then refines motion through sparse anchors that summarize each object’s low-dimensional rigid state.

The object-level decoder operates on the $M$ object tokens concatenated with a set of learned register tokens. Each Transformer block applies residual self-attention, per-layer FiLM conditioning on step size, and a residual feed-forward network to this combined token sequence. For a head with queries $Q$, keys $K$, and values $V$ of head dimension $d_h$, attention is the scaled dot-product form

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_h}}\right) V$$

Stability is improved through elementwise sigmoid gating,

MM8

where MM9 is a learned per-head MLP (Dou et al., 9 May 2026).

Temporal discretization enters through FiLM. Let ii0 with ii1. Then each layer applies

ii2

where the channel-wise scales and shifts are produced by MLPs (Dou et al., 9 May 2026). This mechanism allows a single model to support multiple step sizes: increasing $\Delta t$ reduces the number of autoregressive updates over a fixed physical horizon, while decreasing $\Delta t$ provides finer temporal detail.
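A minimal NumPy sketch of step-size conditioning, assuming the standard FiLM form in which features are scaled and shifted channel-wise. The step-size embedding, the single linear maps standing in for the scale and shift MLPs, the identity-centered scale, and all names are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np

def film(tokens, dt, w_gamma, w_beta):
    """Apply a channel-wise scale and shift conditioned on the step size dt.

    tokens: (T, D) token features.
    w_gamma, w_beta: (3, D) linear maps standing in for the scale/shift MLPs
    (an assumption made to keep the sketch short).
    """
    # Hypothetical step-size embedding: a few fixed features of log(dt).
    e = np.array([1.0, np.log(dt), np.log(dt) ** 2])
    gamma = 1.0 + e @ w_gamma      # per-channel scale, centered on identity (assumption)
    beta = e @ w_beta              # per-channel shift
    return tokens * gamma + beta   # broadcast over the T tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
w_gamma = 0.1 * rng.normal(size=(3, 8))
w_beta = 0.1 * rng.normal(size=(3, 8))

out_dt1 = film(tokens, 1.0, w_gamma, w_beta)  # same weights, small step
out_dt4 = film(tokens, 4.0, w_gamma, w_beta)  # same weights, larger step
```

The same weights serve every step size; only the conditioning input changes, which is what lets one model be rolled out at several values of the step size.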

For state advancement, each object uses a set of anchors chosen by farthest-point sampling on the current point cloud. Anchors are represented by 3D positions and are tied to reference anchors for rigid alignment (Dou et al., 9 May 2026). Each anchor forms a query by concatenating anchor features, including AVP and ARoPE descriptors, then projecting to the model dimension. These anchor queries cross-attend to multi-scale decoder object-token features and predict per-anchor accelerations.
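Farthest-point sampling can be sketched as a greedy loop that repeatedly picks the point farthest from everything chosen so far. This is the textbook algorithm rather than the paper's code; the function name, the fixed start index, and the toy cube are assumptions for illustration.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    n = points.shape[0]
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(1, min(k, n)):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        d = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return np.array(chosen)

# Toy object: the eight corners of a unit cube plus its centre.
cube = np.array(
    [[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
    + [[0.5, 0.5, 0.5]]
)
anchor_idx = farthest_point_sampling(cube, k=4)
anchors = cube[anchor_idx]
```

Randomizing `start` gives a different but similarly well-spread anchor set.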

| Component | Role | Reported property |
|---|---|---|
| Object-level Transformer | Models object-object interactions | Permutation-equivariant over objects |
| AVP | Injects local vertex features into anchors | Vertex-order-invariant |
| ARoPE | Injects sparse anchor geometry into attention | Invariant to anchor reindexing |
| Differentiable Kabsch | Projects updates onto the rigid-body manifold | Guarantees rigidity by construction |

The architectural hyperparameters reported for the main configuration are object token dimension tt2, tt3 heads with head dimension tt4, SwiGLU feed-forward with tt5 expansion, RMSNorm, dropout tt6, AVP output dimension tt7, and ARoPE descriptor dimension tt8 (Dou et al., 9 May 2026).

3. Anchor mechanisms, geometric encoding, and rigid projection

AVP is the mechanism through which local contact-relevant geometry is injected into anchor descriptors without dense vertex-level attention (Dou et al., 9 May 2026). Around anchor tt9, it aggregates per-vertex encoder features Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}0 by a normalized distance kernel,

Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}1

with learned bandwidth Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}2 (Dou et al., 9 May 2026). Because the normalization is symmetric, AVP is invariant to vertex reindexing, and because the weights depend only on distances, the weights are invariant under common rigid transforms of anchor and vertex coordinates.
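The pooling step and its two invariances can be illustrated with a short NumPy sketch. The source states only that the kernel is a normalized distance kernel with a learned bandwidth; the Gaussian form used here, the fixed `sigma`, and the function name are assumptions.

```python
import numpy as np

def anchor_vertex_pooling(anchors, vertices, feats, sigma):
    """Pool per-vertex features onto anchors with normalized distance weights.

    anchors: (K, 3), vertices: (N, 3), feats: (N, C), sigma: kernel bandwidth.
    The Gaussian kernel is an assumed choice of distance kernel.
    """
    d2 = ((anchors[:, None, :] - vertices[None, :, :]) ** 2).sum(-1)  # (K, N)
    logits = -d2 / (2.0 * sigma ** 2)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)                 # weights sum to 1 per anchor
    return w @ feats                                     # (K, C)

rng = np.random.default_rng(1)
anchors = rng.normal(size=(4, 3))
vertices = rng.normal(size=(50, 3))
feats = rng.normal(size=(50, 6))
pooled = anchor_vertex_pooling(anchors, vertices, feats, sigma=0.5)

# Reordering the vertices (with their features) leaves the result unchanged.
perm = rng.permutation(50)
pooled_perm = anchor_vertex_pooling(anchors, vertices[perm], feats[perm], sigma=0.5)

# Applying one rigid transform to anchors and vertices leaves the weights unchanged.
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
shift = np.array([1.0, -2.0, 0.5])
pooled_rigid = anchor_vertex_pooling(anchors @ R.T + shift, vertices @ R.T + shift, feats, sigma=0.5)
```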

ARoPE supplies geometry-aware positional structure for attention while preserving set symmetries. For object Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}3 with anchor positions Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}4, a shared 3D rotary map Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}5 produces per-anchor rotary features, and the per-object descriptor is the mean

Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}6

Mean pooling makes the descriptor invariant to anchor reindexing (Dou et al., 9 May 2026). Queries and keys are then split into rotary and pass-through parts and transformed by standard RoPE-style even-odd rotations using the ARoPE descriptors. Because no sequence-index embeddings are used, the decoder remains permutation-equivariant over objects.

After acceleration prediction, anchors are integrated with Verlet,

Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}7
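For reference, the standard position (Störmer) Verlet update has the form below. The symbols $a_t$ for an anchor position, $\hat{a}_{t+1}$ for its unprojected prediction, and $\ddot{a}_t$ for the predicted acceleration are notation assumed here, and this summary does not say whether the model absorbs the $\Delta t^{2}$ factor into the predicted quantity.

```latex
\hat{a}_{t+1} = 2\,a_t - a_{t-1} + \ddot{a}_t\,\Delta t^{2}
```

Since $a_t - a_{t-1}$ is the per-step position increment, the update can equivalently be read as the current position, plus the last increment, plus an acceleration correction.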

The resulting positions are then projected onto the rigid-body manifold by solving

Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}8

using closed-form Kabsch alignment based on centroids, centered anchor sets, covariance Xt(i)RNi×3X_t^{(i)} \in \mathbb{R}^{N_i \times 3}9, SVD ht(i)RNi×12h_t^{(i)} \in \mathbb{R}^{N_i \times 12}0, and the corrected rotation

ht(i)RNi×12h_t^{(i)} \in \mathbb{R}^{N_i \times 12}1

with translation ht(i)RNi×12h_t^{(i)} \in \mathbb{R}^{N_i \times 12}2 (Dou et al., 9 May 2026). Gradients are implemented in a RoMa-style formulation for robustness near degenerate singular values, and reflections are suppressed through ht(i)RNi×12h_t^{(i)} \in \mathbb{R}^{N_i \times 12}3 to ensure ht(i)RNi×12h_t^{(i)} \in \mathbb{R}^{N_i \times 12}4.

The rigid transform is broadcast to all reference vertices,

ht(i)RNi×12h_t^{(i)} \in \mathbb{R}^{N_i \times 12}5

which preserves all intra-object distances exactly (Dou et al., 9 May 2026). This is the mechanism behind the claim of rigidity by construction and is presented as a principal source of long-horizon stability.
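The projection and broadcast steps can be sketched in NumPy using the textbook Kabsch algorithm. This shows only the forward computation, not the differentiable RoMa-style gradient handling, and the function and variable names, the toy data, and the noise level are assumptions.

```python
import numpy as np

def kabsch(ref, cur):
    """Return rotation R and translation t minimising sum_k ||R @ ref_k + t - cur_k||^2."""
    ref_c, cur_c = ref.mean(axis=0), cur.mean(axis=0)
    H = (ref - ref_c).T @ (cur - cur_c)            # 3x3 covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if the unconstrained optimum would be a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # proper rotation, det(R) = +1
    t = cur_c - R @ ref_c
    return R, t

rng = np.random.default_rng(0)
ref_anchors = rng.normal(size=(8, 3))
ref_vertices = rng.normal(size=(100, 3))

# Stand-in for integrated anchors: a rigid motion plus small non-rigid noise.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
pred_anchors = ref_anchors @ R_true.T + t_true + 0.01 * rng.normal(size=(8, 3))

R, t = kabsch(ref_anchors, pred_anchors)
new_vertices = ref_vertices @ R.T + t              # broadcast the rigid transform to all vertices
```

Because every vertex is moved by the same rotation and translation, pairwise distances within the object are unchanged no matter how noisy the anchor predictions were.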

4. Symmetry structure and contact modeling

RigidFormer is explicitly organized around symmetry constraints at several levels (Dou et al., 9 May 2026). AVP is vertex-order-invariant because it is a symmetric normalized pooling operator over vertices. ARoPE is anchor-order-invariant because its descriptor is the mean of per-anchor rotary encodings. The object-level decoder is permutation-equivariant over objects: if $P$ is a permutation matrix over objects and $\tilde{P}$ is the corresponding permutation acting jointly on object and register tokens, then for the stacked token matrix $Z$, self-attention without sequence-index embeddings satisfies

$$\mathrm{SelfAttn}(\tilde{P} Z) = \tilde{P}\,\mathrm{SelfAttn}(Z)$$

and RMSNorm, FiLM, gated attention, and the feed-forward sublayers commute with the same permutation action (Dou et al., 9 May 2026).
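The equivariance property can be checked numerically on a single attention head. The sketch below is plain scaled dot-product self-attention with random weights and no positional embedding; the gating, FiLM, and register tokens of the actual model are omitted, and all names are illustrative.

```python
import numpy as np

def self_attention(Z, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention without positional embeddings."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 8))            # six object tokens of width 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))

perm = rng.permutation(6)
permute_then_attend = self_attention(Z[perm], Wq, Wk, Wv)
attend_then_permute = self_attention(Z, Wq, Wk, Wv)[perm]
```

The two results agree to floating-point precision: reordering the objects reorders the outputs in the same way and changes nothing else.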

These formal properties are not incidental. They define the model’s treatment of scene elements as sets rather than ordered sequences, which is important in mesh-free rigid-body simulation because point sets, anchors, and objects lack a canonical indexing. A plausible implication is that the architecture avoids the need to learn spurious order-dependent conventions that would otherwise degrade cross-scene transfer.

Contact is learned rather than solved analytically. The reported contact cues are per-vertex proximity features, local geometry injected via AVP around anchors near contact regions, cross-object attention at both the object-token and anchor-query levels, and geometry-aware attention modulated by ARoPE (Dou et al., 9 May 2026). No explicit complementarity solver or collision penalty is used. Instead, contact effects are represented through learned attention over geometry-aware features, while rigid projection removes intra-object shear and drift at every step. Gated attention is reported to attenuate spurious reads and improve long-horizon stability.

A common misconception would be to interpret RigidFormer as a purely geometric contact detector. The model instead combines geometric descriptors, dynamics surrogates such as per-step position increments, and object-level physical parameters $\phi^{(i)}$ when available (Dou et al., 9 May 2026). Another potential misconception is that rigid projection itself models contact; in the reported formulation it enforces body consistency after learned interaction prediction rather than replacing interaction inference.

5. Training protocol and optimization

Training supervision is applied at the anchor level, both before and after rigid projection, using Smooth L1 penalties on position and acceleration (Dou et al., 9 May 2026). The total loss is

vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}0

with vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}1 and vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}2 (Dou et al., 9 May 2026). Ground-truth acceleration is defined under the same Verlet discretization,

vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}3

Vertices are supervised indirectly through the rigid transform rather than by direct vertex-wise rollout loss.
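The two ingredients of the anchor-level supervision can be sketched as follows, assuming the usual second-difference form of a Verlet-consistent acceleration and the common definition of Smooth L1. The threshold `beta`, the function names, and the free-fall example are assumptions; the loss weights from the paper are not reproduced here.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones (beta is assumed)."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()

def verlet_acceleration(x_prev, x_cur, x_next, dt):
    """Acceleration implied by three consecutive positions under position Verlet."""
    return (x_next - 2.0 * x_cur + x_prev) / dt ** 2

# Free fall sampled at dt = 0.1: the second difference recovers -g.
g, dt = 9.81, 0.1
height = lambda time: -0.5 * g * time ** 2
acc = verlet_acceleration(height(0.0), height(dt), height(2 * dt), dt)

small_error = smooth_l1(np.array([0.5]), np.array([0.0]))  # quadratic regime
large_error = smooth_l1(np.array([2.0]), np.array([0.0]))  # linear regime
```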

The optimization configuration reported for the main MOVi experiments is 300 epochs with AdamW, vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}4, vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}5, weight decay vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}6, base learning rate vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}7, 10-epoch linear warmup from vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}8 the base learning rate, cosine decay to vt(i)=Xt(i)Xt1(i)v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}9, and gradient-norm clipping at rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}0 (Dou et al., 9 May 2026). The sequence length is rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}1, and rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}2 is sampled from rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}3. No curriculum or scheduled sampling was required in the reported MOVi configuration, and the batch size is 18 per process.
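A linear-warmup, cosine-decay schedule of the kind described can be written as a small function. The 300 epochs and 10-epoch warmup follow the text; the base learning rate, the warmup starting fraction, and the final learning rate used below are placeholder values, not the paper's settings.

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=300, warmup_epochs=10,
                warmup_start_frac=0.1, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay.

    warmup_start_frac and min_lr are placeholders, not reported values.
    """
    if epoch < warmup_epochs:
        start = warmup_start_frac * base_lr
        return start + (base_lr - start) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

base_lr = 1e-3  # hypothetical value for illustration
schedule = [lr_at_epoch(e, base_lr) for e in range(301)]
```

The schedule rises linearly to `base_lr` at epoch 10 and then decays monotonically to `min_lr` at epoch 300.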

The practical guidance reported for use mirrors the empirical findings. It recommends rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}4 as a strong default, larger rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}5 values such as rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}6 or rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}7 for long-horizon prediction and planning, mixed-rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}8 training with FiLM conditioning, inclusion of nearest-neighbor displacement features for contact awareness, and random rt(i)=Xt(i)Xref(i)r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}9-axis rotations together with random object permutations as data augmentation (Dou et al., 9 May 2026). The text also notes that optimization effort should focus on KNN kernels because they are the primary runtime bottleneck.

6. Empirical performance, efficiency, and extensions

The main experiments are reported on MOVi-A, MOVi-B, and MOVi-Sphere, with mesh-free point inputs and evaluation by center-of-mass translation RMSE in meters and orientation RMSE in degrees via quaternion geodesic distance (Dou et al., 9 May 2026). Predictions are autoregressive, and different step sizes are mapped to physical frames for fair comparison.
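The per-object orientation error can be sketched as the geodesic angle between two unit quaternions. The scalar-first `(w, x, y, z)` convention and the function name are assumptions, and how the paper aggregates per-object angles into an RMSE is not shown.

```python
import math

def quat_geodesic_deg(q1, q2):
    """Rotation angle in degrees between two unit quaternions given as (w, x, y, z)."""
    # abs() treats q and -q as the same rotation; min() guards against rounding above 1.
    dot = min(1.0, abs(sum(a * b for a, b in zip(q1, q2))))
    return math.degrees(2.0 * math.acos(dot))

identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))  # 90 degrees about z
negated = tuple(-c for c in quarter_turn_z)

angle = quat_geodesic_deg(identity, quarter_turn_z)
same_rotation = quat_geodesic_deg(quarter_turn_z, negated)
```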

On MOVi-B at 100 frames with step size ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]0, RigidFormer reports ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]1 versus HopNet’s ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]2, while on MOVi-A, MOVi-B, and MOVi-Sphere it attains the best orientation error in all reported columns and the best or second-best translation error in most columns (Dou et al., 9 May 2026). Relative to SDF-Sim at ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]3, the reported RigidFormer results are ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]4 at step size ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]5 and ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]6 at step size ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]7, without SDF prelearning (Dou et al., 9 May 2026).

A central empirical claim concerns variable step size. On MOVi-B at 100 frames, larger steps reduce long-horizon error: step ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]8 yields ϕ(i)=[m,μ,ϵ]\phi^{(i)} = [m, \mu, \epsilon]9, step ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}0 yields ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}1, and step ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}2 yields ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}3 (Dou et al., 9 May 2026). Point-resolution generalization is also reported: training samples point counts ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}4, while testing at 768 points produces stable rollouts with 100-frame MOVi-B errors of ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}5, ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}6, and ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}7 for step sizes ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}8, ot(i)RDo_t^{(i)} \in \mathbb{R}^{D}9, and MM00, respectively (Dou et al., 9 May 2026). Cross-dataset transfer is reported to consistently improve over FIGNet and remain competitive with HopNet in matched step-size-MM01 settings, while larger steps further reduce long-horizon errors. With 25% per-object points masked at test time and no retraining, rollouts are reported to remain stable with accurate contacts (Dou et al., 9 May 2026).

The computational comparison emphasizes the difference between vertex-level and object-level scaling. For a MOVi-B scene with MM02, one vertex-level attention layer would require approximately MM03 GFLOPs just for the MM04 term, whereas object-level self-attention over MM05 objects plus 16 registers has complexity MM06 and for MM07 costs approximately MM08 MFLOPs per layer for the quadratic term, a reported MM09 reduction before projections and feed-forward layers (Dou et al., 9 May 2026). Measured runtime on an RTX 5080 is MM10 ms/step, or MM11 FPS, compared with FIGNet at MM12 ms/step (MM13 FPS) and HopNet at MM14 ms/step (MM15 FPS); differentiable Kabsch adds approximately MM16 ms/step, and the main bottleneck is CUDA KNN for proximity and AVP (Dou et al., 9 May 2026).
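The scaling argument can be reproduced with back-of-the-envelope arithmetic for the quadratic attention term. The object count, points per object, and model width below are hypothetical values chosen for illustration, not the paper's scene statistics; only the 16 register tokens come from the text. FLOPs are counted as two per multiply-add, which is also an assumed convention.

```python
def attention_quadratic_flops(n_tokens, dim):
    """FLOPs of Q K^T plus the attention-weighted sum of V, at 2 FLOPs per multiply-add."""
    return 4 * n_tokens ** 2 * dim

num_objects = 10          # hypothetical
points_per_object = 1000  # hypothetical
num_registers = 16        # from the text
dim = 256                 # hypothetical

vertex_level = attention_quadratic_flops(num_objects * points_per_object, dim)
object_level = attention_quadratic_flops(num_objects + num_registers, dim)
reduction = vertex_level / object_level  # equals (total points / (objects + registers)) ** 2
```

The width cancels in the ratio, so the reduction depends only on the squared ratio of token counts.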

Scalability is demonstrated on WreckingBall scenes with 64, 125, and 216 cubes plus a projectile, showing stable simulation at approximately 20 FPS (Dou et al., 9 May 2026). A preliminary articulated extension treats body parts as interacting object-level components and FiLM-conditions the model on heading commands. On ASE humanoid and Unitree G1 steering, the reported 100-step errors are MM17 and MM18, respectively (Dou et al., 9 May 2026).

7. Ablations, limitations, and research significance

The ablation results attribute measurable gains to several design choices (Dou et al., 9 May 2026). ARoPE is reported to yield the best or tied-best position error in most cells and the best orientation error in most cells when compared with sinusoidal, learned absolute positional encoding, OBB/PCA, and SE(3) variants. Gated attention improves long-horizon position error, with an example at 100 steps and step size MM19 changing from MM20 to MM21. Differentiable Kabsch reduces both long-horizon position and orientation errors relative to non-differentiable alignment. For anchor count, MM22 is identified as balancing accuracy and efficiency, while MM23 can sometimes reduce rotation error but is less favorable in the translation-cost trade-off; randomized FPS anchors during training and evaluation show robustness to anchor selection (Dou et al., 9 May 2026).

The stated limitations are specific. RigidFormer requires object segmentation to group points per object. Contact is learned from data and does not use an explicit complementarity solver; although rigid projection stabilizes motion, sharp or rare contact regimes may benefit from hybrid analytic-learned corrections or contact-aware losses. Severe partial observations can challenge geometry inference. SVD gradients can become ill-conditioned near degeneracies, and robust layers such as RoMa mitigate but do not eliminate all edge cases (Dou et al., 9 May 2026). Future work suggested in the reported text includes stronger occlusions, sensor noise, online segmentation, mixed rigid-deformable scenes, and adaptive time stepping.

The broader significance claimed for the method is that compact, geometry-aware object anchoring together with rigid manifold projection provides a strong inductive bias for mesh-free rigid dynamics (Dou et al., 9 May 2026). This suggests a shift from dense vertex graphs toward object-level attention and anchor-level updates as a way to preserve contact fidelity while improving speed and scalability. The reported variable-$\Delta t$ conditioning also suggests a practical bridge between coarse long-horizon prediction and fine temporal refinement, and the preliminary articulated results indicate a possible path toward unified mesh-free models of more complex rigid dynamics.
