RigidFormer: Object-Centric Transformer for Dynamics

Updated 4 July 2026

RigidFormer is an object-centric Transformer that learns multi-object rigid dynamics from mesh-free point clouds with controllable integration step size.
It employs anchor-based geometric aggregation and differentiable Kabsch alignment to enforce rigidity and reduce error accumulation.
The architecture scales with the number of objects and generalizes across point resolutions, ensuring efficient long-horizon predictions.

RigidFormer is an object-centric Transformer architecture for learning multi-object rigid-body dynamics directly from mesh-free, object-segmented point clouds, with controllable integration step size $\Delta t$ and explicit rigidity enforcement through projection onto the rigid-body manifold (Dou et al., 9 May 2026). It is designed for regimes in which contact is discontinuous, long-horizon autoregressive rollout accumulates error, and mesh-based vertex/edge/facet message passing is either unavailable or computationally costly. The method advances each object through a compact anchor representation rather than dense vertex-level propagation, combines object-level attention with anchor-local geometric aggregation, and applies differentiable Kabsch alignment to guarantee rigid motion by construction. Reported properties include permutation-equivariant processing over objects, invariance to anchor reindexing, generalization to unseen point resolutions and across datasets, support for variable step sizes within a single model, and scalability to scenes with 200+ objects (Dou et al., 9 May 2026).

1. Problem setting and design objective

RigidFormer targets learned rigid-body simulation in settings where inputs are mesh-free point clouds rather than connected meshes (Dou et al., 9 May 2026). The motivating difficulty is twofold. First, many learned simulators assume mesh connectivity and operate through vertex-level message passing, which is problematic when only point clouds are available because connectivity is absent and visibility may be variable or partial. Second, as point resolution increases, dense vertex-level interaction becomes expensive, while contact discontinuities and long-horizon autoregressive rollout amplify error accumulation (Dou et al., 9 May 2026).

The model is formulated for $M$ rigid objects. Object $i$ at time $t$ is represented by positions $X_t^{(i)} \in \mathbb{R}^{N_i \times 3}$ together with per-vertex features $h_t^{(i)} \in \mathbb{R}^{N_i \times 12}$ formed by concatenating nearest-neighbor displacement to other objects or ground, per-step position increment $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$, reference offset $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$, and per-object physics parameters $\phi^{(i)} = [m, \mu, \epsilon]$ broadcast to vertices (Dou et al., 9 May 2026). A hierarchical PointNet-style encoder maps these per-vertex features to an object token $o_t^{(i)} \in \mathbb{R}^{D}$, and the sequence of object tokens becomes the substrate for interaction modeling.
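The 12-dimensional per-vertex feature can be illustrated with a minimal NumPy sketch. The function name `vertex_features`, the concatenation order, the direction of the nearest-neighbor displacement (vertex to nearest external point), and the brute-force neighbor search are all assumptions made for illustration; the source only states which four 3D quantities are concatenated.

```python
import numpy as np

def vertex_features(X_t, X_prev, X_ref, X_other, phi):
    """Build a 12-D per-vertex feature for one object (illustrative layout).

    X_t, X_prev, X_ref: (N, 3) current, previous, and reference positions.
    X_other: (P, 3) points from other objects (and ground samples, if any).
    phi: (3,) physics parameters [m, mu, eps].
    """
    n = X_t.shape[0]
    # Displacement from each vertex to its nearest external point (brute force).
    diff = X_other[None, :, :] - X_t[:, None, :]              # (N, P, 3)
    nearest = np.argmin(np.sum(diff ** 2, axis=-1), axis=1)   # (N,)
    d_nn = diff[np.arange(n), nearest]                        # (N, 3)
    v_t = X_t - X_prev                                        # per-step increment
    r_t = X_t - X_ref                                         # reference offset
    phi_b = np.broadcast_to(phi, (n, 3))                      # broadcast to vertices
    return np.concatenate([d_nn, v_t, r_t, phi_b], axis=1)    # (N, 12)
```

The brute-force search costs O(N·P) memory and time, so a real pipeline would likely substitute a KNN kernel for that step.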

A central design objective is to make computational cost scale with the number of objects rather than the number of points. This objective informs the object-level Transformer, sparse anchor state, and anchor-local pooling scheme. A plausible implication is that RigidFormer is intended not merely as a point-cloud replacement for mesh-based simulators, but as a reallocation of computation from dense geometric neighborhoods to low-dimensional rigid motion carriers.

2. Architectural organization

RigidFormer comprises an object-level Transformer, compact per-object anchors, Anchor-Vertex Pooling (AVP), Anchor-based rotary positional embedding (ARoPE), and rigid projection through differentiable Kabsch alignment (Dou et al., 9 May 2026). The architecture reasons at the object level and then refines motion through sparse anchors that summarize each object’s low-dimensional rigid state.

The object-level decoder operates on $M$ object tokens concatenated with $R$ learned register tokens. With $Z \in \mathbb{R}^{(M+R) \times D}$ denoting this input, each of the $L$ Transformer blocks applies residual self-attention, per-layer FiLM conditioning on step size, and a residual feed-forward network. For a head with queries $Q$, keys $K$, values $V$, and head dimension $d_h$, attention is the scaled dot-product form

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V$

Stability is improved through elementwise sigmoid gating,

$Y = \sigma\big(g(Z)\big) \odot \mathrm{Attn}(Q, K, V)$

where $g$ is a learned per-head MLP (Dou et al., 9 May 2026).
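As a rough illustration of scaled dot-product attention followed by an elementwise sigmoid gate, the following single-head NumPy sketch may help. The argument `gate_logits` stands in for the output of the learned per-head MLP; what that MLP takes as input is not specified here, so it is passed in directly as an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(Q, K, V, gate_logits):
    """Single-head attention with an elementwise sigmoid output gate.

    Q, K, V: (T, d) queries, keys, values for T tokens.
    gate_logits: (T, d) pre-sigmoid gate values (hypothetical MLP output).
    """
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d)) @ V       # scaled dot-product attention
    gate = 1.0 / (1.0 + np.exp(-gate_logits))      # sigmoid in (0, 1)
    return gate * attn                             # elementwise gating
```

With all gate logits at zero the output is exactly half the ungated attention, and with large positive logits the gate approaches the identity, which shows how a gate can attenuate individual channels of an attention read.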

Temporal discretization enters through FiLM. Let $s$ denote the conditioning input derived from the step size $\Delta t$. Then each layer applies, to its token features $Z$,

$Z \leftarrow \gamma(s) \odot Z + \beta(s)$

where $\gamma$ and $\beta$ are MLPs producing channel-wise scales and shifts (Dou et al., 9 May 2026). This mechanism allows a single model to support multiple step sizes, so increasing $\Delta t$ reduces the number of autoregressive updates over a fixed physical horizon, while decreasing $\Delta t$ provides finer temporal detail.
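A minimal sketch of FiLM-style conditioning and of the step-count trade-off follows. Single linear maps stand in for the scale and shift MLPs, and the names `film` and `num_rollout_steps`, the embedding `s`, and the convention that the step size is measured in frames are illustrative assumptions rather than details from the source.

```python
import math
import numpy as np

def film(tokens, s, W_gamma, b_gamma, W_beta, b_beta):
    """Apply channel-wise scale and shift conditioned on a step-size embedding.

    tokens: (T, D) token features; s: (C,) step-size embedding.
    W_gamma, W_beta: (C, D); b_gamma, b_beta: (D,). Linear maps replace the MLPs.
    """
    gamma = s @ W_gamma + b_gamma    # (D,) channel-wise scale
    beta = s @ W_beta + b_beta       # (D,) channel-wise shift
    return gamma * tokens + beta     # broadcast over the T tokens

def num_rollout_steps(horizon_frames, step):
    """Number of autoregressive updates needed to cover a fixed horizon."""
    return math.ceil(horizon_frames / step)
```

Under these assumptions a 100-frame horizon takes 100 updates at a step of 1 frame and 25 updates at a step of 4 frames, so there are fewer opportunities for rollout error to compound.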

For state advancement, each object uses $K$ anchors chosen by farthest-point sampling on the current point cloud. Anchors are represented by 3D positions $A_t^{(i)} \in \mathbb{R}^{K \times 3}$, tied to reference anchors $A_{\mathrm{ref}}^{(i)}$ for rigid alignment (Dou et al., 9 May 2026). Each anchor forms a query by concatenating anchor features, including AVP and ARoPE descriptors, then projecting to the model dimension. These anchor queries cross-attend to multi-scale decoder object-token features and predict per-anchor accelerations $\ddot{A}_t^{(i)} \in \mathbb{R}^{K \times 3}$.
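Farthest-point sampling itself is a standard greedy procedure, sketched below in NumPy. The function name, the fixed `start` index, and the use of indices as the return value are illustrative choices; the source does not say how the first anchor is picked.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices that are mutually far apart.

    points: (N, 3) point cloud; k: number of anchors; start: first index.
    """
    n = points.shape[0]
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(1, min(k, n)):
        nxt = int(np.argmax(min_dist))             # farthest from the chosen set
        chosen.append(nxt)
        d_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, d_new)     # update nearest-chosen distance
    return np.array(chosen)
```

Randomizing `start` is one simple way to obtain different anchor sets for the same object, which is relevant to the robustness-to-anchor-selection experiments mentioned later in the article.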

Component	Role	Reported property
Object-level Transformer	Models object-object interactions	Permutation-equivariant over objects
AVP	Injects local vertex features into anchors	Vertex-order-invariant
ARoPE	Injects sparse anchor geometry into attention	Invariant to anchor reindexing
Differentiable Kabsch	Projects updates to $SE(3)$	Guarantees rigidity by construction

The architectural hyperparameters reported for the main configuration are object token dimension $t$ 2, $t$ 3 heads with head dimension $t$ 4, SwiGLU feed-forward with $t$ 5 expansion, RMSNorm, dropout $t$ 6, AVP output dimension $t$ 7, and ARoPE descriptor dimension $t$ 8 (Dou et al., 9 May 2026).

3. Anchor mechanisms, geometric encoding, and rigid projection

AVP is the mechanism through which local contact-relevant geometry is injected into anchor descriptors without dense vertex-level attention (Dou et al., 9 May 2026). Around anchor $a_k$, it aggregates per-vertex encoder features $f_j$ by a normalized distance kernel,

$\bar{f}_k = \sum_{j} w_{kj}\, f_j, \qquad w_{kj} = \frac{\kappa_\sigma\big(\lVert a_k - x_j \rVert\big)}{\sum_{j'} \kappa_\sigma\big(\lVert a_k - x_{j'} \rVert\big)}$

with learned bandwidth $\sigma$ (Dou et al., 9 May 2026), where $x_j$ are the positions of the vertices pooled around the anchor and $\kappa_\sigma$ is the distance kernel. Because the normalization is symmetric, AVP is invariant to vertex reindexing, and because the weights depend only on distances, the weights are invariant under common rigid transforms of anchor and vertex coordinates.
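The two invariances claimed for AVP can be checked numerically with a small sketch. The Gaussian kernel used here is an assumption (the exact kernel form is not given), as is pooling over all vertices rather than a local neighborhood; `sigma` plays the role of the learned bandwidth.

```python
import numpy as np

def anchor_vertex_pooling(anchors, vertices, feats, sigma):
    """Pool per-vertex features onto anchors with normalized distance weights.

    anchors: (K, 3); vertices: (N, 3); feats: (N, C); sigma: bandwidth > 0.
    Returns (K, C) pooled anchor features. Gaussian kernel is an assumption.
    """
    diff = anchors[:, None, :] - vertices[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)                 # (K, N) squared distances
    logits = -d2 / (2.0 * sigma ** 2)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)            # each anchor's weights sum to 1
    return w @ feats
```

Permuting the vertices (with their features) or applying the same rotation and translation to anchors and vertices leaves the pooled output unchanged, because the weights depend only on anchor-vertex distances and the sum is order-independent.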

ARoPE supplies geometry-aware positional structure for attention while preserving set symmetries. For object $i$ with $K$ anchor positions $\{a_k^{(i)}\}_{k=1}^{K}$, a shared 3D rotary map $\rho$ produces per-anchor rotary features, and the per-object descriptor is the mean

$p^{(i)} = \frac{1}{K} \sum_{k=1}^{K} \rho\big(a_k^{(i)}\big)$

Mean pooling makes the descriptor invariant to anchor reindexing (Dou et al., 9 May 2026). Queries and keys are then split into rotary and pass-through parts and transformed by standard RoPE-style even-odd rotations using the ARoPE descriptors. Because no sequence-index embeddings are used, the decoder remains permutation-equivariant over objects.

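After acceleration prediction, anchor positions $A_t^{(i)}$ are integrated with Verlet using the predicted accelerations $\ddot{A}_t^{(i)}$,

$\tilde{A}_{t+1}^{(i)} = 2 A_t^{(i)} - A_{t-1}^{(i)} + \Delta t^2\, \ddot{A}_t^{(i)}$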
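For reference, a position-Verlet update can be written in a few lines. This sketch assumes the textbook form with an explicit squared step size; a learned simulator may instead express accelerations in per-step units, in which case `dt` would be 1. The function name and argument names are illustrative.

```python
import numpy as np

def verlet_step(a_t, a_prev, acc, dt=1.0):
    """Position-Verlet update for anchor positions.

    a_t, a_prev: (K, 3) anchor positions at the current and previous step.
    acc: (K, 3) acceleration at the current step; dt: step size.
    """
    return 2.0 * a_t - a_prev + acc * dt ** 2
```

For motion under constant acceleration this update reproduces the exact trajectory, which makes a convenient sanity check for any implementation.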

The resulting positions $\tilde{A}_{t+1}^{(i)}$ are then projected onto the rigid-body manifold by solving

$(R^\star, t^\star) = \arg\min_{R \in SO(3),\ t \in \mathbb{R}^3} \sum_{k=1}^{K} \big\lVert R\, a_{\mathrm{ref},k} + t - \tilde{a}_k \big\rVert^2$

where $a_{\mathrm{ref},k}$ and $\tilde{a}_k$ are the $k$-th reference and integrated anchors with centroids $c_{\mathrm{ref}}$ and $\tilde{c}$, using closed-form Kabsch alignment based on centroids, centered anchor sets, covariance $H = \sum_{k} (a_{\mathrm{ref},k} - c_{\mathrm{ref}})(\tilde{a}_k - \tilde{c})^\top$, SVD $H = U \Sigma V^\top$, and the corrected rotation

$R^\star = V\, \mathrm{diag}\big(1, 1, \det(V U^\top)\big)\, U^\top$

with translation $t^\star = \tilde{c} - R^\star c_{\mathrm{ref}}$ (Dou et al., 9 May 2026). Gradients are implemented in a RoMa-style formulation for robustness near degenerate singular values, and reflections are suppressed through $\det(V U^\top)$ to ensure $\det R^\star = +1$.

The rigid transform $(R^\star, t^\star)$ is broadcast to all reference vertices,

$\hat{X}_{t+1}^{(i)} = X_{\mathrm{ref}}^{(i)} R^{\star\top} + \mathbf{1}\, t^{\star\top}$

which preserves all intra-object distances exactly (Dou et al., 9 May 2026). This is the mechanism behind the claim of rigidity by construction and is presented as a principal source of long-horizon stability.
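The centroid, covariance, SVD, and reflection-correction steps described above correspond to the classical Kabsch procedure, sketched here in plain NumPy. This version is not differentiable and does not include the RoMa-style gradient handling mentioned in the text; the function names `kabsch` and `apply_rigid` are illustrative.

```python
import numpy as np

def kabsch(ref, cur):
    """Least-squares rigid transform (R, t) such that cur ~= ref @ R.T + t.

    ref, cur: (K, 3) corresponding reference and integrated anchors.
    """
    c_ref, c_cur = ref.mean(axis=0), cur.mean(axis=0)     # centroids
    H = (ref - c_ref).T @ (cur - c_cur)                   # 3x3 covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                # -1 flags a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T               # proper rotation, det = +1
    t = c_cur - R @ c_ref
    return R, t

def apply_rigid(X_ref, R, t):
    """Broadcast the rigid transform to all reference vertices, shape (N, 3)."""
    return X_ref @ R.T + t
```

Because every output vertex is a rotation plus translation of the same reference shape, pairwise distances within the object are preserved regardless of how noisy the integrated anchors were.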

4. Symmetry structure and contact modeling

RigidFormer is explicitly organized around symmetry constraints at several levels (Dou et al., 9 May 2026). AVP is vertex-order-invariant because it is a symmetric normalized pooling operator over vertices. ARoPE is anchor-order-invariant because its descriptor is the mean of per-anchor rotary encodings. The object-level decoder is permutation-equivariant over objects: if $P$ is a permutation matrix over objects and $\Pi$ is its extension acting jointly on object and register tokens, then for a token matrix $Z$, self-attention without sequence-index embeddings satisfies

$\mathrm{SA}(\Pi Z) = \Pi\, \mathrm{SA}(Z)$

and RMSNorm, FiLM, gated attention, and the feed-forward sublayers commute with the same permutation action (Dou et al., 9 May 2026).
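Permutation equivariance of self-attention without positional indices is easy to verify numerically. The sketch below uses a single head with random projection matrices; the helper name `equivariance_gap` is hypothetical and simply reports the largest deviation between permuting before and after the layer.

```python
import numpy as np

def self_attention(Z, Wq, Wk, Wv):
    """Single-head self-attention with no positional or index embeddings."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    S = S - S.max(axis=1, keepdims=True)        # numerical stability
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)        # row-wise softmax
    return A @ V

def equivariance_gap(Z, perm, Wq, Wk, Wv):
    """Max abs difference between SA(permuted Z) and permuted SA(Z)."""
    lhs = self_attention(Z[perm], Wq, Wk, Wv)
    rhs = self_attention(Z, Wq, Wk, Wv)[perm]
    return float(np.max(np.abs(lhs - rhs)))
```

The gap is zero up to floating-point error for any permutation, whereas adding an index-dependent embedding to `Z` before the layer would break the equality.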

These formal properties are not incidental. They define the model’s treatment of scene elements as sets rather than ordered sequences, which is important in mesh-free rigid-body simulation because point sets, anchors, and objects lack a canonical indexing. A plausible implication is that the architecture avoids the need to learn spurious order-dependent conventions that would otherwise degrade cross-scene transfer.

Contact is learned rather than solved analytically. The reported contact cues are per-vertex proximity features, local geometry injected via AVP around anchors near contact regions, cross-object attention at both the object-token and anchor-query levels, and geometry-aware attention modulated by ARoPE (Dou et al., 9 May 2026). No explicit complementarity solver or collision penalty is used. Instead, contact effects are represented through learned attention over geometry-aware features, while rigid projection removes intra-object shear and drift at every step. Gated attention is reported to attenuate spurious reads and improve long-horizon stability.

A common misconception would be to interpret RigidFormer as a purely geometric contact detector. The model instead combines geometric descriptors, dynamics surrogates such as per-step position increments, and object-level physical parameters $\phi^{(i)}$ when available (Dou et al., 9 May 2026). Another potential misconception is that rigid projection itself models contact; in the reported formulation it enforces body consistency after learned interaction prediction rather than replacing interaction inference.

5. Training protocol and optimization

Training supervision is applied at the anchor level, both before and after rigid projection, using Smooth L1 penalties on position and acceleration (Dou et al., 9 May 2026). The total loss is

$v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 0

with $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 1 and $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 2 (Dou et al., 9 May 2026). Ground-truth acceleration is defined under the same Verlet discretization,

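$\ddot{A}_t^{\mathrm{gt}} = \frac{A_{t+1}^{\mathrm{gt}} - 2 A_t^{\mathrm{gt}} + A_{t-1}^{\mathrm{gt}}}{\Delta t^2}$

where $A_t^{\mathrm{gt}}$ denotes the ground-truth anchor positions at step $t$. Vertices are supervised indirectly through the rigid transform rather than by direct vertex-wise rollout loss.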
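The two ingredients of the anchor-level supervision, a Smooth L1 penalty and a finite-difference acceleration target, can be sketched as follows. The transition threshold `beta=1.0` mirrors a common library default and is an assumption, as is the explicit division by the squared step size; the loss weights from the source are not reproduced here.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-like) loss averaged over all elements.

    Quadratic for |error| < beta and linear beyond it.
    """
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())

def verlet_acceleration(a_next, a_t, a_prev, dt=1.0):
    """Second finite difference of positions, the inverse of a Verlet update."""
    return (a_next - 2.0 * a_t + a_prev) / dt ** 2
```

Defining the target by the same discretization as the integrator means that a perfect acceleration prediction reproduces the ground-truth next position exactly after one update.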

The optimization configuration reported for the main MOVi experiments is 300 epochs with AdamW, $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 4, $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 5, weight decay $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 6, base learning rate $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 7, 10-epoch linear warmup from $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 8 the base learning rate, cosine decay to $v_t^{(i)} = X_t^{(i)} - X_{t-1}^{(i)}$ 9, and gradient-norm clipping at $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 0 (Dou et al., 9 May 2026). The sequence length is $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 1, and $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 2 is sampled from $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 3. No curriculum or scheduled sampling was required in the reported MOVi configuration, and the batch size is 18 per process.

The practical guidance reported for use mirrors the empirical findings. It recommends $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 4 as a strong default, larger $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 5 values such as $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 6 or $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 7 for long-horizon prediction and planning, mixed- $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 8 training with FiLM conditioning, inclusion of nearest-neighbor displacement features for contact awareness, and random $r_t^{(i)} = X_t^{(i)} - X_{\mathrm{ref}}^{(i)}$ 9-axis rotations together with random object permutations as data augmentation (Dou et al., 9 May 2026). The text also notes that optimization effort should focus on KNN kernels because they are the primary runtime bottleneck.

6. Empirical performance, efficiency, and extensions

The main experiments are reported on MOVi-A, MOVi-B, and MOVi-Sphere, with mesh-free point inputs and evaluation by center-of-mass translation RMSE in meters and orientation RMSE in degrees via quaternion geodesic distance (Dou et al., 9 May 2026). Predictions are autoregressive, and different step sizes are mapped to physical frames for fair comparison.
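The two evaluation quantities can be illustrated with a short sketch. The exact averaging conventions of the benchmark are not given here, so the RMSE below (root of the mean squared Euclidean center-of-mass error over objects) and the use of the absolute quaternion dot product to handle the sign ambiguity are assumptions; the function names are hypothetical.

```python
import numpy as np

def translation_rmse(pred, gt):
    """RMSE of center-of-mass positions, in the units of the inputs (meters).

    pred, gt: (M, 3) predicted and ground-truth centers of mass.
    """
    sq_err = np.sum((pred - gt) ** 2, axis=1)
    return float(np.sqrt(np.mean(sq_err)))

def quat_geodesic_deg(q1, q2):
    """Rotation angle in degrees between unit quaternions q1 and q2.

    Works for arrays of shape (..., 4); q and -q are treated as the same rotation.
    """
    dot = np.abs(np.sum(q1 * q2, axis=-1))
    dot = np.clip(dot, 0.0, 1.0)                # guard against rounding above 1
    return np.degrees(2.0 * np.arccos(dot))
```

Taking the absolute value of the dot product matters because a unit quaternion and its negation encode the same orientation, and without it a correct prediction could be scored as a large error.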

On MOVi-B at 100 frames with step size $\phi^{(i)} = [m, \mu, \epsilon]$ 0, RigidFormer reports $\phi^{(i)} = [m, \mu, \epsilon]$ 1 versus HopNet’s $\phi^{(i)} = [m, \mu, \epsilon]$ 2, while on MOVi-A, MOVi-B, and MOVi-Sphere it attains the best orientation error in all reported columns and the best or second-best translation error in most columns (Dou et al., 9 May 2026). Relative to SDF-Sim at $\phi^{(i)} = [m, \mu, \epsilon]$ 3, the reported RigidFormer results are $\phi^{(i)} = [m, \mu, \epsilon]$ 4 at step size $\phi^{(i)} = [m, \mu, \epsilon]$ 5 and $\phi^{(i)} = [m, \mu, \epsilon]$ 6 at step size $\phi^{(i)} = [m, \mu, \epsilon]$ 7, without SDF prelearning (Dou et al., 9 May 2026).

A central empirical claim concerns variable step size. On MOVi-B at 100 frames, larger steps reduce long-horizon error: step $\phi^{(i)} = [m, \mu, \epsilon]$ 8 yields $\phi^{(i)} = [m, \mu, \epsilon]$ 9, step $o_t^{(i)} \in \mathbb{R}^{D}$ 0 yields $o_t^{(i)} \in \mathbb{R}^{D}$ 1, and step $o_t^{(i)} \in \mathbb{R}^{D}$ 2 yields $o_t^{(i)} \in \mathbb{R}^{D}$ 3 (Dou et al., 9 May 2026). Point-resolution generalization is also reported: training samples point counts $o_t^{(i)} \in \mathbb{R}^{D}$ 4, while testing at 768 points produces stable rollouts with 100-frame MOVi-B errors of $o_t^{(i)} \in \mathbb{R}^{D}$ 5, $o_t^{(i)} \in \mathbb{R}^{D}$ 6, and $o_t^{(i)} \in \mathbb{R}^{D}$ 7 for step sizes $o_t^{(i)} \in \mathbb{R}^{D}$ 8, $o_t^{(i)} \in \mathbb{R}^{D}$ 9, and $M$ 00, respectively (Dou et al., 9 May 2026). Cross-dataset transfer is reported to consistently improve over FIGNet and remain competitive with HopNet in matched step-size- $M$ 01 settings, while larger steps further reduce long-horizon errors. With 25% per-object points masked at test time and no retraining, rollouts are reported to remain stable with accurate contacts (Dou et al., 9 May 2026).

The computational comparison emphasizes the difference between vertex-level and object-level scaling. For a MOVi-B scene with $M$ 02, one vertex-level attention layer would require approximately $M$ 03 GFLOPs just for the $M$ 04 term, whereas object-level self-attention over $M$ 05 objects plus 16 registers has complexity $M$ 06 and for $M$ 07 costs approximately $M$ 08 MFLOPs per layer for the quadratic term, a reported $M$ 09 reduction before projections and feed-forward layers (Dou et al., 9 May 2026). Measured runtime on an RTX 5080 is $M$ 10 ms/step, or $M$ 11 FPS, compared with FIGNet at $M$ 12 ms/step ( $M$ 13 FPS) and HopNet at $M$ 14 ms/step ( $M$ 15 FPS); differentiable Kabsch adds approximately $M$ 16 ms/step, and the main bottleneck is CUDA KNN for proximity and AVP (Dou et al., 9 May 2026).

Scalability is demonstrated on WreckingBall scenes with 64, 125, and 216 cubes plus a projectile, showing stable simulation at approximately 20 FPS (Dou et al., 9 May 2026). A preliminary articulated extension treats body parts as interacting object-level components and FiLM-conditions the model on heading commands. On ASE humanoid and Unitree G1 steering, the reported 100-step errors are $M$ 17 and $M$ 18, respectively (Dou et al., 9 May 2026).

7. Ablations, limitations, and research significance

The ablation results attribute measurable gains to several design choices (Dou et al., 9 May 2026). ARoPE is reported to yield the best or tied-best position error in most cells and the best orientation error in most cells when compared with sinusoidal, learned absolute positional encoding, OBB/PCA, and SE(3) variants. Gated attention improves long-horizon position error, with an example at 100 steps and step size $M$ 19 changing from $M$ 20 to $M$ 21. Differentiable Kabsch reduces both long-horizon position and orientation errors relative to non-differentiable alignment. For anchor count, $M$ 22 is identified as balancing accuracy and efficiency, while $M$ 23 can sometimes reduce rotation error but is less favorable in the translation-cost trade-off; randomized FPS anchors during training and evaluation show robustness to anchor selection (Dou et al., 9 May 2026).

The stated limitations are specific. RigidFormer requires object segmentation to group points per object. Contact is learned from data and does not use an explicit complementarity solver; although rigid projection stabilizes motion, sharp or rare contact regimes may benefit from hybrid analytic-learned corrections or contact-aware losses. Severe partial observations can challenge geometry inference. SVD gradients can become ill-conditioned near degeneracies, and robust layers such as RoMa mitigate but do not eliminate all edge cases (Dou et al., 9 May 2026). Future work suggested in the reported text includes stronger occlusions, sensor noise, online segmentation, mixed rigid-deformable scenes, and adaptive time stepping.

The broader significance claimed for the method is that compact, geometry-aware object anchoring together with rigid manifold projection provides a strong inductive bias for mesh-free rigid dynamics (Dou et al., 9 May 2026). This suggests a shift from dense vertex graphs toward object-level attention and anchor-level updates as a way to preserve contact fidelity while improving speed and scalability. The reported variable-$\Delta t$ conditioning also suggests a practical bridge between coarse long-horizon prediction and fine temporal refinement, and the preliminary articulated results indicate a possible path toward unified mesh-free models of more complex rigid dynamics.


References (1)

RigidFormer: Learning Rigid Dynamics using Transformers (2026)
