RigidFormer: Learning Rigid Dynamics using Transformers

Published 9 May 2026 in cs.CV, cs.AI, cs.GR, cs.LG, and cs.RO | (2605.09196v1)

Abstract: Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations, therefore, remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.


Authors (6)

Summary

The paper introduces a mesh-free neural simulator that leverages an object-centric Transformer to efficiently predict multi-object rigid-body dynamics.
It employs innovative techniques such as anchor-based state encoding, ARoPE for geometry-aware attention, and differentiable SE(3) projection to enforce rigidity.
Experimental results highlight lower orientation RMSE, improved computational speed (23.9 FPS), and robustness even with partial observations and large-scale scenes.

RigidFormer: Transformer-Based Mesh-Free Rigid-Body Dynamics Prediction

Introduction

"RigidFormer: Learning Rigid Dynamics using Transformers" (2605.09196) introduces a novel, mesh-free neural simulator for multi-object rigid-body dynamics, formulated around the Transformer architecture. Conventional learned simulators generally rely on mesh connectivity and vertex-level message passing, incurring high computational overhead and necessitating mesh-based geometric inputs. RigidFormer shifts the paradigm by reasoning at the object level using point clouds and anchor-based representations, thereby achieving computational efficiency, explicit rigidity enforcement, and strong generalization properties. This essay provides a detailed technical overview of the methodology, experimental validation, and implications of RigidFormer, emphasizing its object-centric approach, geometry-aware attention with ARoPE, and robust long-horizon prediction via differentiable rigid projection.

Methodology

Pipeline and Object-Centric Transformer Architecture

RigidFormer operates on sequences of point clouds, where each rigid object is represented by variable-density points with optional control signals and temporal discretization input. The pipeline comprises four main stages:

Encoding: Each object's point cloud is encoded into a single object token via a hierarchical PointNet-based network, aggregating multi-scale geometry and physical parameters (mass, friction, restitution).
Object-Level Transformer: Object tokens are processed by a multi-layer Transformer with permutation-equivariant attention, modeling direct object-to-object interactions (object-level self-attention) with optional FiLM-style temporal conditioning.
Anchor-Based State Advancement: For each object, $N_a$ anchors are selected via farthest point sampling (FPS) to summarize the object's state. Compact anchor features, enriched with local context through Anchor-Vertex Pooling (AVP), are used as queries into the object tokens. Anchor dynamics are advanced by a learnable acceleration predictor followed by Verlet integration.
Rigidity Enforcement: Anchor updates are rigidly projected onto the SE(3) manifold via differentiable Kabsch alignment. The resulting transform updates all object points, ensuring exact intra-object rigidity.
Figure 1: RigidFormer pipeline, highlighting object-to-token encoding, anchor selection, object- and anchor-level attention, and rigid body update via differentiable SE(3) projection.
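The anchor selection and integration steps in the pipeline above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `farthest_point_sampling` and `verlet_step` are hypothetical, the position-Verlet form is an assumption (the summary does not say which Verlet variant is used), and a constant gravity vector stands in for the learned acceleration predictor.

```python
import numpy as np

def farthest_point_sampling(points, n_anchors, start=0):
    """Greedy FPS: return indices of n_anchors mutually distant points."""
    points = np.asarray(points, dtype=float)
    chosen = [start]
    # Distance from every point to its nearest already-chosen anchor.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_anchors - 1):
        nxt = int(np.argmax(min_dist))  # farthest point from the current set
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(chosen)

def verlet_step(x_curr, x_prev, accel, dt):
    """Position Verlet (assumed variant): x_next = 2*x_curr - x_prev + accel*dt^2."""
    return 2.0 * x_curr - x_prev + accel * dt ** 2

# Toy usage on one object's point cloud.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3))
idx = farthest_point_sampling(cloud, n_anchors=4)
anchors = cloud[idx]

# Placeholder acceleration; in the model this would come from the predictor.
gravity = np.array([0.0, 0.0, -9.81])
dt = 0.01
# Passing the same positions as x_prev corresponds to zero initial velocity.
next_anchors = verlet_step(anchors, anchors, gravity, dt)
```

Because `dt` appears explicitly in the update, a step-size-conditioned predictor can in principle be rolled out with different `dt` values, which is the controllability the summary describes.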

This architecture reduces interaction costs from $O((MN_v)^2)$ in vertex-level models to $O((MN_a)^2)$, with a typical setup using $M \leq 200$ and $N_a = 4$, resulting in substantial efficiency gains.
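To make the scaling argument concrete, the following short calculation compares the two pairwise-interaction counts. The values $M = 200$ and $N_a = 4$ come from the text; the per-object vertex count `N_v = 1000` is a hypothetical figure chosen only for illustration, and `pairwise_interactions` is an illustrative helper, not part of the paper.

```python
def pairwise_interactions(num_objects, units_per_object):
    """Number of ordered pairs among all units in the scene: (M * N)^2."""
    n = num_objects * units_per_object
    return n * n

M = 200      # objects (from the text)
N_a = 4      # anchors per object (from the text)
N_v = 1000   # vertices per object (hypothetical, for illustration only)

anchor_cost = pairwise_interactions(M, N_a)   # (200 * 4)^2
vertex_cost = pairwise_interactions(M, N_v)   # (200 * 1000)^2
ratio = vertex_cost // anchor_cost            # equals (N_v / N_a)^2

print(anchor_cost, vertex_cost, ratio)
```

Under these assumed numbers the pair count drops by a factor of $(N_v / N_a)^2$, independent of the number of objects.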

Geometry-Aware Attention: Anchor-based RoPE (ARoPE)

Positional encoding is addressed via Anchor-based Rotary Positional Embedding (ARoPE), which encodes the spatial extent and geometry of each object using a mean-pooled rotary mapping over its set of anchors. This encoding preserves permutation-equivariance over objects and invariance to anchor reindexing, while being efficient and generalizable across arbitrary shape collections and object counts. ARoPE is integrated into the attention mechanism, modulating geometry-encoded queries and keys.
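The summary does not give the exact ARoPE formula, so the sketch below only illustrates the general idea of a mean-pooled rotary descriptor under stated assumptions: each anchor coordinate is mapped to unit complex phases at a few frequencies, and the phases are averaged over anchors. The function names, the frequency schedule, and the way the descriptor is applied to a feature vector are all hypothetical choices.

```python
import numpy as np

def anchor_rope_descriptor(anchors, n_freqs=4, base=100.0):
    """Mean-pooled rotary descriptor for one object (illustrative only).

    anchors: (N_a, 3) anchor positions.
    Returns a complex vector of shape (3 * n_freqs,).
    """
    anchors = np.asarray(anchors, dtype=float)
    freqs = base ** (-np.arange(n_freqs) / n_freqs)          # (n_freqs,)
    angles = anchors[:, :, None] * freqs[None, None, :]      # (N_a, 3, n_freqs)
    phases = np.exp(1j * angles).reshape(len(anchors), -1)   # unit complex numbers
    # Averaging over anchors removes any dependence on anchor order.
    return phases.mean(axis=0)

def apply_descriptor(features, descriptor):
    """Multiply consecutive feature pairs, viewed as complex numbers, by the descriptor."""
    pairs = features[0::2] + 1j * features[1::2]
    out = pairs * descriptor
    result = np.empty_like(features)
    result[0::2] = out.real
    result[1::2] = out.imag
    return result

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 3))
desc = anchor_rope_descriptor(anchors)

# A query (or key) vector with 2 * 3 * n_freqs = 24 channels.
q = rng.normal(size=24)
q_mod = apply_descriptor(q, desc)

# Reordering the anchors leaves the descriptor unchanged.
perm = np.array([2, 0, 3, 1])
desc_perm = anchor_rope_descriptor(anchors[perm])
```

In this sketch each descriptor entry is a mean of unit complex numbers, so its magnitude is at most 1 and drops below 1 when the anchors' phases differ. That gives the descriptor some sensitivity to how spread out the anchors are, in addition to where they sit.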

Local Geometry Injection and Rigid Projection

Contact phenomena depend on localized surface interactions, so each anchor, although representing global object motion, aggregates local geometric features from nearby vertices using AVP. This ensures contact geometry is not lost when compressing object state. After predicting anchor updates, rigid transformations are recovered via differentiable SVD-based Kabsch alignment, guaranteeing SE(3) consistency and allowing stable long-horizon rollouts.
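The rigid projection step can be sketched with the standard SVD-based Kabsch algorithm. This is a generic NumPy version written for readability, not the paper's code: NumPy itself is not differentiable, and a trainable variant would use the equivalent SVD routine of an autodiff framework. The function name `kabsch` and the toy data are illustrative.

```python
import numpy as np

def kabsch(source, target):
    """Best-fit rotation R and translation t such that target ~ source @ R.T + t."""
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    H = (source - src_c).T @ (target - tgt_c)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (a rotation, not a reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t

# Toy setup: a known rigid motion plus small per-anchor noise, standing in
# for the not-quite-rigid anchor updates a network might predict.
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.3])

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))      # all points of one object
anchors = points[:4]                    # its anchors
predicted = anchors @ R_true.T + t_true + 0.01 * rng.normal(size=(4, 3))

R, t = kabsch(anchors, predicted)
new_points = points @ R.T + t           # one rigid transform for every point
```

Since every point of the object is moved by the same rotation and translation, pairwise distances within the object are preserved regardless of how noisy the individual anchor predictions were.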

Figure 2: Qualitative results: mesh overlays show the accuracy of point-based RigidFormer rollouts; the model uses only point clouds as direct input.

Experimental Evaluation

Benchmarks and Baselines

Evaluations are conducted on the MOVi benchmarks (MOVi-A, MOVi-B, MOVi-Sphere), featuring varying object complexities and shapes. RigidFormer is compared against mesh-based baselines (HopNet, FIGNet, MGN), point-based methods (VPD), and transformer-based models (HCMT).

Numerical Results and Ablations

On all benchmarks, RigidFormer achieves the lowest orientation RMSE and competitive translation RMSE compared to mesh-based counterparts, despite using only point representations. For example, on MOVi-B at 100 frames, HopNet achieves $0.176\,\mathrm{m}/17.91^\circ$ (translation/orientation RMSE), while RigidFormer improves to $0.161\,\mathrm{m}/15.33^\circ$, highlighting the benefit of object-level modeling.
Computational speed: RigidFormer delivers $23.9$ FPS, an 8–100× improvement over baselines.
Generalization: Robust performance is demonstrated under variable point cloud resolutions, step sizes, and across unseen datasets, with quality preserved in scenes containing 200+ objects.
Step size conditioning: Larger integration steps consistently reduce long-horizon error, because fewer autoregressive steps are needed to cover the same horizon and less error compounds.
Partial observations: RigidFormer maintains accurate prediction and contact modeling even with 25% of points randomly missing at inference.
Figure 3: Rollouts on partial point clouds. RigidFormer remains robust, with stable contacts and minimal drift even when 25% of points are masked at test time.
Ablations: The architecture is robust to variations in anchor count and to random anchor selection, and it benefits consistently from ARoPE over OBB/PCA/SE(3)-based encoding. Gated attention and differentiable Kabsch alignment both improve long-horizon stability and accuracy.
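The translation and orientation RMSE figures quoted above can be understood through a small sketch of how such metrics are commonly computed. The exact definitions used by the benchmarks are not given in this summary, so the geodesic rotation angle and the aggregation over objects shown here are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def rotation_angle_deg(R_pred, R_true):
    """Geodesic angle in degrees between two rotation matrices."""
    cos = (np.trace(R_pred.T @ R_true) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def rmse(errors):
    """Root-mean-square of a collection of per-object errors."""
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the toy example)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# Toy example with two objects at one time step.
pos_pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pos_true = np.array([[0.3, 0.0, 0.0], [1.0, 0.4, 0.0]])
translation_errors = np.linalg.norm(pos_pred - pos_true, axis=1)   # metres

orientation_errors = [
    rotation_angle_deg(rot_z(10.0), rot_z(25.0)),   # 15 degrees apart
    rotation_angle_deg(rot_z(0.0), rot_z(0.0)),     # identical orientations
]

translation_rmse = rmse(translation_errors)
orientation_rmse = rmse(orientation_errors)
```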

Scalability, Controllability, and Articulated Bodies

RigidFormer is validated on large-scale scenes (up to 216 cubes) and direction-conditioned articulated bodies (humanoid, Unitree G1) by treating each body part as an interacting object. The model follows control commands while maintaining coherent part-level dynamics, indicating extensibility beyond independent rigid bodies.

Figure 4: Left: Stable simulation as object count increases to 216. Right: Articulated agent following heading commands; arrows indicate target directions.

Theoretical and Practical Implications

RigidFormer demonstrates that object-centric, anchor-based Transformer architectures can overcome the mesh dependency and computational cost of conventional learned rigid-body simulators while operating on point inputs. The key implications include:

Inductive Bias: Enforcing SE(3) rigidity via differentiable Kabsch, and structuring interactions at the object level, matches the physics of rigid-body dynamics, yielding sample-efficient, robust, scalable models.
Geometry Generalization: ARoPE provides permutation- and shape-consistent position encoding, supporting arbitrary, unordered shape collections and variable object numbers.
Computational Scaling: Interaction cost grows quadratically with object count rather than vertex count, which keeps simulation tractable for scenes with 200+ objects and favors deployment in real-time or large-scene domains.
Versatility: The same model architecture accommodates partial observations and basic articulated body conditioning without any mesh input or per-object tuning.

Potential future developments include extending to scenes with ambiguous perceptual segmentation, handling mixed rigid–deformable objects, and integrating uncertainty quantification for downstream control tasks.

Conclusion

RigidFormer (2605.09196) constitutes a substantive step forward in mesh-free rigid-body dynamics learning. By unifying object-centric perception, anchor-based dynamic summarization, robust geometric attention, and explicit rigidity projection, the method achieves highly efficient, generalizable, and accurate simulation on realistic multi-object scenes. Its scalability, generalization, and mesh-free flexibility set a new standard for learned physical dynamics simulation and open avenues for further research into generalized world modeling, mixed-material systems, and neural-physical hybrid planning.

Figure 5: Strong qualitative results on partial observations, diverse MOVi datasets, and articulated agent scenarios, demonstrating robustness and versatility.
