
A Total Lagrangian Finite Element Framework for Multibody Dynamics: Part II -- GPU Implementation and Numerical Experiments

Published 11 Apr 2026 in cs.CE | (2604.10357v1)

Abstract: We present the numerical methods and GPU-accelerated implementation underlying a Total Lagrangian finite element framework for finite-deformation flexible multibody dynamics, introduced in the companion paper [1]. The framework supports 10-node quadratic tetrahedral (T10) elements and ANCF beam and shell elements, with quadrature-based hyperelastic response (St. Venant-Kirchhoff and Mooney-Rivlin) and an optional Kelvin-Voigt viscous stress contribution. Time stepping employs a velocity-based implicit backward-Euler scheme, yielding a nonlinear residual in velocity that couples inertia, internal and external forces, and bilateral constraints. Constraints are enforced via an augmented Lagrangian method (ALM), structured as an outer loop alternating an inner velocity solve with a dual-ascent multiplier update. We introduce a two-stage GPU parallelization strategy for internal force and tangent stiffness evaluation, and provide two inner solvers: a first-order AdamW optimizer and a second-order Newton solver that assembles and factorizes a sparse global Hessian on the GPU using cuDSS. A fixed-sparsity matrix strategy eliminates repeated symbolic analysis and enables efficient numerical refactorization across Newton iterations. For collision detection, we present a GPU-native two-thread asynchronous algorithm operating on triangle soups, avoiding bounding-volume hierarchies entirely. Systematic scaling benchmarks across all three supported element types and six mesh resolutions show that the Newton solver achieves approximately one order of magnitude reduction in real-time factor relative to CPU baselines at the largest resolutions tested. The frictional contact model is validated against closed-form rigid-body predictions through quasi-static and dynamic impact unit tests.

Summary

  • The paper introduces a GPU implementation that drastically reduces simulation time, achieving roughly an order-of-magnitude reduction in real-time factor on systems with nearly 10^6 DOFs.
  • It employs a dual-solver approach, using both first-order AdamW and Newton methods, to robustly handle large-deformation and complex contact dynamics.
  • The framework’s innovative fixed-sparsity assembly and bin-based collision handling ensure scalability and accuracy in flexible multibody dynamics.

GPU-Accelerated Total Lagrangian FEM for Flexible Multibody Dynamics

Introduction and Context

This paper presents a comprehensive GPU implementation of a Total Lagrangian finite element framework for large-deformation, flexible multibody dynamics, as introduced in the companion paper. The framework extends to quadratic 10-node tetrahedral (T10) elements and ANCF (Absolute Nodal Coordinate Formulation) beam and shell elements, providing hyperelastic response via St. Venant–Kirchhoff and Mooney–Rivlin models. The simulation pipeline includes velocity-level implicit time integration, bilateral constraints enforced via an augmented Lagrangian method, and collision/contact handling with frictional dynamics. The principal focus is the translation of this theoretically grounded framework into a performant GPU-resident codebase with novel parallelization, assembly, and contact-handling strategies, aimed at real-time or faster execution for substantial problem sizes (2604.10357).

Numerical Formulation and Solver Architecture

The time integration is performed via an implicit backward-Euler scheme in velocity form, yielding a nonlinear residual that couples inertia, hyperelastic and viscous internal forces, external loads, and constraints in a single velocity solve. Constraints (e.g., joint closure, bilateral kinematics) are enforced using an augmented Lagrangian method, which alternates between (i) an inner velocity solve at fixed multipliers, and (ii) a multiplier update based on the current constraint violation.
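The alternating structure described above can be sketched in miniature. The sketch below is illustrative only: `solve_velocity` and `constraint` are hypothetical callables standing in for the paper's inner velocity solve and bilateral constraint evaluation, and the loop shows only the outer ALM alternation with its dual-ascent update.

```python
import numpy as np

def augmented_lagrangian_step(solve_velocity, constraint, lam, rho, v0,
                              tol=1e-8, max_outer=20):
    """One implicit time step: outer ALM loop alternating (i) an inner
    velocity solve at fixed multipliers with (ii) a dual-ascent
    multiplier update driven by the current constraint violation.

    solve_velocity(lam, rho, v_guess) -> v minimizes the augmented
    residual at fixed multipliers; constraint(v) -> c(v) returns the
    bilateral constraint violation. Names/signatures are illustrative,
    not the paper's API."""
    v = v0
    for _ in range(max_outer):
        v = solve_velocity(lam, rho, v)   # (i) inner velocity solve
        c = constraint(v)                 # current constraint violation
        lam = lam + rho * c               # (ii) dual-ascent update
        if np.linalg.norm(c) < tol:       # stop once constraints satisfied
            break
    return v, lam
```

With an exact inner solve on a toy quadratic objective, the loop drives the constraint violation to tolerance in a handful of outer iterations.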

Two inner solvers are supported:

  • First-order AdamW optimizer: Employs gradient-based optimization, requires only residual evaluations, and performs simple, fully parallel per-DOF updates.
  • Second-order (Newton) method: Assembles the full sparse global Hessian (mass, stiffness, constraint terms), solves the resulting system via the GPU-native cuDSS sparse-direct Cholesky factorization, and uses fixed-pattern numerical refactorization for efficiency across iterations.
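The per-DOF character of the first-order option can be seen in a minimal AdamW sketch. Every line of the update is an elementwise operation, which is what makes the scheme embarrassingly parallel on a GPU; the hyperparameters and the simple fixed-iteration loop are illustrative, not the paper's implementation.

```python
import numpy as np

def adamw_solve(grad, x0, lr=1e-2, betas=(0.9, 0.999), eps=1e-8,
                weight_decay=0.0, iters=5000):
    """First-order inner solve: AdamW on the gradient of the velocity
    residual. grad(x) returns the gradient; all updates below are
    elementwise (per-DOF), hence fully parallel."""
    b1, b2 = betas
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, iters + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g        # first-moment estimate
        v = b2 * v + (1 - b2) * g * g    # second-moment estimate
        mhat = m / (1 - b1 ** t)         # bias corrections
        vhat = v / (1 - b2 ** t)
        x -= lr * (mhat / (np.sqrt(vhat) + eps) + weight_decay * x)
    return x
```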

The global assembly leverages persistent, fixed sparsity patterns for all operators (mass matrix, constraint Jacobian, Hessian), allowing amortization of symbolic analysis and supporting rapid block-refactorization.
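The fixed-sparsity idea can be illustrated with SciPy: the CSR pattern (`indptr`, `indices`) is built once for the union of all contributions, and each iteration only rewrites the numeric `data` array against that frozen structure. Note this only demonstrates pattern reuse; cuDSS additionally caches the symbolic factorization across numerical refactorizations, which SciPy's `splu` does not.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 4
# Union-of-contributions pattern, listed in row-major order and built once.
rows = np.array([0, 0, 1, 1, 2, 2, 3, 3])
cols = np.array([0, 1, 0, 1, 2, 3, 2, 3])
pattern = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
pattern.sort_indices()
indptr, indices = pattern.indptr, pattern.indices   # frozen sparsity

def assemble(values):
    """New numerical values, same structure: no symbolic work repeated.
    `values` must follow the row-major order of (rows, cols) above."""
    return sp.csr_matrix((values, indices, indptr), shape=(n, n))

vals = np.array([4.0, 1.0, 1.0, 3.0, 2.0, 0.5, 0.5, 2.0])
A = assemble(vals)
x = spla.splu(A.tocsc()).solve(np.ones(n))  # sparse-direct solve
```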

GPU Parallelization and Assembly

Element-level quantities (forces, tangents) are decomposed for maximal thread-level parallelism. For each element and quadrature point, stress tensors are computed independently, with intermediate results cached for reuse in both force-vector and tangent assembly. Global assembly uses a two-stage approach:

  1. Stress evaluation: One thread per element–quadrature point writes to disjoint positions.
  2. Assembly: One thread per element–node pair computes contributions, accumulating into the global vector using atomic operations. This avoids global communication or reduction bottlenecks, scales with mesh size, and supports all element types.
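The two stages above can be mimicked in NumPy, with `np.add.at` playing the role of the GPU atomic add. The element array, the scalar "stress" placeholder, and the trivial force rule are all illustrative; only the write pattern (disjoint in stage 1, atomic-style scatter in stage 2) reflects the strategy described.

```python
import numpy as np

n_nodes = 5
elems = np.array([[0, 1, 2], [2, 3, 4]])    # two toy 3-node elements

# Stage 1: one "thread" per element-quadrature point writes to a
# disjoint slot of the stress buffer -- no synchronization needed.
stress = np.zeros(len(elems))
for e in range(len(elems)):                 # parallel over elements/QPs
    stress[e] = 1.0 + e                     # placeholder stress evaluation

# Stage 2: one "thread" per element-node pair scatter-adds its
# contribution into the global force vector (atomic add on the GPU).
force = np.zeros(n_nodes)
contrib = np.repeat(stress, elems.shape[1])
np.add.at(force, elems.ravel(), contrib)
```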

The Hessian assembly for Newton's method reuses stress caches, employs per-element tangent computation, and accumulates block-contributions into the global matrix using atomics guided by precomputed CSR patterns. Constraint-induced matrix fill is accounted for at initialization, preventing runtime graph reordering.
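A minimal sketch of accumulating element tangents against a precomputed CSR pattern, assuming one DOF per node and a toy element matrix: each block entry (i, j) is mapped once, at initialization, to its slot in the CSR `data` array, and per-step contributions are then scatter-added into those slots (atomics on the GPU, `np.add.at` here).

```python
import numpy as np
import scipy.sparse as sp

elems = np.array([[0, 1], [1, 2]])          # two 2-node toy elements
n = 3
# Block row/col indices for every element's dense block.
rows = np.repeat(elems, elems.shape[1], axis=1).ravel()
cols = np.tile(elems, (1, elems.shape[1])).ravel()
# Union-of-blocks CSR pattern, built once at initialization.
pattern = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
pattern.sort_indices()
indptr, indices = pattern.indptr, pattern.indices

def slot(i, j):
    """CSR data slot for entry (i, j); precomputed once per entry."""
    start, stop = indptr[i], indptr[i + 1]
    return start + np.searchsorted(indices[start:stop], j)

slots = np.array([slot(i, j) for i, j in zip(rows, cols)])

# Per step: accumulate element tangents into the fixed-pattern data array.
data = np.zeros(len(indices))
ke = np.array([[2.0, -1.0], [-1.0, 2.0]])   # toy element tangent block
np.add.at(data, slots, np.tile(ke.ravel(), len(elems)))
H = sp.csr_matrix((data, indices, indptr), shape=(n, n))
```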

Collision Detection and Contact Handling

Collision detection employs a BVH-free, bin-based spatial partitioning at the triangle-soup level, with the broad phase parallelized over bins and triangles. Contact candidates are identified without requiring mesh connectivity. Detection and integration run as two asynchronous GPU threads:

  • Kinematics thread: Continuously updates candidate contacts based on enlarged triangle proxies.
  • Dynamics thread: Processes the active contact set, computes actual penetration and force, and excludes false positives.
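The bin-based broad phase can be sketched as follows, in 2D for brevity: each triangle's slightly enlarged AABB is hashed into uniform spatial bins, and candidate pairs are emitted only within a bin. There is no hierarchy to rebuild as the mesh deforms, matching the triangle-soup design; the bin size, margin, and serial loops (parallel over bins/triangles on the GPU) are illustrative.

```python
import numpy as np

def broad_phase(tris, bin_size=1.0, margin=0.1):
    """tris: (n, 3, 2) array of triangle vertices. Returns the set of
    candidate pairs (i, j), i < j, sharing at least one spatial bin."""
    lo = tris.min(axis=1) - margin            # enlarged AABB corners
    hi = tris.max(axis=1) + margin
    bins = {}
    for t in range(len(tris)):                # parallel over triangles
        bx0, by0 = np.floor(lo[t] / bin_size).astype(int)
        bx1, by1 = np.floor(hi[t] / bin_size).astype(int)
        for bx in range(bx0, bx1 + 1):        # register in every bin touched
            for by in range(by0, by1 + 1):
                bins.setdefault((bx, by), []).append(t)
    pairs = set()
    for members in bins.values():             # parallel over bins
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```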

Narrow-phase contact uses geometry-projection-based detection, voting for the patch-level contact normal and area. Contact forces are computed with a Hertz–Mindlin spring-dashpot model, enforcing Coulomb friction and rolling resistance, and are accumulated at the patch level to achieve stability under complex contact configurations.
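A spring-dashpot force law in the Hertz-Mindlin style can be sketched per contact: a nonlinear normal spring proportional to penetration to the 3/2 power plus a dashpot, with the tangential force capped by the Coulomb cone. The coefficients, the simplified tangential spring, and the omission of rolling resistance are all illustrative simplifications, not the paper's calibrated model.

```python
import numpy as np

def contact_force(delta, delta_dot, ut, mu, kn=1e5, cn=50.0, kt=8e4):
    """delta: normal penetration (>0 in contact), delta_dot: its rate,
    ut: tangential relative-displacement vector, mu: friction coeff."""
    if delta <= 0.0:
        return 0.0, np.zeros_like(ut)          # not in contact
    fn = kn * delta ** 1.5 + cn * delta_dot    # Hertzian spring + dashpot
    fn = max(fn, 0.0)                          # no adhesive (pulling) force
    ft = -kt * ut                              # trial tangential spring force
    ft_max = mu * fn                           # Coulomb friction cap
    if np.linalg.norm(ft) > ft_max:            # sliding: project onto cone
        ft *= ft_max / np.linalg.norm(ft)
    return fn, ft
```

Small tangential displacements stay in the stick regime (force follows the spring); larger ones saturate on the friction cone, reproducing the stick-slip behavior exercised by the unit tests.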

Numerical Results and Performance Analysis

Performance Benchmarks

Systematic scaling benchmarks on an NVIDIA RTX 5090 for the T10, ANCF3243, and ANCF3443 element classes demonstrate the framework's real-time factor (RTF) performance up to nearly 10^6 DOFs. Notably:

  • Newton solver on GPU delivers approximately an order-of-magnitude reduction in RTF compared to established CPU solvers (FEniCS, Project Chrono) at the highest resolutions tested.
  • For T10 RES32 (497,310 DOFs), Newton's RTF is 1,250, versus 27,597 for the CPU reference.
  • Both beam and shell benchmarks show similar scaling and superior performance compared to CPU baselines.
  • On complex unstructured meshes (e.g., Stanford Bunny, Utah Teapot, tire), Newton GPU solver consistently outperforms FEniCS on CPU by factors of 10–30 in RTF.

Validation

Unit tests validate the physical fidelity of:

  • Frictional contact (stick-slip transitions on inclined planes, dynamic oblique impacts with restitution and post-impact angular velocity matching analytic predictions in sliding regime).
  • Bilateral constraint enforcement (double pendulum with revolute and spherical joints; constraint residuals remain below 10^{-8} even under large deformation and high damping, with joint force recovery consistent with quasi-static analytical results).

Large-Scale Demonstrations

Large-scale dynamic contact simulations with complex bodies (e.g., dropping a vase onto a foam insert, multi-tire contact in cluttered scenes with shell and solid discretizations) run stably at near one-million-DOF scale. Diagnostics reveal that solver cost correlates more with local contact reorganization and geometry than with raw contact pair count.

Implications and Outlook

This work demonstrates that direct, implicit simulation of highly nonlinear, large-deformation flexible multibody systems with rich contact and constraint structure is feasible at rates approaching real-time for substantial problem sizes on current commodity GPU hardware. The absence of hierarchical collision structures (BVH) in favor of a triangle-soup/bin-based approach streamlines broad-phase collision for deformable bodies, fully utilizing the GPU's throughput characteristics. The fixed-sparsity matrix strategy and GPU-resident precomputation reduce or eliminate bottlenecks associated with symbolic graph analysis and host-device bandwidth.

The explicit validation against analytic and reference solutions for contact and joint response underpins the accuracy of the approach, while the ability to scale to complex and mixed-topology systems affirms its generality.

Future Directions

From a research and engineering standpoint, this framework opens several future directions:

  • Extension to additional material laws (e.g., plasticity, damage, multiphase coupling).
  • Adaptive and dynamic meshing or AMR for further performance/resource efficiency.
  • Enhanced support for contact topology changes (fracture, cutting).
  • Deployment in full-physics simulation suites for co-simulation with control/AI stacks or for data-driven model generation.
  • Integration with differentiable programming or adjoint analysis for inversion or optimal control applications in computational mechanics, robotics, and digital twin systems.

Conclusion

This paper delivers a technically robust, GPU-accelerated pipeline for total Lagrangian finite element simulation of flexible multibody dynamics with implicit time integration, nonlinear contact, and general bilateral constraints. It establishes that large-scale simulations of complex, highly flexible, and highly interactive systems are achievable at greatly reduced cost through a combination of algorithmic design, fixed-sparsity assembly, and GPU-centric parallelization (2604.10357). The approach is validated both against analytical benchmarks and through practical, large-scale demonstrations, setting a new benchmark for simulation capability and accessibility at scale in the field.
