Scalable Trajectory Generation for Whole-Body Mobile Manipulation

Published 14 Apr 2026 in cs.RO and cs.CV | (2604.12565v1)

Abstract: Robots deployed in unstructured environments must coordinate whole-body motion -- simultaneously moving a mobile base and arm -- to interact with the physical world. This coupled mobility and dexterity yields a state space that grows combinatorially with scene and object diversity, demanding datasets far larger than those sufficient for fixed-base manipulation. Yet existing acquisition methods, including teleoperation and planning, are either labor-intensive or computationally prohibitive at scale. The core bottleneck is the lack of a scalable pipeline for generating large-scale, physically valid, coordinated trajectory data across diverse embodiments and environments. Here we introduce AutoMoMa, a GPU-accelerated framework that unifies AKR modeling, which consolidates base, arm, and object kinematics into a single chain, with parallelized trajectory optimization. AutoMoMa achieves 5,000 episodes per GPU-hour (over $80\times$ faster than CPU-based baselines), producing a dataset of over 500k physically valid trajectories spanning 330 scenes, diverse articulated objects, and multiple robot embodiments. Prior datasets were forced to compromise on scale, diversity, or kinematic fidelity; AutoMoMa addresses all three simultaneously. Training downstream IL policies further reveals that even a single articulated-object task requires tens of thousands of demonstrations for SOTA methods to reach $\approx 80\%$ success, confirming that data scarcity -- not algorithmic limitations -- has been the binding constraint. AutoMoMa thus bridges high-performance planning and reliable IL-based control, providing the infrastructure previously missing for coordinated mobile manipulation research. By making large-scale, kinematically valid training data practical, AutoMoMa showcases generalizable whole-body robot policies capable of operating in the diverse, unstructured settings of the real world.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces AutoMoMa, a GPU-accelerated system that unifies mobile base locomotion with high-DoF manipulation using Augmented Kinematic Representation for scalable trajectory synthesis.
The paper demonstrates a GPU-based pipeline achieving up to 5,000 episodes per GPU-hour, producing over 500,000 physically valid trajectories across diverse scenes and robot morphologies.
The paper highlights that scaling data diversity, rather than architectural changes alone, is crucial for robust policy learning and cross-environment generalization in whole-body mobile manipulation.

Scalable Trajectory Generation for Whole-Body Mobile Manipulation

Introduction and Motivation

The challenge of whole-body mobile manipulation lies in the combinatorial explosion of the configuration space induced by coupling mobile base locomotion with high-DoF manipulation in articulated, unstructured environments. Existing teleoperation and planning strategies are either labor-intensive, narrow in scope, or prohibitively constrained in scalability. Consequently, the lack of scalable acquisition pipelines for coordinated, physically valid whole-body trajectories has been a persistent bottleneck, stalling progress toward robust, generalizable learning for mobile manipulation.

"Scalable Trajectory Generation for Whole-Body Mobile Manipulation" (2604.12565) proposes AutoMoMa, a GPU-accelerated system that overcomes these data limitations by unifying Augmented Kinematic Representation (AKR)—which models the base, manipulator, and articulated objects into a single serial kinematic chain—with massively parallel trajectory optimization and collision checking. Notably, AutoMoMa achieves efficient synthesis of over 500,000 whole-body trajectories spanning hundreds of diverse scenes, objects, and robot morphologies.

AutoMoMa Framework and Methodology

AutoMoMa's pipeline is modularized into four stages: task specification, problem instantiation, trajectory generation, and rendering. Critical to throughput and fidelity is the AKR, which serializes the combined kinematic topology of the robot and manipulated object through automatic inversion and attachment procedures. The mobile base is abstracted by two prismatic and one revolute joint, and object kinematics are rigorously rooted at the grasp pose, enabling strict enforcement of task-specific and physical constraints within a unified configuration space.

Trajectory optimization leverages GPU-based acceleration to batch both motion and collision queries, using sphere-based collision approximations to scale across hundreds of thousands of episodes. The framework supports configuration diversity by sampling and clustering inverse kinematics (IK) solutions across multiple grasp points and initial states for both rigid and articulated targets. Notably, for multi-stage tasks (e.g., constrained cabinet opening), grasp switching and intermediate waypoints are explicitly synthesized.

Experimental Characterization of Data Scale, Diversity, and Policy Dependence

The resulting dataset consists of more than 500,000 physically valid, high-fidelity trajectories, each with synchronized RGB-D data and point cloud observations. AKR-based planning ensures kinematic and physical soundness, while the GPU-driven pipeline maintains throughput up to 5,000 episodes per GPU-hour—an 80× speedup over prior baselines.

Empirical analysis reveals:

Configuration space coverage: Sampling and clustering across grasp and IK solutions ensures wide spatial and kinematic coverage; see the distinct distribution between start and goal base placements.
Throughput-sensitivity to clutter: While generation rate is high in open layouts, dense clutter escalates collision checking, reducing throughput and the share of feasible solutions.
Whole-body dynamical adaptation: Analysis of translational and rotational effort demonstrates that the planner dynamically allocates base and arm movement to compensate for geometric constraints, a capacity unavailable to pipelines synthesizing decoupled or partially coordinated motion.

Policy Learning and Data Scaling Laws

Downstream evaluation centers on imitation learning (IL) using state-of-the-art architectures: DP3 (3D point cloud diffusion), DP (image-based diffusion), and ACT (action chunking Transformer). Conditioning on multiple environments and robot morphologies, several key results emerge:

Fixed-base vs. mobile-base scaling: Fixed-base policies saturate with <800 trajectories in a single scene; mobile manipulation policies require orders of magnitude more data to achieve comparable performance due to the enlarged and highly nonconvex configuration manifold.
Data diversity is essential: Scaling the number of training environments and increasing trajectory density both independently and jointly improve generalization. Memorization of configuration manifolds is insufficient—scene-level and object diversity is crucial.
Architectural invariance: Data-driven performance improvements are architecture-agnostic; both image- and point-based diffusion policies, as well as Transformer-based policies, show consistent gains with rising trajectory scale.
Cross-object and cross-environment generalization: At 100k-scale, policies generalize robustly across multiple objects (as tested on canonical SAPIEN categories) and diverse layouts, with success rates on unseen environments and unseen configurations closely tracking those on seen distributions.
Execution on real hardware: Trajectories synthesized in simulation are executed on a UR5-Ridgeback platform, validating zero-collision, constraint-compliant execution on complex tasks involving drawers and doors.

Figure 1: Drawer-opening trajectory executed on a real-world UR5-Ridgeback, transferring simulation-valid plans to hardware with high fidelity.

Figure 2: Example fixed-grasp trajectory demonstrating configuration and constraint adherence in a non-switching scenario.

Implications, Limitations, and Future Directions

AutoMoMa establishes, for the first time, the empirical necessity of large-scale, physically grounded data for advancing coordination policies in whole-body mobile manipulation. The work shows that for standard benchmarks, tens of thousands of task-specific demonstrations are required for state-of-the-art IL to reach approximately 80% success, directly contradicting the assumption that algorithmic improvements alone can address the generalization gap.

The practical implications extend to any robotics discipline requiring scalable, generalizable synthesis of complex, high-dof trajectory distributions. By fully automating planning at scale, the pipeline obviates the need for resource-intensive teleoperation, narrow-purpose dataset curation, or reliance on sample-inefficient reinforcement learning exploration.

Limitations stem primarily from reliance on known scene kinematics and the use of sphere-based vision collision for efficiency, which may fail on highly non-convex objects or in dynamic, human-robot interaction scenarios. Future directions include incorporation of deformable object modeling, on-policy data generation, extension to dynamic scenes, and lowering the sim-to-real gap via joint learning-planning pipelines and community-driven expansions to new embodiments.

Conclusion

AutoMoMa provides a robust, extensible solution to the core bottleneck in whole-body mobile manipulation: scalable, high-fidelity trajectory generation for downstream generalization. Empirical evidence demonstrates that generalizable, robust policy learning is ultimately constrained not by learning architectures but by data scale and diversity—a challenge directly addressed by the unification of AKR modeling with GPU-accelerated planning at unprecedented scale. This work provides a foundation for future advancements in scalable embodied AI and closes the gap between planning and learning for complex robotic systems.

Markdown Report Issue