- The paper introduces AutoMoMa, a GPU-accelerated system that unifies mobile base locomotion with high-DoF manipulation using Augmented Kinematic Representation for scalable trajectory synthesis.
- The paper demonstrates a GPU-based pipeline achieving up to 5,000 episodes per GPU-hour, producing over 500,000 physically valid trajectories across diverse scenes and robot morphologies.
- The paper highlights that scaling data diversity, rather than architectural changes alone, is crucial for robust policy learning and cross-environment generalization in whole-body mobile manipulation.
Scalable Trajectory Generation for Whole-Body Mobile Manipulation
Introduction and Motivation
The challenge of whole-body mobile manipulation lies in the combinatorial explosion of the configuration space induced by coupling mobile base locomotion with high-DoF manipulation in articulated, unstructured environments. Existing teleoperation and planning strategies are either labor-intensive, narrow in scope, or prohibitively constrained in scalability. Consequently, the lack of scalable acquisition pipelines for coordinated, physically valid whole-body trajectories has been a persistent bottleneck, stalling progress toward robust, generalizable learning for mobile manipulation.
"Scalable Trajectory Generation for Whole-Body Mobile Manipulation" (2604.12565) proposes AutoMoMa, a GPU-accelerated system that overcomes these data limitations by unifying Augmented Kinematic Representation (AKR)—which models the base, manipulator, and articulated objects into a single serial kinematic chain—with massively parallel trajectory optimization and collision checking. Notably, AutoMoMa achieves efficient synthesis of over 500,000 whole-body trajectories spanning hundreds of diverse scenes, objects, and robot morphologies.
AutoMoMa Framework and Methodology
AutoMoMa's pipeline is modularized into four stages: task specification, problem instantiation, trajectory generation, and rendering. Critical to throughput and fidelity is the AKR, which serializes the combined kinematic topology of the robot and manipulated object through automatic inversion and attachment procedures. The mobile base is abstracted by two prismatic and one revolute joint, and object kinematics are rigorously rooted at the grasp pose, enabling strict enforcement of task-specific and physical constraints within a unified configuration space.
Trajectory optimization leverages GPU-based acceleration to batch both motion and collision queries, using sphere-based collision approximations to scale across hundreds of thousands of episodes. The framework supports configuration diversity by sampling and clustering inverse kinematics (IK) solutions across multiple grasp points and initial states for both rigid and articulated targets. Notably, for multi-stage tasks (e.g., constrained cabinet opening), grasp switching and intermediate waypoints are explicitly synthesized.
Experimental Characterization of Data Scale, Diversity, and Policy Dependence
The resulting dataset consists of more than 500,000 physically valid, high-fidelity trajectories, each with synchronized RGB-D data and point cloud observations. AKR-based planning ensures kinematic and physical soundness, while the GPU-driven pipeline maintains throughput up to 5,000 episodes per GPU-hour—an 80× speedup over prior baselines.
Empirical analysis reveals:
- Configuration space coverage: Sampling and clustering across grasp and IK solutions ensures wide spatial and kinematic coverage; see the distinct distribution between start and goal base placements.
- Throughput-sensitivity to clutter: While generation rate is high in open layouts, dense clutter escalates collision checking, reducing throughput and the share of feasible solutions.
- Whole-body dynamical adaptation: Analysis of translational and rotational effort demonstrates that the planner dynamically allocates base and arm movement to compensate for geometric constraints, a capacity unavailable to pipelines synthesizing decoupled or partially coordinated motion.
Policy Learning and Data Scaling Laws
Downstream evaluation centers on imitation learning (IL) using state-of-the-art architectures: DP3 (3D point cloud diffusion), DP (image-based diffusion), and ACT (action chunking Transformer). Conditioning on multiple environments and robot morphologies, several key results emerge:
- Fixed-base vs. mobile-base scaling: Fixed-base policies saturate with <800 trajectories in a single scene; mobile manipulation policies require orders of magnitude more data to achieve comparable performance due to the enlarged and highly nonconvex configuration manifold.
- Data diversity is essential: Scaling the number of training environments and increasing trajectory density both independently and jointly improve generalization. Memorization of configuration manifolds is insufficient—scene-level and object diversity is crucial.
- Architectural invariance: Data-driven performance improvements are architecture-agnostic; both image- and point-based diffusion policies, as well as Transformer-based policies, show consistent gains with rising trajectory scale.
- Cross-object and cross-environment generalization: At 100k-scale, policies generalize robustly across multiple objects (as tested on canonical SAPIEN categories) and diverse layouts, with success rates on unseen environments and unseen configurations closely tracking those on seen distributions.
- Execution on real hardware: Trajectories synthesized in simulation are executed on a UR5-Ridgeback platform, validating zero-collision, constraint-compliant execution on complex tasks involving drawers and doors.

Figure 1: Drawer-opening trajectory executed on a real-world UR5-Ridgeback, transferring simulation-valid plans to hardware with high fidelity.
Figure 2: Example fixed-grasp trajectory demonstrating configuration and constraint adherence in a non-switching scenario.
Implications, Limitations, and Future Directions
AutoMoMa establishes, for the first time, the empirical necessity of large-scale, physically grounded data for advancing coordination policies in whole-body mobile manipulation. The work shows that for standard benchmarks, tens of thousands of task-specific demonstrations are required for state-of-the-art IL to reach approximately 80% success, directly contradicting the assumption that algorithmic improvements alone can address the generalization gap.
The practical implications extend to any robotics discipline requiring scalable, generalizable synthesis of complex, high-dof trajectory distributions. By fully automating planning at scale, the pipeline obviates the need for resource-intensive teleoperation, narrow-purpose dataset curation, or reliance on sample-inefficient reinforcement learning exploration.
Limitations stem primarily from reliance on known scene kinematics and the use of sphere-based vision collision for efficiency, which may fail on highly non-convex objects or in dynamic, human-robot interaction scenarios. Future directions include incorporation of deformable object modeling, on-policy data generation, extension to dynamic scenes, and lowering the sim-to-real gap via joint learning-planning pipelines and community-driven expansions to new embodiments.
Conclusion
AutoMoMa provides a robust, extensible solution to the core bottleneck in whole-body mobile manipulation: scalable, high-fidelity trajectory generation for downstream generalization. Empirical evidence demonstrates that generalizable, robust policy learning is ultimately constrained not by learning architectures but by data scale and diversity—a challenge directly addressed by the unification of AKR modeling with GPU-accelerated planning at unprecedented scale. This work provides a foundation for future advancements in scalable embodied AI and closes the gap between planning and learning for complex robotic systems.