- The paper introduces a novel framework that leverages flow-matching and physics-driven simulation for stable and synchronized human-human co-manipulation.
- It employs an affordance-informed contact strategy and adversarial interaction priors to achieve intention-driven manipulation with high contact accuracy.
- Experimental results on the Core4D dataset demonstrate improved metrics in spatial alignment, motion diversity, and physical plausibility compared to prior methods.
Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
This work addresses the complexity inherent in synthesizing coordinated human motion for object-guided human-human co-manipulation tasks. The challenge extends beyond typical single-agent or purely social multi-agent motion generation methods, requiring triadic interaction modeling between two humans and a manipulated object. The goal is to achieve intention-driven manipulation (goal alignment via object affordance), motion naturalness (responsive and synchronized inter-agent behavior), and effectiveness (physical stability and plausibility). Prior approaches either lack explicit object-conditioned guidance, suffer from artifacts in dual-agent modeling, or fail to incorporate physics-based feedback essential for stability during co-manipulation.
Framework Overview
The proposed solution integrates three principal modules: (i) flow-matching for deterministic, likelihood-based motion synthesis conditioned on object geometry and trajectory; (ii) an affordance-informed manipulation strategy to anchor hand-object contact based on object affordance fields; and (iii) adversarial interaction priors and physics-driven simulation to enforce motion realism, inter-person coordination, and physical stability.
Figure 1: Overview of the object-guided co-manipulation framework, illustrating the integration of flow matching, affordance-influenced contact guidance, adversarial interaction prior, and stability-driven simulation.
Technical Methodology
Flow-Matching Motion Generation
The core of the framework is a transformer-based flow-matching model that learns a continuous vector field mapping noise to valid interaction sequences under object trajectory guidance. SMPL-X representations encode articulated pose and shape for each human. Object information is integrated via BPS descriptors and rigid pose trajectories. Motion synthesis uses deterministic Euler integration, and training minimizes a weighted objective comprising flow regression, per-joint L1 loss, foot-contact stabilization, and adversarial prior terms.
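The sampling and training steps described above can be illustrated with a minimal sketch. It assumes a straight-line probability path between noise and data, which is a common flow-matching choice but is not confirmed by the source. The function names (`toy_velocity`, `euler_sample`, `flow_matching_loss`) are hypothetical, and the closed-form velocity field stands in for the learned transformer, which in the paper is additionally conditioned on object geometry and trajectory.

```python
import numpy as np

def toy_velocity(x, t, target):
    # Hypothetical stand-in for the learned network: the exact velocity
    # field of a straight-line path that ends at a single target sample.
    return (target - x) / (1.0 - t)

def euler_sample(velocity_fn, x0, num_steps=10):
    # Deterministic Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

def flow_matching_loss(velocity_fn, x0, x1, t):
    # Flow regression term under the assumed straight-line path:
    # x_t = (1 - t) * x0 + t * x1, with regression target x1 - x0.
    x_t = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    return float(np.mean((velocity_fn(x_t, t) - target_v) ** 2))

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])   # toy "motion" sample
noise = rng.standard_normal(3)        # starting noise
sample = euler_sample(lambda x, t: toy_velocity(x, t, target), noise)
```

In this toy setting the integration lands on `target` because the velocity field is exact. With a learned field, the number of Euler steps trades accuracy against inference cost.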
To ensure intention-driven manipulation, a dense object affordance regressor identifies graspable surface zones for contact anchor generation. A diffusion model conditioned on affordance and BPS features produces explicit hand-object contact strategies, constrained via anchor, normal, and affordance alignment losses. During flow-matching inference, differentiable loss gradients enforce wrist alignment to predicted contact anchors, promoting coherent, diverse, and semantically valid co-manipulation plans.
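The inference-time guidance toward contact anchors can be sketched as a gradient correction applied after each Euler step. This is a simplified illustration under stated assumptions: the state is a plain array of joint positions, the wrists are rows of that array (in the actual system they would come from the SMPL-X body model), and the loss is a squared distance whose gradient is written by hand. `guided_euler_step`, `wrist_idx`, and `guidance_scale` are hypothetical names, not taken from the paper.

```python
import numpy as np

def anchor_loss_grad(wrists, anchors):
    # Gradient of 0.5 * ||wrists - anchors||^2 with respect to the wrists.
    return wrists - anchors

def guided_euler_step(x, v, dt, anchors, wrist_idx, guidance_scale):
    # Plain Euler update followed by a gradient step that pulls the
    # wrist joints toward their predicted contact anchors.
    x_next = x + dt * v
    grad = anchor_loss_grad(x_next[wrist_idx], anchors)
    x_next[wrist_idx] = x_next[wrist_idx] - guidance_scale * grad
    return x_next

joints = np.zeros((4, 3))      # toy pose: 4 joints in 3D
velocity = np.zeros((4, 3))    # toy velocity-field output
anchors = np.ones((2, 3))      # assumed contact anchors for two wrists
wrist_idx = [1, 3]             # assumed wrist rows in the pose array
guided = guided_euler_step(joints, velocity, 0.1, anchors, wrist_idx, 0.5)
```

Only the wrist rows are moved by the correction. A larger `guidance_scale` enforces contact more strongly, at the risk of pulling the sample away from the learned motion distribution.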
Adversarial Interaction Prior
Motion realism and inter-agent coordination are imposed by adversarial discriminators trained on real versus synthesized SMPL-X pose and inter-person interaction sequences. The pose prior refines single-body articulation; the interaction prior penalizes temporally misaligned dual-human responses. During inference, gradients from these discriminators bias flow evolution toward physically and socially plausible regions of the motion manifold.
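As a rough sketch of how discriminator gradients can bias a sample, the snippet below uses a logistic discriminator and takes a step that lowers -log D(x). The linear discriminator, its weights, and the names `realism_grad` and `prior_guided_update` are illustrative assumptions. The paper's discriminators operate on SMPL-X pose and interaction sequences, and their architecture is not described in this summary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_score(x, w, b):
    # Toy logistic discriminator: probability that x looks "real".
    return sigmoid(x @ w + b)

def realism_grad(x, w, b):
    # Gradient of -log D(x) with respect to x for the logistic model.
    return -(1.0 - discriminator_score(x, w, b)) * w

def prior_guided_update(x, w, b, scale=0.1):
    # Move x in the direction that increases the discriminator score.
    return x - scale * realism_grad(x, w, b)

w = np.array([1.0, -1.0])   # assumed discriminator weights
b = 0.0
x = np.zeros(2)             # toy motion feature
x_new = prior_guided_update(x, w, b)
```

The same update pattern would apply to both the pose prior and the interaction prior, each contributing its own gradient term.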
Stability-Driven Simulation
Physical plausibility and manipulation stability are refined via a simulation-in-the-loop strategy. The predicted motions are evaluated using a PD-controlled physics simulator, with corrective pose offsets sampled and optimized via CMA-ES to minimize physical cost functions (similarity and stability losses). Resulting physically valid trajectories are propagated to the next Euler integration step, closing the synthesis loop with explicit stability feedback.
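The sample-simulate-score loop can be illustrated with a deliberately simplified stand-in. The snippet below is not CMA-ES: it uses a basic Gaussian evolution strategy with elite averaging and a decaying step size, with no covariance adaptation. The "simulator" is a toy set of PD-controlled unit-mass joints under a constant load, and the cost terms are assumed forms of the similarity and stability losses. All names and constants are hypothetical.

```python
import numpy as np

def pd_rollout(target, q0, kp=60.0, kd=12.0, dt=0.01, steps=100, load=-2.0):
    # Toy physics: unit-mass joints tracking `target` with a PD controller
    # while a constant load pushes them away from it.
    q = np.array(q0, dtype=float)
    qd = np.zeros_like(q)
    for _ in range(steps):
        qdd = kp * (target - q) - kd * qd + load
        qd = qd + dt * qdd
        q = q + dt * qd
    return q, qd

def physical_cost(predicted, offset, stability_weight=0.1):
    # Similarity: simulated pose should match the predicted pose.
    # Stability: residual joint velocity at the end of the rollout.
    q, qd = pd_rollout(predicted + offset, q0=predicted)
    similarity = np.sum((q - predicted) ** 2)
    stability = np.sum(qd ** 2)
    return float(similarity + stability_weight * stability)

def refine_offset(predicted, rng, iters=30, pop=16, elite=4, sigma=0.1):
    # Simplified evolution strategy (stand-in for CMA-ES): sample offsets,
    # score them in simulation, re-center on the elite mean.
    mean = np.zeros_like(predicted)
    best_offset, best_cost = mean.copy(), physical_cost(predicted, mean)
    for _ in range(iters):
        samples = mean + sigma * rng.standard_normal((pop, predicted.size))
        costs = np.array([physical_cost(predicted, s) for s in samples])
        order = np.argsort(costs)
        if costs[order[0]] < best_cost:
            best_offset = samples[order[0]].copy()
            best_cost = float(costs[order[0]])
        mean = samples[order[:elite]].mean(axis=0)
        sigma *= 0.9
    return best_offset, best_cost

predicted = np.array([0.2, -0.1, 0.4])   # toy predicted pose
base_cost = physical_cost(predicted, np.zeros(3))
offset, cost = refine_offset(predicted, np.random.default_rng(0))
```

In the described pipeline, the pose obtained with the refined offset would be passed to the next Euler integration step. A full CMA-ES implementation would also adapt the sampling covariance, which this sketch omits for brevity.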
Figure 2: Stability-driven simulation pipeline showing CMA-ES-based corrective offset sampling, PD-controlled physics simulation, and iterative cost evaluation for trajectory refinement.
Quantitative and Qualitative Results
The framework outperforms adapted state-of-the-art baselines (ComMDM, OMOMO, InterGen) on the Core4D dataset in both interaction quality and motion quality. The reported metrics are IDF, Contact Accuracy, Penetration Depth, FID, and Diversity. The proposed approach achieves:
- Lower IDF (0.22) and penetration (0.05), signifying improved spatial alignment and physical plausibility.
- Higher contact accuracy (0.44), outperforming previous models lacking explicit contact or dual-agent constraints.
- Superior FID and diversity, indicating distributional fidelity and motion variation.
Qualitative evaluation demonstrates robust synchronized co-manipulation, fine-grained grasp adjustments, and stability throughout manipulation, contrasting with artifacts and contact failures in prior methods.
Figure 3: Comparison on Core4D-S1, highlighting stable and coordinated manipulation in the proposed approach versus baselines with misaligned or unstable contacts.
Figure 4: Cooperative motions generated by the framework, demonstrating synchronized lifting and steering with continuous grasp adjustment.
Ablation Study and Component Analysis
Ablation studies confirm the critical role of each module:
- Removing the simulation module increases penetration depth and reduces contact accuracy.
- Incorporating the interaction priors improves IDF and FID, enhancing inter-agent coordination and motion realism.
- Affordance-guided contact anchors significantly boost hand-object alignment.
Figure 5: Ablation results illustrating progressive improvement in interaction realism, contact accuracy, and physical plausibility with the addition of affordance, prior, and simulation modules.
Implications and Future Directions
This architecture offers a pathway toward interpretable, intention-driven, physically plausible motion generation in triadic interactions between two humans and an object. The combination of affordance modeling, adversarial priors, and simulation-based stability could plausibly be adapted to multi-agent, variable-payload scenarios, and potentially scaled to higher-order collaborative tasks in robotics and virtual reality. Pairing deterministic flow-matching synthesis with simulator-in-the-loop refinement targets generalization and stability issues that purely RL-based or diffusion-based policies leave unaddressed. Promising future research directions include real-time inference, more complex social-object tasks, and transfer to unseen object geometries.
Conclusion
The paper formulates a framework for object-guided human-human co-manipulation that combines flow-matching for deterministic synthesis, explicit affordance guidance for contact strategy, adversarial priors for motion realism, and stability-driven simulation for physical plausibility. Quantitative and qualitative results on Core4D indicate improvements over the adapted baselines in intention-driven, stable collaborative manipulation, providing a foundation for future work on embodied multi-agent interaction modeling (2604.20336).