Decoupled Generative Modeling for HOI Synthesis
- The paper introduces a decoupled generative model that separates trajectory planning from action synthesis to enhance physical realism and training stability.
- It employs transformer-based diffusion networks and adversarial losses to achieve collision-free, contact-consistent, and semantically aligned 3D human-object interactions.
- The model integrates dynamic planning and re-planning strategies to support responsive, long-horizon multi-agent scenarios with superior quantitative and perceptual performance.
Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI) is a modeling paradigm for synthesizing physically and semantically plausible 3D human-object interactions by partitioning the generative process into distinct path-planning and action-synthesis modules. This approach, motivated by the limitations of monolithic generative models, enables improved training stability, physical realism, contact consistency, and generalization to dynamic scenes. DecHOI achieves superior performance relative to prior methods across quantitative, qualitative, and perceptual metrics (Jung et al., 22 Dec 2025).
1. Motivation and Conceptual Foundations
Traditional methods for 3D human-object interaction (HOI) synthesis use a single generative model, commonly a denoising diffusion or autoregressive architecture, to plan trajectories and synthesize detailed joint and object poses simultaneously. This unified treatment increases objective complexity, produces entangled errors (e.g., mesh penetrations, unsynchronized contact), and requires laborious manual waypoint specification for both the human and the object. In practice, these drawbacks show up as training instability, error-prone contact, and reduced flexibility in dynamic or long-horizon tasks.
DecHOI introduces an explicit decoupling: a trajectory generator (TG) plans collision-free, coarse trajectories for humans and objects without manual intervention; an action generator (AG) then synthesizes contact-aware, fine-grained motions conditioned on these paths. This partitioning assigns each network a lower-dimensional, specialized subtask, yielding reduced optimization complexity, improved convergence, and robust mitigation of contact and synchronization failures.
2. Model Architecture and Mathematical Formulation
DecHOI's architecture consists of two transformer-based diffusion networks—TG and AG—with distinct roles and conditioning strategies.
2.1 Trajectory Generator (TG)
- Inputs: Noisy human root positions and object global poses (over all frames), object geometry encoded by a small MLP, and a CLIP-based text embedding.
- Architecture: Transformer denoising diffusion model with cross-attention to the text embedding.
- Outputs: Collision-free trajectories for the human and for the object.
- Formulation: The denoiser predicts clean trajectories from the noisy inputs, the diffusion noise, and a term that summarizes the conditioning information.
2.2 Action Generator (AG)
- Inputs: All TG inputs except the root/object positions, which are replaced by the TG outputs, together with the original noisy per-joint rotations (6D representation).
- Architecture: Transformer diffusion backbone (shares TG backbone, omits text cross-attention).
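- Outputs: Fine-grained human joint rotations (6D representation) and object pose matrices.
- Formulation: The denoiser predicts clean joint rotations and object poses from the noisy inputs, conditioned on the TG output trajectories.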
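The 6D rotation representation mentioned above is commonly decoded into a rotation matrix with a Gram-Schmidt step. The following is a minimal sketch of that standard conversion, not the paper's code: the function name and the column-stacking convention are assumptions (some libraries stack the basis vectors as rows instead).

```python
import numpy as np

def rotation_6d_to_matrix(d6: np.ndarray) -> np.ndarray:
    """Map (..., 6) continuous rotation representations to (..., 3, 3) matrices."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    # First basis vector: normalize the first 3 numbers.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: remove the component along b1, then normalize.
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    # Third basis vector completes a right-handed frame.
    b3 = np.cross(b1, b2)
    # Assumed convention: b1, b2, b3 become the matrix columns.
    return np.stack([b1, b2, b3], axis=-1)

# Usage: a batch of 4 frames with 24 joints (both sizes are arbitrary here).
rng = np.random.default_rng(0)
rotations = rotation_6d_to_matrix(rng.normal(size=(4, 24, 6)))
```

Because the output is orthonormal by construction, a network can regress the six numbers freely without a separate projection onto valid rotations.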
2.3 Training Losses
- Trajectory loss: Position reconstruction and velocity smoothness
- Action loss: Rotation reconstruction with forward-kinematics supervision (hands/feet)
- Adversarial loss: Discriminator focuses on the dynamics of distal joints (hand/foot positions, sampled object points)
- Total objective: The trajectory, action, and adversarial terms are combined into the overall training objective.
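As an illustration of the trajectory term, the sketch below combines a position reconstruction error with a velocity term computed from finite differences. It is a hedged reading of "position reconstruction and velocity smoothness", not the paper's exact loss: the squared-error form, the velocity-matching interpretation, and the weight `vel_weight` are all assumptions.

```python
import numpy as np

def trajectory_loss(pred, target, vel_weight=1.0):
    """Position reconstruction plus a finite-difference velocity term.

    pred, target: (T, D) arrays of per-frame positions.
    vel_weight: hypothetical weight on the velocity term.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Mean squared position error over all frames and coordinates.
    recon = np.mean((pred - target) ** 2)
    # Frame-to-frame displacements approximate velocities.
    pred_vel = np.diff(pred, axis=0)
    target_vel = np.diff(target, axis=0)
    vel = np.mean((pred_vel - target_vel) ** 2)
    return recon + vel_weight * vel
```

A constant offset between prediction and target is penalized only by the reconstruction term, while frame-to-frame jitter is penalized by both terms, which is the intent of adding a velocity component.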
3. Dynamic, Long-Sequence and Responsive Planning
DecHOI supports long-horizon HOI synthesis and multi-agent scenes through explicit planning and online re-planning mechanisms:
- A* Planner: Operates on a 2D navigable grid for root and object trajectories.
- Counterpart Prediction: Utilizes Social-STGCNN to forecast other agents' paths; intersections of influence regions trigger re-planning.
- Re-planning Logic: A* planner selects detours or “wait” actions; revised trajectory is recomputed and passed to AG for detailed motion synthesis.
- Outcome: Collision-free, intent-consistent human and object paths over arbitrarily long sequences.
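The planning and re-planning steps above can be sketched as a space-time A* search in which waiting in place is a legal move. This is a simplified illustration under stated assumptions rather than the paper's planner: the other agent's forecast is passed in as a plain list of cells (standing in for a Social-STGCNN prediction), a conflict means occupying the same cell at the same step (standing in for intersecting influence regions), swap conflicts are ignored, and the names `plan` and `replan_if_conflict` are hypothetical.

```python
import heapq

def plan(grid, start, goal, reserved=frozenset(), max_t=64):
    """Space-time A* on a 4-connected grid with a 'wait' move.

    grid[r][c] == 1 marks a static obstacle; reserved holds (r, c, t)
    triples predicted to be occupied by another agent at step t.
    Returns one cell per time step, or None if no path is found.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):  # Manhattan distance to the goal
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    moves = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # (0, 0) is "wait"
    frontier = [(h(start), 0, start)]
    parent = {(start, 0): None}
    while frontier:
        _, t, cell = heapq.heappop(frontier)
        if cell == goal:
            path, node = [], (cell, t)
            while node is not None:  # walk back to the start
                path.append(node[0])
                node = parent[node]
            return path[::-1]
        if t >= max_t:
            continue
        for dr, dc in moves:
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1 or (nxt[0], nxt[1], t + 1) in reserved:
                continue
            if (nxt, t + 1) not in parent:
                parent[(nxt, t + 1)] = (cell, t)
                heapq.heappush(frontier, (t + 1 + h(nxt), t + 1, nxt))
    return None

def replan_if_conflict(grid, path, goal, predicted):
    """Re-plan only if `path` meets the other agent's predicted cells."""
    reserved = {(r, c, t) for t, (r, c) in enumerate(predicted)}
    conflict = any((r, c, t) in reserved for t, (r, c) in enumerate(path))
    return plan(grid, path[0], goal, reserved) if conflict else path

# Usage: in a one-cell-wide corridor, the agent waits one step for the
# other agent to clear the middle cell.
corridor = [[0, 0, 0]]
first = plan(corridor, (0, 0), (0, 2))
revised = replan_if_conflict(corridor, first, (0, 2), [(0, 1), (0, 1)])
```

In a wider grid the same search would return a detour whenever it is no longer than waiting; the revised path would then be handed to the action generator.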
4. Benchmarking, Datasets, and Evaluation
Experiments are conducted on FullBodyManipulation (10 h, 15 rigid objects, 15 train/2 test subjects) and 3D-FUTURE (unseen furniture). DecHOI is evaluated on the following metrics:
- Condition Matching: Start/end position error (cm)
- Human Motion Quality: Foot height, foot sliding, FID, diversity
- Interaction Quality: Contact precision/recall/F1, penetration depth
- Ground-truth Error: MPJPE, root/object translation and orientation errors
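Two of the listed metrics have standard definitions that are easy to state in code. The sketch below computes MPJPE and contact precision/recall/F1 using their usual formulas. The function names are hypothetical, and the per-frame binary contact labels are assumed to be given (the threshold used to derive them is not specified here).

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (T, J, 3) joint positions."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def contact_prf(pred_contact, gt_contact):
    """Precision, recall, and F1 over binary contact labels."""
    pred = np.asarray(pred_contact, dtype=bool)
    gt = np.asarray(gt_contact, dtype=bool)
    tp = np.sum(pred & gt)    # contact predicted and present
    fp = np.sum(pred & ~gt)   # contact predicted but absent
    fn = np.sum(~pred & gt)   # contact missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

MPJPE is reported in the same unit as the input joint positions, so positions should be converted to the reporting unit before the call.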
Quantitative Results
| Method | FID ↓ | ↑ | ↓ | MPJPE ↓ | ↓ | ↓ |
|---|---|---|---|---|---|---|
| CHOIS | 1.58 | 0.59 | 0.66 | 18.86 | 1.92 | 8.01 |
| HOIFHLI | 2.06 | 0.64 | 0.58 | 19.31 | 1.73 | 7.65 |
| DecHOI | 0.33 | 0.67 | 0.53 | 15.27 | 1.59 | 6.91 |
On 3D-FUTURE:
| Method | FID ↓ | ↑ | ↓ | ↓ |
|---|---|---|---|---|
| CHOIS | 2.04 | 0.46 | 0.18 | 5.75 |
| DecHOI | 1.01 | 0.48 | 0.15 | 4.27 |
Qualitative and perceptual studies indicate DecHOI exhibits significantly fewer object and joint penetrations, improved contact fidelity, and better semantic alignment, with user preference exceeding 60% for both “Text Alignment” and “Interaction Quality” over CHOIS/HOIFHLI (Jung et al., 22 Dec 2025).
5. Comparison to Related Decoupled and Modular HOI Approaches
The decoupled philosophy underlying DecHOI aligns with broader trends in HOI and scene synthesis:
- Decoupled pose-contact-placement pipelines for 3D human pose generation in scenes (Dang et al., 2024) demonstrate that modular priors improve diversity (Cluster Size 0.90 vs. 0.61 for COINS) and physical plausibility (Non-Collision 0.99).
- CoopDiff employs dual diffusion branches for human and object, linked via contact-point consistency and human-driven adapters, further reducing MPJPE and penetration, and outperforming joint models (Lin et al., 10 Aug 2025).
- Semantic-dynamics decoupling in InterDreamer leverages pretrained LLMs and a world model to achieve zero-shot 3D HOI without direct text-interaction pair supervision (Xu et al., 2024).
A plausible implication is that enforced modularity or decoupling often improves generalization and robustness to cluttered or unseen environments, and mitigates failure cases endemic to monolithic or purely joint models.
6. Conclusions, Limitations, and Future Directions
DecHOI’s contributions are:
- A two-stage, decoupled pipeline that obviates manual waypoints, decomposes optimization, and achieves state-of-the-art synthesis of collision-free, semantically aligned human-object interactions.
- Adversarial contact loss on distal joints to improve realism at critical contact sites.
- A dynamic planner permitting responsive, long-horizon, multi-agent motion.
Known limitations and directions for further research include:
- Incorporating full-body physical constraints to ensure balance and plausible force dynamics.
- Extension to articulated and deformable objects (e.g., doors, drawers).
- Learning flexible, end-to-end re-planning policies (replacing A*) to enhance adaptability in complex or highly cluttered environments (Jung et al., 22 Dec 2025).
These advances position DecHOI and related decoupled architectures as foundational for next-generation animation, robotics, and embodied AI.