ByteLoom: HOI Video Synthesis Framework
- ByteLoom is a framework for HOI video synthesis that leverages a latent Diffusion Transformer and multi-view RCM caches to ensure geometric realism.
- It integrates human pose, per-frame RCM, and multi-view object data through latent fusion and a progressive curriculum learning strategy for optimal motion fidelity.
- Benchmarking and ablation studies demonstrate that ByteLoom significantly outperforms previous methods on metrics like Obj-IoU, CLIP similarity, and temporal consistency.
ByteLoom is a framework for human-object interaction (HOI) video generation, synthesizing sequences that maintain geometric consistency and multi-view realism of manipulated objects without dependence on fine-grained hand mesh annotations. Built on a latent Diffusion Transformer (DiT) architecture, ByteLoom leverages Relative Coordinate Map caches and a progressive curriculum learning regime to deliver state-of-the-art results across object consistency and HOI motion fidelity benchmarks (Liu et al., 28 Dec 2025).
1. Architectural Foundation
ByteLoom utilizes a latent DiT backbone with a “latent fusion” frontend that integrates multiple conditioning signals into the input latent space. Each video synthesis involves the following conditioning:
- Human reference image (either untextured or VAE-encoded)
- Per-frame 2D human pose via OpenPose or DWPose joints
- Multi-view object data from an RCM-cache (each entry: RGB image and corresponding RCM)
- Per-frame RCM specifying the object’s 6-DoF pose (rotation, translation)
Condition latents are concatenated along the appropriate dimensions (channel-wise for pose and per-frame RCM, temporally for the cached RCM-RGB pairs) and processed by a small MLP (the "latent fuser"). The DiT module then follows the standard diffusion denoising design: time-embedded, ε-noised latents pass through multiple Transformer blocks with cross-attention over the fused conditions, and the resulting denoised latents are decoded by the VAE into video frames.
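A minimal PyTorch sketch of this fusion step; the `LatentFuser` module, channel widths, and token layouts below are illustrative assumptions rather than details given in the source:

```python
import torch
import torch.nn as nn

class LatentFuser(nn.Module):
    """Hypothetical latent fuser: concatenates conditioning latents channel-wise
    and projects them back to the DiT's latent width with a small MLP."""
    def __init__(self, latent_ch=16, pose_ch=16, rcm_ch=16, ref_ch=16, hidden=256):
        super().__init__()
        in_ch = latent_ch + pose_ch + rcm_ch + ref_ch  # channel-wise concat width
        self.mlp = nn.Sequential(
            nn.Linear(in_ch, hidden), nn.SiLU(), nn.Linear(hidden, latent_ch)
        )

    def forward(self, z, pose_lat, rcm_lat, ref_lat, cache_lat):
        # z, pose_lat, rcm_lat, ref_lat: (B, T, N, C) per-frame token grids
        # cache_lat:                     (B, K, N, C) tokens from K cached RGB+RCM views
        fused = self.mlp(torch.cat([z, pose_lat, rcm_lat, ref_lat], dim=-1))
        # cached multi-view tokens are appended along the temporal/token axis
        return torch.cat([fused, cache_lat], dim=1)

# toy shapes: batch 1, 8 frames, 4x4 latent grid, 16 channels, 3 cached views
fuser = LatentFuser()
out = fuser(*[torch.randn(1, 8, 16, 16) for _ in range(4)], torch.randn(1, 3, 16, 16))
print(out.shape)  # torch.Size([1, 11, 16, 16])
```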
The stepwise synthesis pipeline is:
```
z_T ~ N(0, I)                                   # initialize latent from Gaussian noise
for t = T down to 1 do
    c = concat_along_channels(
        MLP_pose(human_pose_t),
        per_frame_RCM_t,
        MLP_cache(RCM_cache_images, RCM_cache_maps),
        MLP_img(human_reference)
    )
    ε_θ = DiT(z_t, c, t)                        # predicted noise
    z_{t-1} = NoiseSchedule.step(z_t, ε_θ, t)   # ≈ z_t − α_t · ε_θ
end
output_video = VAE.decode(z_0)
```
Here, α_t is the usual diffusion step size.
2. Relative Coordinate Map (RCM) Cache
A unique innovation of ByteLoom lies in the RCM-cache, a persistent dictionary containing sparse-view object references and their geometric priors. Each cache entry stores an (RGB, RCM) pair. For a 3D object mesh with vertices $v_i$ in the canonical object frame, the RCM assigns each vertex its coordinate normalized to the unit cube,

$$\mathrm{RCM}(v_i) = \frac{v_i - v_{\min}}{v_{\max} - v_{\min}} \in [0,1]^3,$$

where $v_{\min}$ and $v_{\max}$ are the per-axis bounds of the mesh. Pixel values during rendering are obtained by barycentric interpolation over the covering triangle,

$$\mathrm{RCM}(p) = \sum_{k=1}^{3} \lambda_k(p)\,\mathrm{RCM}(v_k),$$

where $\lambda_k(p)$ are the barycentric weights of pixel $p$.
This mapping encodes each pixel's normalized 3D location on the object's surface. During synthesis, K cached RGB+RCM pairs are loaded and concatenated temporally. Independently, a fresh per-frame RCM is rendered under the target pose and concatenated along the channel axis, providing direct geometric control signals. Through joint cross-attention on cached and per-frame RCM, ByteLoom achieves fine control of object view synthesis and accurate 6-DoF alignment across video frames.
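The RCM construction can be sketched in a few lines of Python; the bounding-box normalization and explicit per-pixel barycentric weights are assumptions about how the map is realized, consistent with the description above but not prescribed by the source:

```python
import numpy as np

def vertex_rcm(vertices):
    """Normalize canonical mesh vertices to [0, 1]^3 (assumed RCM definition)."""
    v_min, v_max = vertices.min(axis=0), vertices.max(axis=0)
    return (vertices - v_min) / (v_max - v_min + 1e-8)

def pixel_rcm(rcm_vals, bary):
    """Barycentric interpolation of per-vertex RCM values for one rasterized pixel.

    rcm_vals: (3, 3) RCM of the covering triangle's three vertices
    bary:     (3,)   barycentric weights of the pixel inside that triangle
    """
    return bary @ rcm_vals  # (3,) normalized 3D surface coordinate

# toy usage: a single triangle and a pixel at its centroid
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 1.0]])
rcm = vertex_rcm(verts)
print(pixel_rcm(rcm, np.array([1 / 3, 1 / 3, 1 / 3])))
```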
3. Progressive Curriculum Learning
To address the scarcity of full HOI data, ByteLoom employs a three-stage progressive curriculum scheme:
- Stage I — Human Pose Pretraining: Utilizes ∼6.4M clips of object-free human motion annotated with pose. Model is trained under standard diffusion reconstruction loss to synthesize temporally coherent animations conditioned on skeleton joints.
- Stage II — Hand–Object Interaction Pretraining: Trains on ∼550K clips with hand joints and textured objects (e.g., DexYCB, HO3D, ARCTIC). Conditioning now includes the RCM-cache and per-frame RCM, teaching the model contact patterns and local occlusion.
- (Optional) Object-Only NVS Stage: Synthetic data for object-only scenes encourages cross-view consistency but yields marginal gains for full HOI; inclusion is optional.
- Stage III — Full HOI Finetuning: Uses ∼45K studio HOI clips featuring full body, object, 6-DoF, and RCM data. The model is trained to synthesize coordinated manipulation and appearance coherence under the standard diffusion (ε-prediction) loss $\mathcal{L} = \mathbb{E}_{z_0,\epsilon,t}\big[\lVert \epsilon - \epsilon_\theta(z_t, c, t)\rVert_2^2\big]$.
The learning rate is linearly warmed up to its peak value, then decayed via a cosine schedule over 400 finetuning steps.
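A short Python sketch of this schedule; the peak learning rate and warmup length are placeholders, since the source specifies only the linear warmup, the cosine decay, and the 400 finetuning steps:

```python
import math

def lr_at(step, total_steps=400, warmup_steps=40, peak_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay to zero.

    peak_lr and warmup_steps are illustrative placeholders; the source
    specifies only the 400-step finetuning horizon and the schedule shape.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

schedule = [lr_at(s) for s in range(400)]
```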
4. Annotation Requirements: Mitigation of Hand Mesh Dependence
Unlike prior works that require dense frame-wise hand–object mesh annotations (cf. AnchorCrafter, ManiVideo), ByteLoom substitutes these with 2D skeletons from OpenPose/DWPose (23 body, 21 hand joints). No explicit hand mesh or 3D hand model enters the diffusion conditioning. The DiT self-attention infers occlusion cues between stick-figure joints and RCM-driven object surfaces, substantially lowering the barrier for dataset creation and enabling broader applicability to videos for which only 2D pose and an object mesh are available.
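As one illustration of skeleton-only conditioning, the sketch below rasterizes 2D keypoints into Gaussian heatmaps that can be concatenated channel-wise with the video latent; the heatmap representation, resolution, and the 21-joints-per-hand layout are assumptions, and the paper's actual pose encoding may differ:

```python
import numpy as np

def joints_to_heatmap(joints_xy, height=64, width=64, sigma=1.5):
    """Rasterize 2D keypoints into per-joint Gaussian heatmaps.

    joints_xy: (J, 2) pixel coordinates; here J = 65, assuming 23 body joints
    plus 21 joints per hand. No hand mesh or 3D hand model is involved.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(joints_xy), height, width), dtype=np.float32)
    for j, (x, y) in enumerate(joints_xy):
        maps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps  # (J, H, W), ready to concatenate with per-frame latents

pose_cond = joints_to_heatmap(np.random.rand(65, 2) * 64)
```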
5. Quantitative Results and Benchmarking
On the Mani4D-Test set (15 sequences × 97 frames, novel humans/objects), ByteLoom is evaluated across object, subject, and motion metrics:
| Metric | Description | ByteLoom vs. Prior |
|---|---|---|
| Obj-IoU | Intersection-over-Union, gen. object mask (SAM2) vs GT | +18% |
| Obj-CLIP | CLIP-based cosine similarity (object crop vs reference) | +18% |
| Face-Cos | AdaFace cosine similarity (gen. vs reference face) | 0.889 (vs 0.577) |
| LMD | Landmark Mean Distance, hand joint consistency (lower is better) | 0.142 (vs 0.263) |
| T-SSIM | Temporal mean SSIM between frames | 0.568 (vs 0.529) |
Additional VBench metrics are reported: Subj-Cons, Back-Cons, and Mot-Smth for subject consistency, background consistency, and motion smoothness. ByteLoom consistently outperforms AnchorCrafter, UniAnimate-DiT, and MimicMotion across all metrics, demonstrating robust identity preservation and geometric fidelity.
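For concreteness, the sketch below shows one way the mask-overlap and temporal-consistency metrics could be computed (IoU for Obj-IoU, mean frame-to-frame SSIM for T-SSIM); the paper's exact evaluation protocol, SAM2 mask extraction, and cropping steps are not reproduced here:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def obj_iou(pred_mask, gt_mask):
    """IoU between a generated object mask (e.g., from SAM2) and the GT mask."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def temporal_ssim(frames):
    """Mean SSIM between consecutive frames of a (T, H, W, 3) uint8 video."""
    scores = [
        ssim(frames[t], frames[t + 1], channel_axis=-1)
        for t in range(len(frames) - 1)
    ]
    return float(np.mean(scores))
```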
6. Ablation Analyses and Generalization
Comprehensive ablation studies reveal:
- RCM-cache ablation: Removing RCM inputs (multi-view cache and per-frame) drops Obj-IoU from 0.829 to 0.662 and T-SSIM from 0.568 to 0.558.
- Curriculum ablation: Omitting Stage II (hand–object data) increases LMD (0.205 vs 0.143) and decreases Obj-IoU (0.763 vs 0.829). The optional object-only NVS stage has only marginal impact and can even be detrimental to HOI performance.
- Generalization to novel inputs: ByteLoom handles unseen human reference images and untrained 3D object meshes with only small drops in Obj-CLIP and LMD.
This suggests strong generalization capabilities for in-the-wild HOI video synthesis, appropriate for diverse digital, commercial, and robotic manipulation contexts (Liu et al., 28 Dec 2025).