ByteLoom: HOI Video Synthesis Framework

Updated 31 December 2025
  • ByteLoom is a framework for HOI video synthesis that leverages a latent Diffusion Transformer and multi-view RCM caches to ensure geometric realism.
  • It integrates human pose, per-frame RCMs, and multi-view object data through latent fusion and a progressive curriculum learning strategy that improves HOI motion fidelity.
  • Benchmarking and ablation studies demonstrate that ByteLoom significantly outperforms previous methods on metrics like Obj-IoU, CLIP similarity, and temporal consistency.

ByteLoom is a framework for human-object interaction (HOI) video generation, synthesizing sequences that maintain geometric consistency and multi-view realism of manipulated objects without dependence on fine-grained hand mesh annotations. Built on a latent Diffusion Transformer (DiT) architecture, ByteLoom leverages Relative Coordinate Map caches and a progressive curriculum learning regime to deliver state-of-the-art results across object consistency and HOI motion fidelity benchmarks (Liu et al., 28 Dec 2025).

1. Architectural Foundation

ByteLoom utilizes a latent DiT backbone with a “latent fusion” frontend that integrates multiple conditioning signals into the input latent space. Each video synthesis involves the following conditioning:

  • Human reference image (either untextured or VAE-encoded)
  • Per-frame 2D human pose via OpenPose or DWPose joints
  • Multi-view object data from an RCM-cache (each entry: RGB image and corresponding RCM)
  • Per-frame RCM specifying the object’s 6-DoF pose (rotation, translation)

Condition latents are concatenated along the appropriate dimensions (channel for pose/RCM, temporal for the cached RCM-RGB pairs) and processed by a small MLP (the “latent fuser”). The DiT backbone then follows the standard denoising pipeline: time-embedded, ε-noised latents pass through a stack of Transformer blocks with cross-attention over the fused conditioning, producing denoised latents that the VAE decodes into video frames.
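
A minimal PyTorch-style sketch of this fusion step follows; the tensor layout (frame-major, channels-last), the channel widths, and the LatentFuser name are illustrative assumptions rather than the paper's implementation, and the human reference latent is omitted for brevity.

import torch
import torch.nn as nn

class LatentFuser(nn.Module):
    """Sketch of the latent fuser: channel-concatenate per-frame conditions,
    project with a small MLP, and append the K cached (RGB, RCM) view latents
    along the temporal axis."""
    def __init__(self, c_lat=16, c_pose=16, c_rcm=16, c_out=16):
        super().__init__()
        self.frame_mlp = nn.Sequential(
            nn.Linear(c_lat + c_pose + c_rcm, 4 * c_out), nn.SiLU(),
            nn.Linear(4 * c_out, c_out))
        self.cache_mlp = nn.Sequential(
            nn.Linear(c_lat + c_rcm, 4 * c_out), nn.SiLU(),
            nn.Linear(4 * c_out, c_out))

    def forward(self, z_t, pose_lat, rcm_lat, cache_rgb_lat, cache_rcm_lat):
        # z_t, pose_lat, rcm_lat: (B, F, H, W, C) latents for F video frames
        # cache_*_lat:            (B, K, H, W, C) latents for K cached views
        frames = self.frame_mlp(torch.cat([z_t, pose_lat, rcm_lat], dim=-1))      # channel concat
        cache = self.cache_mlp(torch.cat([cache_rgb_lat, cache_rcm_lat], dim=-1))
        return torch.cat([frames, cache], dim=1)   # temporal concat: F + K entries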

The stepwise synthesis pipeline is:

z_T ~ N(0, I)                                       # initialize the video latent with Gaussian noise
c = LatentFuse(                                     # conditioning is built once and reused at every step
      MLP_pose(human_pose_{1..F}),                  # per-frame 2D pose latents (channel concat)
      per_frame_RCM_{1..F},                         # per-frame object RCMs under the target 6-DoF poses (channel concat)
      MLP_cache(RCM_cache_images, RCM_cache_maps),  # K cached RGB+RCM views (temporal concat)
      MLP_img(human_reference)                      # human reference image latent
    )
for t = T to 1 do
  ε_θ = DiT(z_t, c, t)                              # predict noise with cross-attention over c
  z_{t-1} = z_t - α_t * ε_θ                         # simplified denoising update
end
output_video = VAE.decode(z_0)

Here, α_t is the usual diffusion step size.

2. Relative Coordinate Map (RCM) Cache

A unique innovation of ByteLoom lies in the RCM-cache, a persistent dictionary containing sparse-view object references and their geometric priors. Each cache entry stores an (RGB, RCM) pair. The RCM for 3D object mesh vertices {V_i} is defined by:

C_{\mathrm{RCM}}(V_i) = \frac{V_i - b_{\min}}{b_{\max} - b_{\min}} \in [0, 1]^3

where b_min and b_max are the corners of the object's axis-aligned bounding box, and the per-vertex colors are quantized to 8 bits:

C_i = \left\lfloor 255\, C_{\mathrm{RCM}}(V_i) \right\rfloor \in \{0, \dots, 255\}^3

Pixels during rendering are set by barycentric interpolation:

C_{\mathrm{pixel}} = \alpha C_a + \beta C_b + \gamma C_c, \qquad \alpha + \beta + \gamma = 1

This mapping encodes each pixel's normalized 3D location on the object's surface. During synthesis, K cached RGB+RCM pairs are loaded and concatenated temporally. Independently, a fresh per-frame RCM is rendered under the target pose (R_t, T_t) and concatenated along the channel axis, providing direct geometric control signals. Through joint cross-attention over the cached and per-frame RCMs, ByteLoom achieves fine control of object view synthesis and accurate 6-DoF alignment across video frames.
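
The two formulas above translate directly into a short sketch; the numpy routines below (function names and the toy mesh are illustrative, not the paper's renderer) compute per-vertex RCM colors and the barycentric pixel shading used during rasterization.

import numpy as np

def rcm_colors(vertices: np.ndarray) -> np.ndarray:
    """Per-vertex RCM colors: normalize mesh vertices into the unit cube
    spanned by the object's axis-aligned bounding box, then quantize to 8 bits.
    `vertices` is an (N, 3) array."""
    b_min = vertices.min(axis=0)
    b_max = vertices.max(axis=0)
    c_rcm = (vertices - b_min) / (b_max - b_min)       # in [0, 1]^3
    return np.floor(255.0 * c_rcm).astype(np.uint8)    # in {0, ..., 255}^3

def shade_pixel(c_a, c_b, c_c, alpha, beta, gamma):
    """Barycentric interpolation of the three triangle-vertex RCM colors
    covering a rasterized pixel (alpha + beta + gamma == 1)."""
    return alpha * c_a + beta * c_b + gamma * c_c

# Rendering the per-frame RCM under the target pose (R_t, T_t) amounts to
# transforming the mesh with R_t, T_t before rasterization; the per-vertex
# colors computed above stay fixed, so object identity is view-consistent.
verts = np.random.rand(100, 3).astype(np.float32)      # placeholder mesh vertices
colors = rcm_colors(verts)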

3. Progressive Curriculum Learning

To address the scarcity of full HOI data, ByteLoom employs a three-stage progressive curriculum scheme:

  • Stage I — Human Pose Pretraining: uses ∼6.4M clips of object-free human motion with 2D pose annotations. The model is trained with the standard diffusion reconstruction loss to synthesize temporally coherent animations conditioned on skeleton joints:

\mathcal{L}_{\mathrm{pose}} = \mathbb{E}_{z_0, t} \left[ \left\| \epsilon_\theta(z_t, t, c_{\mathrm{pose}}) - \epsilon \right\|^2 \right]

  • Stage II — Hand–Object Interaction Pretraining: trains on ∼550K clips with hand joints and textured objects (e.g., DexYCB, HO3D, ARCTIC). Conditioning now includes the RCM-cache and per-frame RCM, teaching contact formation and local occlusion.
  • (Optional) Object-Only NVS Stage: Synthetic data for object-only scenes encourages cross-view consistency but yields marginal gains for full HOI; inclusion is optional.
  • Stage III — Full HOI Finetuning: uses ∼45K studio HOI clips with full-body, object, 6-DoF pose, and RCM annotations. The model is trained to synthesize coordinated manipulation with coherent appearance using the diffusion loss:

\mathcal{L}_{\mathrm{HOI}} = \mathbb{E}_{z_0, t} \left[ \left\| \epsilon_\theta(z_t, t, c_{\mathrm{all}}) - \epsilon \right\|^2 \right]

The learning rate is linearly warmed up to 1×10⁻⁵, then decayed with a cosine schedule over 400 finetuning steps.
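
For concreteness, a hedged PyTorch sketch of the stage losses and the finetuning schedule follows; the DiT call signature, the DDPM-style forward noising, and the warmup length are assumptions, since the text only specifies the ε-prediction loss, the 1×10⁻⁵ peak learning rate, and the 400-step cosine decay.

import math
import torch
import torch.nn.functional as F

def diffusion_loss(model, z0, cond, alphas_cumprod, T=1000):
    """Noise-prediction objective used in every curriculum stage: sample a
    timestep, noise the clean latent z0, and regress the injected noise.
    `model` and `cond` are stand-ins for the DiT and its fused conditioning."""
    b = z0.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod.to(z0.device)[t].view(b, *([1] * (z0.dim() - 1)))
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # forward diffusion
    eps_pred = model(z_t, cond, t)
    return F.mse_loss(eps_pred, eps)

def finetune_lr(step, warmup=50, total=400, peak=1e-5):
    """Stage III schedule sketch: linear warmup to the peak learning rate,
    then cosine decay over the remaining finetuning steps (warmup length
    is an assumed value)."""
    if step < warmup:
        return peak * step / max(warmup, 1)
    progress = (step - warmup) / max(total - warmup, 1)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(progress, 1.0)))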

4. Annotation Requirements: Mitigation of Hand Mesh Dependence

Unlike prior works that require dense frame-wise hand–object mesh annotations (cf. AnchorCrafter, ManiVideo), ByteLoom substitutes these with 2D skeletons from OpenPose/DWPose (23 body, 21 hand joints). No explicit hand mesh or 3D hand model enters the diffusion conditioning. The DiT self-attention infers occlusion cues between the stick-figure joints and the RCM-driven object surfaces, substantially lowering the barrier to dataset creation and broadening applicability to videos for which only 2D pose and an object mesh are available.
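
As an illustration of how skeleton-only conditioning can be packaged (purely an assumption about the encoding; the text does not specify a rasterization format), the sketch below splats OpenPose/DWPose keypoints, assuming 21 joints per hand, into a per-frame pose map.

import numpy as np

# Assumed encoding: 23 body + 2 x 21 hand keypoints, each (x, y, confidence),
# rasterized into a per-frame pose map instead of fitting any hand mesh.
N_JOINTS = 23 + 2 * 21

def pose_map(keypoints: np.ndarray, h: int = 64, w: int = 64, r: int = 1) -> np.ndarray:
    """keypoints: (N_JOINTS, 3) array of normalized (x, y, confidence).
    Returns an (N_JOINTS, h, w) map with a small disc splatted per joint."""
    out = np.zeros((N_JOINTS, h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for j, (x, y, conf) in enumerate(keypoints):
        if conf <= 0:
            continue                     # skip undetected joints
        cx, cy = x * (w - 1), y * (h - 1)
        out[j] = conf * ((xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2)
    return out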

5. Quantitative Results and Benchmarking

On the Mani4D-Test set (15 sequences × 97 frames, novel humans/objects), ByteLoom is evaluated across object, subject, and motion metrics:

| Metric | Description | ByteLoom vs. prior |
| --- | --- | --- |
| Obj-IoU | Intersection-over-union of generated object mask (SAM2) vs. ground truth | +18% |
| Obj-CLIP | CLIP-based cosine similarity between object crop and reference | +18% |
| Face-Cos | AdaFace cosine similarity between generated and reference face | 0.889 (vs 0.577) |
| LMD | Landmark mean distance for hand-joint consistency (lower is better) | 0.142 (vs 0.263) |
| T-SSIM | Temporal mean SSIM between frames | 0.568 (vs 0.529) |

Additional VBench metrics are reported: Subj-Cons, Back-Cons, and Mot-Smth for subject consistency, background consistency, and motion smoothness. ByteLoom consistently outperforms AnchorCrafter, UniAnimate-DiT, and MimicMotion across all metrics, demonstrating robust identity preservation and geometric fidelity.
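
Two of the reported metrics admit straightforward reference implementations; the sketch below gives one plausible reading of Obj-IoU and T-SSIM (mask extraction with SAM2 and the exact SSIM protocol used in the paper are outside this snippet and assumed).

import numpy as np
from skimage.metrics import structural_similarity

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Obj-IoU: intersection over union between the generated object mask
    (e.g., extracted with SAM2) and the ground-truth mask (boolean arrays)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 0.0

def temporal_ssim(frames: np.ndarray) -> float:
    """One plausible reading of T-SSIM: mean SSIM over consecutive frame pairs.
    `frames` is an (F, H, W, 3) uint8 video; color space, window size, and
    frame pairing are assumptions here."""
    scores = [
        structural_similarity(frames[i], frames[i + 1], channel_axis=-1)
        for i in range(len(frames) - 1)
    ]
    return float(np.mean(scores))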

6. Ablation Analyses and Generalization

Comprehensive ablation studies reveal:

  • RCM-cache ablation: Removing RCM inputs (multi-view cache and per-frame) drops Obj-IoU from 0.829 to 0.662 and T-SSIM from 0.568 to 0.558.
  • Curriculum ablation: Omitting Stage II (hand–object data) increases LMD (0.205 vs 0.143) and decreases Obj-IoU (0.763 vs 0.829). The optional object-only NVS stage yields only marginal gains and can even be slightly detrimental to HOI performance.
  • Generalization to novel inputs: ByteLoom handles unseen human reference images and untrained 3D object meshes with only small drops in Obj-CLIP and LMD.

This suggests strong generalization capabilities for in-the-wild HOI video synthesis, appropriate for diverse digital, commercial, and robotic manipulation contexts (Liu et al., 28 Dec 2025).
