Identifying conditioning strategies that best preserve hand fidelity, realism, and temporal coherence

Determine which conditioning strategies for video diffusion models best preserve hand fidelity, realism, and temporal coherence when conditioning on tracked joint-level hand poses.

Background

Even if joint-level hand poses can be injected into video diffusion models, different conditioning mechanisms may vary in their ability to maintain visual quality and temporal stability.

The paper notes that it is unclear which strategy (e.g., token concatenation, addition, cross-attention, ControlNet-style conditioning, or adaptive layer normalization) most effectively keeps generated hands realistic and temporally coherent.
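
To make the difference between these mechanisms concrete, the following is a minimal NumPy sketch of how each one could inject a pose embedding into a sequence of video tokens. It is not the paper's implementation: the function names, the toy sizes (T, P, D), the random weights, and the assumption that pose features are available both as a short token sequence (c) and as per-video-token features (c_aligned) are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): T video tokens, P pose tokens, D channels.
T, P, D = 6, 4, 8
x = rng.normal(size=(T, D))          # video latent tokens
c = rng.normal(size=(P, D))          # pose tokens, e.g. embedded joint poses
c_aligned = rng.normal(size=(T, D))  # pose features aligned with each video token


def linear(d_in, d_out, zero=False):
    """Random (or zero-initialised) weight matrix standing in for a learned layer."""
    if zero:
        return np.zeros((d_in, d_out))
    return rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def concat_tokens(x, c):
    """Token concatenation: pose tokens join the sequence seen by self-attention."""
    return np.concatenate([x, c], axis=0)


def add_tokens(x, c_aligned, W):
    """Addition: projected pose features are summed onto the video tokens."""
    return x + c_aligned @ W


def cross_attention(x, c, Wq, Wk, Wv):
    """Cross-attention: video tokens query the pose tokens (residual update)."""
    q, k, v = x @ Wq, c @ Wk, c @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x + attn @ v


def controlnet_style(x, c_aligned, W_branch, W_zero):
    """ControlNet-style: a side branch sees the condition and feeds back
    through a zero-initialised projection, so it starts as an identity."""
    h = np.tanh((x + c_aligned) @ W_branch)
    return x + h @ W_zero


def adaln(x, c_global, W_scale, W_shift, eps=1e-6):
    """Adaptive layer norm: a pooled pose vector predicts per-channel scale/shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return x_norm * (1.0 + c_global @ W_scale) + c_global @ W_shift


outputs = {
    "concat": concat_tokens(x, c),
    "add": add_tokens(x, c_aligned, linear(D, D)),
    "cross_attention": cross_attention(x, c, linear(D, D), linear(D, D), linear(D, D)),
    "controlnet": controlnet_style(x, c_aligned, linear(D, D), linear(D, D, zero=True)),
    "adaln": adaln(x, c.mean(axis=0), linear(D, D), linear(D, D)),
}
for name, out in outputs.items():
    print(name, out.shape)
```

In this sketch, concatenation is the only variant that lengthens the token sequence, while the others keep the video token shape unchanged. Adaptive layer normalization pools the pose into a single vector, so it discards the spatial layout of the joints unless that layout is encoded elsewhere. The sketch says nothing about which variant yields better hands; that is the open question.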

References

Furthermore, it is unclear which conditioning strategies best preserve hand fidelity, realism, and temporal coherence in video generation.

— Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control (2602.18422 - Xie et al., 20 Feb 2026) in Section 1: Introduction
