Identifying conditioning strategies that best preserve hand fidelity, realism, and temporal coherence

Determine which conditioning strategies best preserve hand fidelity, realism, and temporal coherence when a video diffusion model is conditioned on tracked joint-level hand poses.

Background

Even if joint-level hand poses can be injected into video diffusion models, conditioning mechanisms may differ in how well they maintain visual quality and temporal stability.

The paper notes that it is unclear which strategy (e.g., token concatenation, addition, cross-attention, ControlNet-style conditioning, or adaptive layer normalization) is most effective for preserving realistic and temporally coherent hand appearance.
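
To make the candidate mechanisms concrete, the following is a minimal NumPy sketch of four of them applied to a toy token sequence. It is an illustration under stated assumptions, not the paper's implementation: the tensor shapes, the choice of 21 joints per hand, the random projection matrices (W_joint, W_frame, W_mod), and all function names are hypothetical. The cross-attention variant is simplified (no separate query, key, or value projections), and the ControlNet-style variant is omitted because it requires a copy of the backbone network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): frames, video tokens per frame, channel width.
T, N, D = 4, 6, 8
J = 21  # joints per hand; a common convention, assumed here

video_tokens = rng.normal(size=(T, N, D))
joints = rng.normal(size=(T, J, 3))  # stand-in for tracked 3D joint positions

# Hypothetical learned projections, randomly initialised for this sketch.
W_joint = rng.normal(size=(3, D)) * 0.1      # embeds one joint as one token
W_frame = rng.normal(size=(J * 3, D)) * 0.1  # embeds a whole frame's pose
W_mod = rng.normal(size=(D, 2 * D)) * 0.1    # predicts adaLN scale and shift


def concat_conditioning(x, pose):
    """Token concatenation: append one pose token per joint to each frame."""
    pose_tokens = pose @ W_joint                     # (T, J, D)
    return np.concatenate([x, pose_tokens], axis=1)  # (T, N + J, D)


def additive_conditioning(x, pose):
    """Addition: add a per-frame pose embedding to every video token."""
    frame_emb = pose.reshape(pose.shape[0], -1) @ W_frame  # (T, D)
    return x + frame_emb[:, None, :]


def layer_norm(x, eps=1e-5):
    """Layer normalization over the channel axis, without learned parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaln_conditioning(x, pose):
    """Adaptive layer norm: the pose embedding predicts scale and shift."""
    frame_emb = pose.reshape(pose.shape[0], -1) @ W_frame   # (T, D)
    scale, shift = np.split(frame_emb @ W_mod, 2, axis=-1)  # each (T, D)
    return layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]


def cross_attention_conditioning(x, pose):
    """Cross-attention: video tokens attend to pose tokens, added residually."""
    pose_tokens = pose @ W_joint                              # (T, J, D)
    scores = x @ pose_tokens.transpose(0, 2, 1) / np.sqrt(D)  # (T, N, J)
    scores = scores - scores.max(axis=-1, keepdims=True)      # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ pose_tokens                          # (T, N, D)
```

Only concatenation changes the sequence length (from N to N + J tokens per frame); the other three keep the video token shape and differ in where the pose signal enters. Which of these best preserves hand fidelity and temporal coherence is exactly the open question, so the sketch shows the interfaces only and makes no claim about relative quality.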

References

Furthermore, it is unclear which conditioning strategies best preserve hand fidelity, realism, and temporal coherence in video generation.