Anchor-Based View-Aware Motion Embedding
- The paper introduces an anchor-based embedding mechanism that leverages a fixed set of view-specific anchor vectors to achieve efficient and consistent motion representation across different viewpoints.
- It employs spherical linear interpolation and RoPE-based techniques to guide smooth motion embedding transitions and enable robust cross-view synthesis.
- Empirical results demonstrate improved novel-view synthesis accuracy, faster convergence, and reduced computational overhead compared to traditional per-view embedding methods.
Anchor-based view-aware motion embedding refers to a family of techniques for encoding, manipulating, and transferring motion information in video or 3D visual data, optimized for cross-view consistency and efficient representation. The approach leverages a discrete set of learned "anchor" embeddings tied to particular spatial viewpoints or object locations and employs view-sensitive interpolation mechanisms to generate motion representations for arbitrary or unseen views. These embeddings serve as persistent, addressable motion cues within larger motion synthesis or reconstruction frameworks, offering dramatic computational and generalization advantages over traditional per-view or per-instance codes.
1. Motivation and Background
Conventional motion embedding strategies, particularly in multi-view video analysis and generation, often use separate learnable codes for each camera angle or object view. In per-view condition embedding, the number of codes, and hence storage and computation cost, scales linearly with the number of views, leading to inefficiency and poor generalization. Moreover, such codes are typically optimized view-specifically, which limits transferability: motion representations learned for one viewpoint tend to be inconsistent or unstable when applied to nearby or novel views, causing artifacts or slow convergence in downstream tasks such as motion transfer or video generation (Bekor et al., 18 Nov 2025).
Anchor-based view-aware motion embedding addresses these limitations via (1) the introduction of a fixed, small set of view- or region-tied anchor embeddings, and (2) view- or location-aware interpolation, ensuring that motion representations are smoothly shared and coupled across neighboring viewpoints or spatial addresses. This approach is central to recent advances in semantic 3D motion transfer from multiview video (Gaussian See, Gaussian Do) and spatiotemporally coherent video synthesis (STANCE) (Bekor et al., 18 Nov 2025, Chen et al., 16 Oct 2025).
2. Formal Mechanisms: Mathematical Formulation and Interpolation
At the core, the anchor-based embedding scheme introduces $N$ learnable motion anchor vectors $\{a_i\}_{i=1}^{N}$, each associated with a fixed azimuth or spatial coordinate $\theta_i$ (for instance, sampled uniformly on $[0, 2\pi)$ in the case of view angles):

$$\theta_i = \frac{2\pi\,(i-1)}{N}, \qquad i = 1, \dots, N.$$

To obtain a motion code for an arbitrary view or camera pose $\theta$, the two nearest anchors $a_j$ and $a_{j+1}$ are identified (with $\theta_j \le \theta < \theta_{j+1}$). The embedding is produced via spherical linear interpolation (slerp):

$$e(\theta) = \operatorname{slerp}\big(a_j, a_{j+1}; t\big) = \frac{\sin\big((1-t)\,\Omega\big)}{\sin\Omega}\, a_j + \frac{\sin\big(t\,\Omega\big)}{\sin\Omega}\, a_{j+1},$$

where

$$t = \frac{\theta - \theta_j}{\theta_{j+1} - \theta_j}, \qquad \Omega = \arccos\!\left(\frac{\langle a_j, a_{j+1}\rangle}{\lVert a_j\rVert\,\lVert a_{j+1}\rVert}\right).$$
This slerp interpolation ensures that motion embeddings shift smoothly and continuously as a function of viewpoint, enforcing cross-view consistency and latent code sharing. During inversion (training), the anchor embeddings are optimized by minimizing the reconstruction error of a frozen denoising diffusion model, using the interpolated embedding $e(\theta)$ as a motion condition (Bekor et al., 18 Nov 2025).
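A minimal sketch of this interpolation step (NumPy, assuming anchors laid out uniformly on $[0, 2\pi)$ as above; array sizes and helper names are illustrative, not the paper's code):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-7) -> np.ndarray:
    """Spherical linear interpolation between two anchor embeddings."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))  # angle between anchors
    if omega < eps:                                          # nearly parallel: fall back to lerp
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

def interpolate_motion_code(theta: float, anchors: np.ndarray) -> np.ndarray:
    """Motion code for view angle `theta` from N anchors placed uniformly on [0, 2*pi)."""
    n = anchors.shape[0]
    step = 2.0 * np.pi / n
    j = int(theta // step) % n        # nearest anchor at or below theta
    t = (theta - j * step) / step     # normalized offset toward the next anchor
    return slerp(anchors[j], anchors[(j + 1) % n], t)

# Example: a bank of 8 anchors of width 768, queried at an in-between camera azimuth.
anchors = np.random.randn(8, 768)
code = interpolate_motion_code(np.deg2rad(100.0), anchors)
```

Because every query blends exactly two neighboring anchors, codes for nearby azimuths share parameters by construction, which is what enforces the smooth cross-view transitions described above.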
A related approach in video generation via STANCE uses spatially anchored tokens: control points identified from the first frame are tagged with static rotary positional embeddings (RoPE) based on their position, making these tokens persistent and spatially addressable for all subsequent frames. The transformer backbone attends to these anchors at each layer and time step, enabling precise injection and tracking of motion cues in latent space (Chen et al., 16 Oct 2025).
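A rough sketch of such static positional tagging, assuming a standard axial RoPE with the channel dimension split between the x and y coordinates of each control point; function names and shapes are illustrative assumptions, not STANCE's implementation:

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of `x` (..., d) by angles pos * freq (standard RoPE)."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (d/2,) frequencies
    ang = pos[..., None] * freqs                                 # (..., d/2) rotation angles
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def tag_anchor_tokens(feats: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
    """Tag anchor-token features (A, d) with rotary embeddings fixed to their
    first-frame (x, y) positions; reusing the same positions for every frame is
    what keeps the tokens persistently addressable."""
    half = feats.shape[-1] // 2
    out = feats.clone()
    out[..., :half] = rope_1d(feats[..., :half], xy[..., 0])  # x-axis rotation
    out[..., half:] = rope_1d(feats[..., half:], xy[..., 1])  # y-axis rotation
    return out

# Example: 4 anchor tokens of width 128 pinned to pixel coordinates from frame 0.
tokens = torch.randn(4, 128)
coords = torch.tensor([[12.0, 30.0], [80.0, 44.0], [51.0, 9.0], [100.0, 73.0]])
tagged = tag_anchor_tokens(tokens, coords)
```

Since the rotation angles depend only on the first-frame coordinates, the same tagged tokens can be appended to every frame's token sequence and attended to at each layer and time step.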
3. Integration Into Motion Transfer and Video Synthesis Pipelines
The anchor-based view-aware embedding mechanism is integrated into multi-stage pipelines, with distinct stages for motion inversion, supervised synthesis, and high-fidelity 4D reconstruction. In Gaussian See, Gaussian Do (Bekor et al., 18 Nov 2025), the schematic workflow is as follows:
- Structured Multiview Motion Inversion: Anchor embeddings are learned from source multiview videos by minimizing reconstruction loss under a frozen diffusion model. Each update step interpolates the motion embedding for a randomly selected view and jointly updates the two nearest anchors (see the sketch after this list).
- View-aware Supervision Video Generation: For each target or supervision viewpoint, the static target shape is rendered and motion code is produced via anchor interpolation. The diffusion denoiser synthesizes motion videos conditioned on this embedding.
- 4D Gaussian Splatting Consolidation: The synthetic videos supervise a deformation module that animates a dynamic scene via MLP-based control of 3D Gaussian splats, trained to match the appearance (via LPIPS) and preserve local structure (via ARAP loss).
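As a concrete, heavily simplified illustration of the first stage, the sketch below performs one inversion update: it samples a source view, interpolates its motion code from the two nearest anchors, and backpropagates a denoising loss into the anchors while the diffusion model stays frozen. The `denoiser` callable, its signature, and the constant noise level are placeholders standing in for the frozen model and its noise schedule; only the interpolation and anchor update reflect the mechanism described above.

```python
import torch
import torch.nn.functional as F

def slerp_t(a, b, t, eps=1e-7):
    """Differentiable spherical interpolation between two anchor embeddings."""
    an, bn = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    omega = torch.arccos(torch.clamp((an * bn).sum(-1), -1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

def inversion_step(anchors, optimizer, denoiser, videos, thetas):
    """One structured multiview motion-inversion update on the anchor bank."""
    n = anchors.shape[0]
    v = torch.randint(len(thetas), (1,)).item()        # randomly selected source view
    step = 2 * torch.pi / n
    j = int(thetas[v] // step) % n
    t_frac = (thetas[v] - j * step) / step
    code = slerp_t(anchors[j], anchors[(j + 1) % n], t_frac)  # interpolated motion code

    x0 = videos[v]                                     # latent clip for this view
    noise = torch.randn_like(x0)
    alpha_bar = 0.5                                    # placeholder; a real schedule indexes this by timestep
    x_t = alpha_bar**0.5 * x0 + (1 - alpha_bar)**0.5 * noise  # forward noising
    loss = F.mse_loss(denoiser(x_t, code), noise)      # hypothetical frozen denoiser call

    optimizer.zero_grad()
    loss.backward()                                    # nonzero gradients reach only rows j and j+1
    optimizer.step()
    return loss.item()

# `anchors` would be a learnable (N, d) parameter, e.g.
# anchors = torch.nn.Parameter(torch.randn(8, 768)); opt = torch.optim.Adam([anchors], lr=1e-3)
```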
In the STANCE pipeline (Chen et al., 16 Oct 2025), anchor tokens are defined via foreground instance masks and 2D/2.5D instance cues, with RoPE ensuring their addressability and persistence. These anchors inject motion information into every transformer layer by biasing attention, and their influence is regularized by requiring joint prediction of both RGB appearance and auxiliary structural signals (depth or segmentation).
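A schematic of that joint appearance/structure regularization (the loss form, weighting, and the choice of depth as the auxiliary signal are assumptions for illustration, not STANCE's exact objective):

```python
import torch
import torch.nn.functional as F

def joint_anchor_regularized_loss(pred_rgb, pred_aux, gt_rgb, gt_aux, w_aux=0.5):
    """The same anchor-conditioned backbone must explain both the RGB frames and an
    auxiliary structural signal (e.g., depth), pushing the anchor tokens to encode
    geometry rather than appearance shortcuts."""
    loss_rgb = F.mse_loss(pred_rgb, gt_rgb)   # appearance reconstruction
    loss_aux = F.mse_loss(pred_aux, gt_aux)   # auxiliary structural prediction
    return loss_rgb + w_aux * loss_aux

# Example with dummy tensors: batch of 2 clips, 8 frames, 64x64 resolution.
B, T, H, W = 2, 8, 64, 64
loss = joint_anchor_regularized_loss(
    torch.randn(B, T, 3, H, W), torch.randn(B, T, 1, H, W),
    torch.randn(B, T, 3, H, W), torch.randn(B, T, 1, H, W),
)
```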
4. View-Awareness, Cross-View Consistency, and Embedding Efficiency
The anchor-based approach fundamentally ties learnable motion codes to view or spatial anchors, in effect constructing a compact bank of motion tokens that can adapt to arbitrary or previously unseen views through continuous interpolation (slerp or RoPE-based mixing). Unlike per-view representations, this design creates robust "bridges" between adjacent viewpoints, enforcing smoothness and strong generalization.
Empirically, this results in several notable effects:
- Cross-view Consistency: As shown in (Bekor et al., 18 Nov 2025, Table 2), anchor-based embeddings achieve significantly lower MSE and LPIPS scores on novel-view synthesis (0.0028 MSE, 0.040 LPIPS) than per-view or global codes.
- Convergence Speed: Because each optimization step updates multiple anchors via the overlapping interpolation, learning converges notably faster than with independent per-view codes; the reported optimum uses a moderate number of anchors for their setup.
- Embedding Efficiency: The approach enables high-quality motion representation with dramatically fewer learned tokens compared to the number of views.
STANCE extends this anchor concept to the context of spatio-temporal transformers by anchoring tokens on detected instances with persistent positional embeddings. This grants the model reliable access to explicit motion control cues across all frames (Chen et al., 16 Oct 2025).
5. Empirical Results, Ablations, and Benchmarks
Anchor-based view-aware motion embedding underpins demonstrable advances in both 3D motion transfer and video generation:
- Motion Fidelity and Structural Consistency: In the benchmarks established by (Bekor et al., 18 Nov 2025), anchor-based embeddings outperform standard and recent baselines (including per-view interpolation and "global" codes), achieving up to +0.13 motion fidelity improvement and 0.02–0.05 higher CLIP-I similarity.
- Novel-View Synthesis: Performance on unseen views is markedly improved; approaches lacking view-aware anchors suffer order-of-magnitude higher error and artifact rates.
- Human Study: Observers consistently prefer reconstructions produced using anchor-based embeddings for motion plausibility and appearance.
- Effect of anchor count $N$: Ablations indicate a unimodal trade-off between performance and anchor count: too few anchors preclude sufficient flexibility, while too many reduce cross-view sharing and slow convergence.
These results validate the core claim that anchor-based, view-aware embedding serves as an effective mechanism—or "glue"—for lifting 2D implicit inversion methods into robust 3D and 4D settings, as well as for establishing cross-frame temporal coherence in video transformers (Bekor et al., 18 Nov 2025, Chen et al., 16 Oct 2025).
6. Comparative Approaches and Extensions
Anchor-based view-aware motion embeddings differ from traditional per-view codes, spatially agnostic embeddings, and naive token spreading by offering (1) interpolation-based sharing across views, (2) explicit spatial or angular addressability, and (3) efficiency via dimensionality reduction. In STANCE, this is further extended to simultaneous appearance-structure optimization by using joint RGB and auxiliary-map prediction losses, making the anchor tokens strong geometrical witnesses rather than mere appearance guiders (Chen et al., 16 Oct 2025).
A plausible implication is that similar anchor-based mechanisms could be adopted in other generative or reconstructive frameworks, including text-to-video, 4D scene synthesis, or neural rendering at scale, wherever cross-view or cross-instance consistency is required. Future work may investigate optimal anchor layouts, combinations with explicit smoothness regularization, or task-specific strategies for anchor selection and interpolation.
7. Impact and Future Directions
Anchor-based view-aware motion embedding represents a key architectural innovation for scalable, generalizable, and cross-view consistent motion representation in 3D and video synthesis pipelines. Its adoption eliminates several bottlenecks of traditional per-view approaches, including computation, memory, and generalization to novel views. Empirical validation in recent literature demonstrates state-of-the-art performance in semantic 3D motion transfer and temporally coherent video generation benchmarks.
Potential future directions include anchor learning for more complex viewing/manipulation manifolds (e.g., non-circular camera paths, spatially dynamic anchors), joint optimization with explicit global scene priors, and extensions into hierarchical or multi-scale anchor organizations for even broader applicability across multi-view, multi-object, or spatio-temporal generative tasks (Bekor et al., 18 Nov 2025, Chen et al., 16 Oct 2025).