
Anchor-Heavy Identity Sinks

Updated 30 December 2025
  • Anchor-Heavy Identity Sinks (AHIS) are a video memory-management technique that divides the key-value cache into immutable sink tokens and mutable context tokens to preserve identity consistently.
  • The approach strategically biases attention towards early, high-fidelity sink tokens, ensuring that core visual features like skin tone and geometry remain stable during multi-turn video generation.
  • Empirical results in systems like LiveTalk show that AHIS significantly improves visual fidelity and stability, maintaining avatar identity over prolonged interactive sessions.

Anchor-Heavy Identity Sinks (AHIS) are a video memory management and attention-biasing technique used in real-time multimodal avatar synthesis to prevent identity drift, maintain consistent visual appearance, and ensure stability across long-horizon, multi-turn video generation. Designed to address color, geometry, and context distortions that emerge in autoregressive, blockwise video diffusion frameworks, AHIS strategically partitions the key-value (KV) memory cache within the performer's attention window into unchanging "sink" tokens and mutable short-term context tokens. This mechanism enables coherent spatiotemporal avatar identity anchoring without sacrificing real-time inference or throughput, and is specifically implemented in systems such as LiveTalk for rapid, conversational interactive video diffusion (Chern et al., 29 Dec 2025).

1. Motivation and Definition

Diffusion models for video and avatar generation often suffer from cumulative errors and instability over extended streaming or multi-turn tasks. In blockwise autoregressive architectures, the KV cache stores a limited window of latent features that serve as attention keys and values for subsequent blocks. Without explicit anchoring, visual identity—such as skin tone, facial geometry, and background details—may gradually drift, manifesting as color inconsistencies, geometric distortion, or artefactual transitions. "Anchor-Heavy Identity Sinks" are designed to mitigate this instability by designating a subset of the KV cache as immutable "sink" tokens, which represent early, high-fidelity identity frames. Attention mechanisms then bias computation towards these sinks, stabilizing long-term avatar appearance and providing a consistent reference beyond the current short block window (Chern et al., 29 Dec 2025).

2. Technical Implementation in LiveTalk

In LiveTalk, the attention cache is a fixed window of 5 blocks, divided into two distinct segments:

  • Sink tokens: The first 3 blocks after initialization, containing early, high-quality latent encodings of visual identity. These blocks are never updated or evicted throughout the session.
  • Rolling tokens: The last 2 blocks; these shift as new blocks are generated, capturing immediate dialog context and motion features.

At each autoregressive block, the performer's transformer attends to both sinks and rolling tokens, with explicit bias that favors the former for identity-relevant computations. This approach is operationalized as:

  • Sinks: Permanent anchors for key aspects of visual identity.
  • Rolling: Dynamic context for conversational progression and motion.
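This cache schedule can be sketched in a few lines. The class below is a hypothetical illustration only: block contents are opaque placeholders, whereas LiveTalk's actual implementation operates on per-block key-value tensors.

```python
# Hypothetical sketch of the AHIS cache schedule: a fixed window of
# 5 blocks split into 3 immutable sink blocks and 2 rolling blocks.

class AHISCache:
    def __init__(self, num_sink_blocks=3, num_rolling_blocks=2):
        self.num_sink_blocks = num_sink_blocks
        self.num_rolling_blocks = num_rolling_blocks
        self.sinks = []    # frozen after initialization; never updated or evicted
        self.rolling = []  # FIFO window of the most recent blocks

    def append_block(self, block_kv):
        # The first blocks after initialization become permanent identity anchors.
        if len(self.sinks) < self.num_sink_blocks:
            self.sinks.append(block_kv)
            return
        # Later blocks enter the rolling window; the oldest is evicted.
        self.rolling.append(block_kv)
        if len(self.rolling) > self.num_rolling_blocks:
            self.rolling.pop(0)

    def attention_context(self):
        # Each new block attends to both sink and rolling tokens.
        return self.sinks + self.rolling

cache = AHISCache()
for i in range(10):
    cache.append_block(f"block_{i}")

print(cache.sinks)    # the first 3 blocks remain frozen anchors
print(cache.rolling)  # only the 2 most recent blocks remain
```

Note that eviction applies only to the rolling segment; no schedule ever touches the sinks, which is the property the sections below rely on.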

This partitioning ensures that regardless of temporal length or conversational complexity, the avatar consistently maintains critical appearance features anchored by the sinks. Empirical demonstration in LiveTalk shows that avatars maintain consistent haircut and lighting over 50-second multi-turn sessions with no visual drift (Chern et al., 29 Dec 2025).

3. Theoretical Rationale and Operational Mechanics

The theoretical justification for AHIS arises from the memory-constrained autoregressive sampling dynamics of video diffusion. When the KV cache exclusively tracks the last T blocks, recent context dominates, but identity information encoded in early frames is rapidly overwritten. AHIS solves this by:

  • Preventing anchor overwrite: Sink tokens are strictly not updated or evicted within the cache schedule.
  • Biasing attention scores: Sink tokens are upweighted in attention computations for appearance and identity features.
  • Denoising integration: At each block, the denoising transformer uses both anchor and rolling tokens, enabling stable updates while preserving identity context.
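The attention-score biasing can be illustrated with a minimal single-query sketch. The additive `sink_bias` term, its value, and the token layout below are assumptions for illustration, not LiveTalk's published implementation.

```python
# Hypothetical sketch of sink-biased attention: an additive bias upweights
# keys belonging to sink blocks before the softmax.
import numpy as np

def sink_biased_attention(q, k, v, sink_mask, sink_bias=2.0):
    """q: (d,), k/v: (n, d), sink_mask: (n,) with 1.0 marking sink tokens."""
    d = q.shape[-1]
    scores = k @ q / np.sqrt(d)              # standard scaled dot-product scores
    scores = scores + sink_bias * sink_mask  # bias attention toward sink tokens
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 6, 8
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))
q = rng.normal(size=(d,))
sink_mask = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # first 3 tokens are sinks

out, w = sink_biased_attention(q, k, v, sink_mask)
out0, w0 = sink_biased_attention(q, k, v, sink_mask, sink_bias=0.0)
# With a positive bias, sink tokens receive strictly more attention mass
# than under unbiased attention.
print(w[:3].sum() > w0[:3].sum())
```

Because the bias is additive in logit space, it shifts attention mass toward the sinks without ever zeroing out the rolling context, which matches the requirement that recent dialog and motion features still contribute to each denoising step.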

A plausible implication is that by breaking the temporal symmetry of the cache, the system maintains a stable "identity manifold" in latent space, against which recent context is flexibly synthesized for natural interaction.

4. Impact on System Stability and Visual Fidelity

Quantitative and qualitative evaluations suggest that AHIS substantially improves video coherence and long-horizon stability in multimodal interactive avatar generation:

  • Multi-turn video coherence: On curated multi-round benchmarks, LiveTalk achieves a multi-video coherence (MVC) score of 87.3%, compared with 26.7% for Veo3, while Sora2 shows poor completeness under otherwise equivalent conditions (Chern et al., 29 Dec 2025).
  • Content quality: Human evaluators report consistently higher information completeness and overall experience, while first-frame and per-turn latencies remain in the sub-second range.
  • Prevents drift: Visual identities, including hair, lighting, and facial structure, remain stable throughout extended dialog, even as conversational content shifts.

This suggests that Anchor-Heavy Identity Sinks are essential for robust, visually coherent multimodal video generation at conversational speeds.

5. Relation to Other Identity and Memory Techniques

AHIS differs from simple fixed-term memory or recurrent architectures by combining immutable visual anchors with mutable short-term memory. Unlike bidirectional attention mechanisms (e.g., SoulX-LiveTalk's self-correcting bidirectional distillation (Shen et al., 29 Dec 2025)), which preserve motion coherence within chunks, AHIS explicitly segments long-horizon memory, focusing on identity persistence over conversational context.

A plausible implication is that this hybrid memory strategy could be further generalized to other tasks involving identity preservation under continuous, real-time generation—such as 3D avatar synthesis or AR presentations—where cumulative context drift is a persistent challenge.

6. Limitations and Future Directions

The fixed ratio of sink to rolling tokens constrains the window for contextual adaptation; extending AHIS to variable-length sinks or dynamically refreshed anchors may improve flexibility for complex scenes or multi-speaker interactions. Future work identified includes reducing memory footprint for mobile deployment and extending anchors to encode pose or full-body appearance. Applications beyond multimodal avatars, such as AR storytelling (RealityTalk (Liao et al., 2022)), may also benefit from anchor-heavy cache management, though implementation details remain subject to empirical validation.

Overall, Anchor-Heavy Identity Sinks provide a critical mechanism for high-fidelity, temporally stable identity maintenance in streaming video diffusion, enabling real-time, interactive multimodal systems to deliver consistent digital human experiences across extended conversational or creative tasks (Chern et al., 29 Dec 2025).
