
Dual-Clock Denoising

Updated 13 November 2025
  • Dual-clock denoising is a technique that interleaves a strong-guidance clock and a weak-guidance clock to manage region-dependent fidelity in signal restoration.
  • In video, it blends precise object motion with realistic backgrounds via two noise levels; in audio, it applies chunked transformer processing for efficient denoising.
  • Empirical benchmarks show improved CTD, MSE, and FID for video and improved PESQ, STOI, and SDR for audio compared to traditional single-clock methods.

Dual-clock denoising refers to a structured approach for denoising signals, primarily in video and audio generation, by interleaving two distinct processing schedules ("clocks") with different levels of guidance across spatial or temporal regions. This paradigm enables region-dependent fidelity and flexibility, allowing exact adherence to user instructions in some areas while permitting natural evolution elsewhere. Recent advances demonstrate dual-clock denoising as a training-free motion/appearance control mechanism in video diffusion models (Singer et al., 9 Nov 2025) and as an efficient, explainable transformer operation in audio denoising (Li et al., 2023).

1. Motivation and Conceptual Framework

The impetus for dual-clock denoising arises from limitations in single-clock approaches, notably in diffusion-based editing, where one fixed noise level $t^*$ forces a trade-off: low noise yields strong adherence (risking static or artifact-laden backgrounds), while high noise encourages realism at the expense of precise control in user-targeted regions. Dual-clock denoising resolves this by deploying two noise schedules:

  • A "strong-guidance clock" (tst_s; small noise) within regions requiring strict fidelity—e.g., the object or trajectory specified by a user mask.
  • A "weak-guidance clock" (twt_w; larger noise) elsewhere, encouraging creative or realistic backgrounds without compromising user-driven control.

This spatially heterogeneous conditioning is foundational to training-free plug-and-play video generation frameworks like Time-to-Move (TTM), and to sequence chunking and attention optimization in transformer-based audio denoising.

2. Mathematical Formulation

2.1 Video Diffusion Dual-Clock Mechanism

Let $M \in \{0,1\}^{F \times H \times W}$ designate the binary region mask ("1" for motion-controlled, "0" elsewhere), matched to the latent diffusion grid. Two denoising timesteps are chosen:

  • $t_w$: the weak clock (more noise; outside the mask)
  • $t_s$: the strong clock (less noise; inside the mask), with $t_w > t_s$

Algorithm steps:

  1. Initialization
    • $x_{t_w} \leftarrow q(\cdot \mid V^{w})$, where $V^{w}$ is the user-warped reference video, noised to level $t_w$
  2. Region-dependent update (for $t = t_w$ down to $t_s + 1$)
    • Predict $\hat x_{t-1} = \mu_\theta(x_t, t \mid I)$
    • Overwrite masked pixels:

    $$x_{t-1} = (1 - M) \odot \hat x_{t-1} + M \odot x^{w}_{t-1}$$

    where $x^{w}_{t-1}$ is $V^{w}$ noised to level $t - 1$

  3. Unified denoising (for $t = t_s$ down to $1$)

    • Standard diffusion step: $x_{t-1} \leftarrow \mu_\theta(x_t, t \mid I)$

This yields precise adherence inside $M$ for steps $t_w \rightarrow t_s + 1$, followed by conventional video dynamics everywhere from $t_s$ onward.
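
As a concrete illustration, the loop below sketches this procedure in Python. It is a minimal sketch, not TTM's reference implementation: `denoiser(x, t, cond)` stands in for $\mu_\theta(x_t, t \mid I)$ and `add_noise(v, t)` stands in for the forward process $q(\cdot \mid V^w)$; both interfaces are assumptions for the purpose of the example.

```python
def dual_clock_denoise(denoiser, add_noise, v_warp, cond, mask, t_w, t_s):
    """Dual-clock denoising loop in the style of TTM (a sketch, not the
    reference implementation).

    denoiser(x, t, cond): one reverse step, x_t -> x_{t-1}   (assumed API)
    add_noise(v, t):      forward process, v noised to t     (assumed API)
    v_warp: user-warped reference video V^w, shape (F, C, H, W)
    mask:   region mask M, shape (F, 1, H, W); 1 = motion-controlled
    t_w, t_s: weak (high-noise) and strong (low-noise) clocks, t_w > t_s
    """
    # 1. Initialization: start from the reference noised to the weak clock.
    x = add_noise(v_warp, t_w)

    # 2. Region-dependent updates for t = t_w down to t_s + 1.
    for t in range(t_w, t_s, -1):
        x_hat = denoiser(x, t, cond)           # predicted x_{t-1}
        x_ref = add_noise(v_warp, t - 1)       # reference noised to t - 1
        x = (1 - mask) * x_hat + mask * x_ref  # overwrite masked pixels

    # 3. Unified denoising for t = t_s down to 1: standard diffusion steps.
    for t in range(t_s, 0, -1):
        x = denoiser(x, t, cond)
    return x
```

Because the reference is re-noised to the matching level at every blended step, the masked region tracks $V^{w}$ exactly until $t_s$, after which the model is free to harmonize the full frame.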

2.2 Audio Dual-Clock via Chunked Transformer Processing

DPATD (Dual-Phase Audio Transformer for Denoising) interleaves a fast local clock over short chunks and a slow global clock over chunk sequence embeddings:

  • Local-chunk phase ("fast clock"): parallel self-attention over short, fixed-length audio chunks of size $K$
  • Global-chunk phase ("slow clock"): attention over the sequence of chunk representations ($M \approx L/K$ chunks for an input of $L$ samples)

Pseudo-formulation, for $z = 1, \ldots, Z$:

$$\begin{cases} U_z = \text{LocalChunkTransformer}(T_z) \\ T_{z+1} = \text{GlobalChunkTransformer}(U_z) \end{cases}$$

where both modules apply memory-compressed explainable multi-head attention.
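
The alternation can be made concrete with a dual-path-style block. The PyTorch sketch below uses standard transformer encoder layers as stand-ins for DPATD's memory-compressed explainable attention modules, so the internals are illustrative assumptions; only the chunked fast/slow alternation follows the formulation above.

```python
import torch
import torch.nn as nn

class DualPhaseBlock(nn.Module):
    """One dual-clock audio block: a fast local clock within chunks and a
    slow global clock across chunks. Standard encoder layers stand in for
    DPATD's memory-compressed explainable attention (an assumption)."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.local_tf = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)   # fast clock
        self.global_tf = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)   # slow clock

    def forward(self, x, K):
        # x: (B, L, D) with L divisible by chunk size K; M = L // K chunks.
        B, L, D = x.shape
        M = L // K
        # Fast clock: parallel self-attention inside each length-K chunk.
        u = self.local_tf(x.reshape(B * M, K, D)).reshape(B, M, K, D)
        # Slow clock: self-attention over the M chunk positions.
        u = self.global_tf(u.transpose(1, 2).reshape(B * K, M, D))
        return u.reshape(B, K, M, D).transpose(1, 2).reshape(B, L, D)

# Example: two stacked dual-phase blocks over 4000 embedded samples, K=1000.
x = torch.randn(1, 4000, 64)
for block in [DualPhaseBlock(), DualPhaseBlock()]:
    x = block(x, K=1000)   # shape preserved: (1, 4000, 64)
```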

3. Algorithmic Realization

Table: Dual-Clock Denoising in TTM (Video)

| Step | Input | Operation |
| --- | --- | --- |
| Initialization | Noised reference $x_{t_w}$ | Set state for high-noise regions |
| Region-dependent update | $M$, $\hat x_{t-1}$, $x^{w}_{t-1}$ | Masked blending |
| Unified denoising | Final $x_{t_s}$ to $x_0$ | Standard denoising |

Key considerations:

  • No update of model weights is required
  • The mask $M$ is user-defined (e.g., via cut-and-drag or depth warping)
  • Precomputation amounts to a single noise injection and look-up ($x^{w}_{t}$ for all $t$)
  • No iterative mask loops (in contrast with RePaint-style approaches)

In DPATD (audio), both clocks operate via chunked self-attention and memory compression, reducing attention complexity from $O(L^2)$ to $O(L)$.
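
For the memory-compression half of that claim, one common realization is to downsample keys and values before attention. The sketch below follows the generic memory-compressed attention recipe (strided-convolution compression with factor `c`), which is an assumption for illustration rather than DPATD's exact operator.

```python
import torch
import torch.nn as nn

class MemoryCompressedSelfAttention(nn.Module):
    """Self-attention whose keys/values are downsampled by a strided 1-D
    convolution, shrinking the score matrix from L x L to L x (L // c).
    A generic sketch; the compression factor c is an assumed hyperparameter."""

    def __init__(self, d_model=64, n_heads=4, c=4):
        super().__init__()
        self.compress = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                                       # x: (B, L, D)
        kv = self.compress(x.transpose(1, 2)).transpose(1, 2)   # (B, L//c, D)
        out, _ = self.attn(x, kv, kv)   # queries keep full resolution
        return out

# Example: 4000-sample sequence, attention scores of size 4000 x 1000.
y = MemoryCompressedSelfAttention()(torch.randn(1, 4000, 64))
```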

4. Parameterization and Practical Choices

  • Video timesteps: typical settings are $t_w = 36$, $t_s = 25$ (for a 50-step schedule on SVD) and $t_w = 46$, $t_s = 41$ (CogVideoX).
  • Masking: hard masks ($M \in \{0,1\}$) are the default; soft masks ($M \in [0,1]$) are supported by extending the blending formula (see the sketch after this list).
  • Backbone compatibility: any image-to-video diffusion model with a denoiser $\mu_\theta(x_t, t \mid I)$ (U-Net or transformer).
  • Compute: minimal overhead, one supplemental forward pass per denoising step for the reference state; no additional training.
  • Chunking for DPATD: a chunk size of $K \approx \lceil \sqrt{5L} \rceil$ balances parallelism and memory, with $K = 1000$ empirically optimal for speech.
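
Two of these choices are easy to make concrete. The helpers below are illustrative sketches; the names `blend` and `pick_chunk_size` are assumptions, not identifiers from either paper.

```python
import math

def blend(x_hat, x_ref, mask):
    """Soft-mask extension of the Sec. 2.1 overwrite: a convex blend of the
    model prediction x_hat and the noised reference x_ref. With a hard mask
    in {0,1} this reduces exactly to the masked overwrite."""
    return (1 - mask) * x_hat + mask * x_ref

def pick_chunk_size(L):
    """DPATD-style chunk-size heuristic K ~ ceil(sqrt(5L)) for L samples;
    K = 1000 was reported as empirically optimal for speech."""
    return math.ceil(math.sqrt(5 * L))
```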

5. Empirical Validation and Numerical Benchmarks

Video (TTM dual-clock denoising)

| Benchmark | Metric | TTM | Best Prior Baseline |
| --- | --- | --- | --- |
| MC-Bench (SVD) | CoTracker Distance (CTD) | 7.97 | MotionPro: 8.69 |
| MC-Bench (CogVideoX) | CTD | 13.67 | GWTF: 27.84–32.55 |
| DL3DV | Pixel-MSE | 0.022 | GWTF: 0.033 |
| DL3DV | FID | 21.97 | GWTF: 25.99 |
| DL3DV | Optical-flow MSE | 60.56 | GWTF: 76.71 |

Ablations demonstrate:

  • Single-clock denoising at $t_w$ yields CTD above 27 (object drift)
  • Single-clock denoising at $t_s$ yields static videos (frozen background)
  • RePaint-style masking yields the best CTD ($\to$ 2.95) but poor imaging quality
  • Dual-clock ($t_w = 36$, $t_s = 25$): CTD 7.97, dynamic degree 0.427, imaging quality 0.617, balancing fidelity and realism

Audio (DPATD dual-clock chunking)

| Dataset | Metric | DPATD | Best Prior T-domain Model |
| --- | --- | --- | --- |
| VoiceBank–DEMAND | PESQ | 3.55 | 3.41–3.44 |
| VoiceBank–DEMAND | STOI | 0.97 | 0.96 |
| BirdSoundsDenoising | SDR (dB) | 10.49 | 10.33 |

Empirically:

  • Fast convergence (memory-compressed attention reduces out-of-memory events)
  • Dual-phase blocks and memory-compressed explainable attention incrementally improve performance (best with 12 attention heads and chunk size $K = 1000$)

6. Implications, Limitations, and Extensions

Dual-clock denoising fundamentally realizes spatially or temporally varying conditioning strength, using a mask to distribute guidance across video pixels or audio segments. For video, this enables precise object motion while yielding dynamic, artifact-free backgrounds, circumventing the need for fine-tuning or iterative region refinement. In audio, chunk-based local/global clocks improve the modeling of both fine-scale and long-term dependencies, lowering complexity to $O(L)$ and enabling real-time deployment by streaming frames.

A plausible implication is that dual-clock scheduling may generalize to other generative domains where precise regional control and global dynamics must be harmonized—potentially visual editing, structured prediction, or multimodal reconstruction. It should be noted, however, that success depends on well-chosen mask regions and noise schedules; suboptimal parameterization can degrade either fidelity or realism.

7. Context and Relation to Broader Literature

Time-to-Move (TTM) establishes dual-clock denoising as a training-free, plug-and-play mechanism for motion and appearance control in video diffusion models (Singer et al., 9 Nov 2025), obviating the need for model-specific fine-tuning and outperforming previous training-based methods on key motion benchmarks. DPATD leverages a related dual-clock chunking paradigm for efficient audio denoising, demonstrating that chunkwise fast clocks and global slow clocks facilitate scalable transformer inference and superior denoising quality (Li et al., 2023).

These results consolidate dual-clock denoising as a principled methodology for region-dependent signal restoration, with empirical validation substantiating efficiency and improved controllability across disparate generative modalities.
