Dual-Clock Denoising
- Dual-clock denoising interleaves a strong-guidance clock and a weak-guidance clock to manage region-dependent fidelity in signal restoration.
- In video, it blends precise, mask-controlled object motion with realistic background synthesis; in audio, it pairs chunked local and global transformer passes for efficient denoising.
- Empirical benchmarks show improved CTD, MSE, and FID for video and improved PESQ, STOI, and SDR for audio compared with single-clock baselines.
Dual-clock denoising refers to a structured approach for denoising signals—primarily in video and audio generation—by interleaving two distinct processing schedules ("clocks") with different levels of guidance across spatial or temporal regions. This paradigm enables region-dependent fidelity and flexibility, allowing exact adherence to user instructions in some areas while permitting natural evolution elsewhere. Recent advances demonstrate dual-clock denoising as a training-free motion/appearance control mechanism in video diffusion models (Singer et al., 9 Nov 2025) and as an efficient, explainable transformer operation in audio denoising (Li et al., 2023).
1. Motivation and Conceptual Framework
The impetus for dual-clock denoising arises from limitations of single-clock approaches, notably in diffusion-based editing, where one fixed noise level forces a trade-off: low noise gives strong adherence but risks static or artifact-laden backgrounds, while high noise encourages realism at the expense of precise control in user-targeted regions. Dual-clock denoising resolves this by deploying two noise schedules:
- A "strong-guidance clock" ($t_{\text{strong}}$; small noise) within regions requiring strict fidelity—e.g., the object or trajectory specified by a user mask.
- A "weak-guidance clock" ($t_{\text{weak}}$; larger noise) elsewhere, encouraging creative or realistic backgrounds without compromising user-driven control.
This spatially heterogeneous conditioning is foundational to training-free plug-and-play video generation frameworks like Time-to-Move (TTM), and to sequence chunking and attention optimization in transformer-based audio denoising.
2. Mathematical Formulation
2.1 Video Diffusion Dual-Clock Mechanism
Let $M$ designate the binary region mask ("1" for motion-controlled, "0" elsewhere), matched to the latent diffusion grid. Two denoising timesteps are chosen:
- $t_{\text{weak}}$: the weak clock (more noise; outside mask)
- $t_{\text{strong}}$: the strong clock (less noise; inside mask), with $t_{\text{strong}} < t_{\text{weak}}$
Algorithm steps:
- Initialization
- $x_{t_{\text{weak}}} = \mathrm{AddNoise}(V^{\mathrm{ref}}, t_{\text{weak}})$ ($V^{\mathrm{ref}}$: user-warped reference video, noised to level $t_{\text{weak}}$)
- Region-dependent update (for $t_{\text{weak}} \ge t > t_{\text{strong}}$)
- Predict $x_{t-1} = \mathrm{DenoiseStep}(x_t, t)$
- Overwrite masked pixels: $x_{t-1} \leftarrow M \odot \tilde{x}_{t-1} + (1 - M) \odot x_{t-1}$
- where $\tilde{x}_{t-1}$ is $V^{\mathrm{ref}}$ noised to level $t-1$
- Unified denoising (for $t \le t_{\text{strong}}$)
- Standard diffusion step: $x_{t-1} = \mathrm{DenoiseStep}(x_t, t)$
This yields precise adherence inside $M$ for $t > t_{\text{strong}}$, followed by conventional video dynamics everywhere from $t_{\text{strong}}$ onward.
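The region-dependent loop above can be illustrated with a minimal NumPy sketch. The denoiser and noise schedule here are toy placeholders (`toy_denoise_step`, `add_noise`), not the actual TTM backbone; only the dual-clock control flow—masked overwriting between the two clocks, then unified denoising—mirrors the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoise_step(x, t):
    # Stand-in for one reverse-diffusion step of a pretrained model;
    # here it simply shrinks the state as t decreases.
    return x * (1.0 - 1.0 / (t + 1))

def add_noise(v, t, T=50):
    # Toy forward-noising of a clean reference to level t.
    alpha = 1.0 - t / T
    return np.sqrt(alpha) * v + np.sqrt(1.0 - alpha) * rng.standard_normal(v.shape)

def dual_clock_denoise(v_ref, mask, t_weak=36, t_strong=25):
    # Initialization: noise the warped reference to the weak-clock level.
    x = add_noise(v_ref, t_weak)
    # Region-dependent phase: after each step, overwrite masked pixels
    # with the reference noised to the current level.
    for t in range(t_weak, t_strong, -1):
        x = toy_denoise_step(x, t)
        x_ref_t = add_noise(v_ref, t - 1)
        x = mask * x_ref_t + (1.0 - mask) * x
    # Unified phase: standard denoising everywhere.
    for t in range(t_strong, 0, -1):
        x = toy_denoise_step(x, t)
    return x

v_ref = rng.standard_normal((4, 8, 8))      # frames x H x W toy latent video
mask = np.zeros_like(v_ref)
mask[:, 2:6, 2:6] = 1.0                      # user-specified motion region
out = dual_clock_denoise(v_ref, mask)
print(out.shape)  # (4, 8, 8)
```

Because the masked overwrite uses only precomputed noised references, swapping in a real diffusion model changes `toy_denoise_step` but not the control flow.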
2.2 Audio Dual-Clock via Chunked Transformer Processing
DPATD (Dual-Phase Audio Transformer for Denoising) interleaves a fast local clock over short chunks and a slow global clock over chunk-sequence embeddings:
- Local-chunk phase ("fast clock"): parallel self-attention over short, fixed-length audio chunks of size $C$
- Global-chunk phase ("slow clock"): attention over the sequence of $\lceil N/C \rceil$ chunk representations (for an input of $N$ samples)
Pseudo-formulation: $Y = \mathrm{SlowClock}\big(\mathrm{FastClock}(X)\big)$, where both modules apply memory-compressed explainable multi-head attention.
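A minimal sketch of the fast/slow clock composition follows. The single-head attention without learned projections and the mean-pooled chunk summaries are simplifying assumptions; DPATD's actual blocks use memory-compressed explainable multi-head attention.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the last two axes
    # (learned Q/K/V projections omitted for brevity).
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def dual_clock_audio(x, chunk):
    # Fast clock: attention within each short chunk, in parallel.
    n, d = x.shape
    n_pad = -n % chunk
    x = np.pad(x, ((0, n_pad), (0, 0)))
    chunks = x.reshape(-1, chunk, d)
    local = self_attention(chunks)
    # Slow clock: attention over one embedding per chunk (mean-pooled here).
    summary = local.mean(axis=1)          # (n_chunks, d)
    global_out = self_attention(summary)  # (n_chunks, d)
    # Broadcast the slow-clock context back to every sample in its chunk.
    out = local + global_out[:, None, :]
    return out.reshape(-1, d)[:n]

x = np.random.default_rng(1).standard_normal((1000, 16))
y = dual_clock_audio(x, chunk=45)         # chunk ~ sqrt(2 * 1000)
print(y.shape)  # (1000, 16)
```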
3. Algorithmic Realization
Table: Dual-Clock Denoising in TTM (Video)
| Step | Input | Operation |
|---|---|---|
| Initialization | Noised reference $V^{\mathrm{ref}}$ | Set state for high-noise regions |
| Region-dependent | $x_t$, $\tilde{x}_{t-1}$, $M$ | Masked blending |
| Unified denoising | Final $t_{\text{strong}}$ to $0$ | Standard denoising |
Key considerations:
- No model weights update
- The mask $M$ is user-defined (via cut-and-drag or depth warping)
- Precomputation is a single noise injection and look-up ($\tilde{x}_t$ for all $t > t_{\text{strong}}$)
- No iterative mask loops (contrasts with RePaint-style approaches)
In DPATD (audio), both clocks operate via chunked self-attention and memory compression, reducing per-layer attention complexity from $O(N^2)$ to roughly $O(N\sqrt{N})$ when $C$ scales as $\sqrt{N}$.
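The complexity claim can be checked with rough score-count arithmetic (illustrative only; counts pairwise attention scores, ignoring constants and the memory-compression factor):

```python
import math

def full_attention_cost(n):
    # Full self-attention over n samples scores every pair: n^2.
    return n * n

def chunked_cost(n, c):
    # Fast clock: ceil(n / c) chunks, each with c^2 pairwise scores.
    # Slow clock: attention over the chunk embeddings themselves.
    n_chunks = math.ceil(n / c)
    return n_chunks * c * c + n_chunks * n_chunks

n = 16000                       # e.g. one second of 16 kHz audio
c = round(math.sqrt(2 * n))     # dual-path chunking heuristic
print(full_attention_cost(n))   # 256000000
print(chunked_cost(n, c))       # two orders of magnitude fewer scores
```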
4. Parameterization and Practical Choices
- Video timesteps: Typical settings are $t_{\text{weak}} = 36$, $t_{\text{strong}} = 25$ for the 50-step schedule on SVD; CogVideoX uses an analogous pairing.
- Masking: Hard masks ($M \in \{0,1\}$) are the default; soft masks ($M \in [0,1]$) are supported by extending the blending formula.
- Backbone compatibility: Any image-to-video diffusion model with an accessible denoiser (U-Net or transformer).
- Compute: Minimal overhead; one supplemental reference-state computation per denoising step, no additional training, and negligible runtime cost.
- Chunking for DPATD: A chunk size of $C \approx \sqrt{2N}$ balances parallelism and memory and is reported as empirically optimal for speech.
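The hard- versus soft-mask distinction above reduces to one blending expression; a sketch (the function name `blend` is illustrative, not from the source):

```python
import numpy as np

def blend(x, x_ref_t, mask):
    # Masked overwrite used in the region-dependent phase. With a hard
    # mask (values in {0, 1}) this is exact replacement inside the region;
    # a soft mask (values in [0, 1]) interpolates guidance strength.
    return mask * x_ref_t + (1.0 - mask) * x

x = np.zeros((4, 4))          # current denoised state (toy)
x_ref = np.ones((4, 4))       # noised reference (toy)
hard = np.zeros((4, 4))
hard[1:3, 1:3] = 1.0
soft = hard * 0.5

print(blend(x, x_ref, hard)[1, 1])  # 1.0 -> strict adherence inside mask
print(blend(x, x_ref, soft)[1, 1])  # 0.5 -> partial guidance
```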
5. Empirical Validation and Numerical Benchmarks
Video (TTM dual-clock denoising)
| Benchmark | Metric | TTM | Best Prior Baseline |
|---|---|---|---|
| MC-Bench (SVD) | CoTracker Distance (CTD) | 7.97 | MotionPro: 8.69 |
| MC-Bench (CogVideoX) | CTD | 13.67 | GWTF: 27.84–32.55 |
| DL3DV | Pixel-MSE | 0.022 | GWTF: 0.033 |
| DL3DV | FID | 21.97 | GWTF: 25.99 |
| DL3DV | Optical-flow MSE | 60.56 | GWTF: 76.71 |
Ablations demonstrate:
- Single-clock at $t_{\text{weak}}$ yields CTD $\approx 27$ (object drift)
- Single-clock at $t_{\text{strong}}$ yields static videos (frozen background)
- RePaint-style masking yields the best CTD (2.95) but poor imaging quality
- Dual-clock ($t_{\text{weak}} = 36$, $t_{\text{strong}} = 25$): CTD 7.97, dynamic degree 0.427, imaging quality 0.617, balancing fidelity and realism
Audio (DPATD dual-clock chunking)
| Dataset | Metric | DPATD | Best Prior T-domain Model |
|---|---|---|---|
| VoiceBank–DEMAND | PESQ | 3.55 | 3.41–3.44 |
| VoiceBank–DEMAND | STOI | 0.97 | 0.96 |
| BirdSoundsDenoising | SDR (dB) | 10.49 | 10.33 |
Empirically:
- Fast convergence (memory-compressed attention reduces out-of-memory events)
- Dual-phase blocks and memory-compressed explainable attention incrementally improve performance (best with 12 attention heads and the chunk-size setting above)
6. Implications, Limitations, and Extensions
Dual-clock denoising fundamentally realizes spatially or temporally varying conditioning strength, using a mask to distribute guidance in video pixels or audio segments. For video, this enables precise object motion while yielding dynamic, artifact-free backgrounds, circumventing the need for fine-tuning or iterative region refinement. In audio, chunk-based local/global clocks yield improved modeling of both fine-scale and long-term dependencies, lowering complexity to roughly $O(N\sqrt{N})$ and enabling real-time deployment via frame streaming.
A plausible implication is that dual-clock scheduling may generalize to other generative domains where precise regional control and global dynamics must be harmonized—potentially visual editing, structured prediction, or multimodal reconstruction. It should be noted, however, that success depends on well-chosen mask regions and noise schedules; suboptimal parameterization can degrade either fidelity or realism.
7. Context and Relation to Broader Literature
Time-to-Move (TTM) establishes dual-clock denoising as a training-free, plug-and-play mechanism for motion and appearance control in video diffusion models (Singer et al., 9 Nov 2025), superseding the need for model-specific fine-tuning and outperforming previous training-based methods in key motion benchmarks. DPATD leverages a related dual-clock chunking paradigm for efficient audio denoising, demonstrating that chunkwise fast clocks and global slow clocks facilitate scalable transformer inference and superior denoising quality (Li et al., 2023).
These results consolidate dual-clock denoising as a principled methodology for region-dependent signal restoration, with empirical validation substantiating efficiency and improved controllability across disparate generative modalities.