Stochastic Clock Attention
- Stochastic Clock Attention is a cross-attention mechanism that uses learned clock processes to generate continuous, monotonic alignments between source and target sequences.
- It replaces standard scaled dot-product attention with a Gaussian kernel based on normalized or unnormalized clock integrals, enforcing near-diagonal and causal mappings.
- SCA supports both parallel and autoregressive decoding regimes, showing robust performance in text-to-speech and other time-synchronous tasks.
Stochastic Clock Attention (SCA) is a cross-attention mechanism designed for sequence-to-sequence modeling where alignment between continuous, ordered sequences is central. Unlike standard scaled dot-product attention (SDPA) that relies on external positional encodings and lacks guarantees for monotonicity or continuity, SCA formulates attention as the meeting probability of two learned, nonnegative "clock" processes, each parameterizing normalized "time" for the source and target. This approach yields an explicit, probabilistic alignment model with inherent inductive biases for causal, smooth, and near-diagonal mappings—key for frame-synchronous tasks such as text-to-speech (TTS). SCA supports both normalized (parallel) and unnormalized (autoregressive) decoding regimes and acts as a nearly parameter-free drop-in replacement for conventional cross-attention modules (Soh et al., 18 Sep 2025).
1. Mathematical Formulation and Path-Integral Derivation
SCA designates the input sequences $x_{1:N}$ and $y_{1:M}$ as source and target, respectively. Both undergo learned feature projections $\eta^s, \eta^t$, and a nonnegative rate function $\phi$ (e.g., Softplus) transforms the projections into rates $g^s_j = \phi(\eta^s_j)$ and $g^t_i = \phi(\eta^t_i)$. In the parallel regime (normalized clocks), cumulative integrals (discrete cumulative sums) of the rates define the normalized clocks

$$\lambda^s(j) = \frac{\sum_{m \le j} g^s_m}{\sum_{m \le N} g^s_m}, \qquad \lambda^t(i) = \frac{\sum_{m \le i} g^t_m}{\sum_{m \le M} g^t_m},$$

with the strictly positive rates providing strictly monotonic time reparameterizations onto $[0,1]$. In the autoregressive regime, unnormalized clocks

$$\Lambda^s(j) = \sum_{m \le j} g^s_m, \qquad \Lambda^t(i) = \sum_{m \le i} g^t_m$$

are used.
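As a concrete sketch (not the reference implementation; shapes, padding handling, and the Softplus rate follow the pseudocode in Section 4), the normalized clock can be computed as follows in PyTorch:

```python
import torch
import torch.nn.functional as F

def normalized_clock(eta: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Illustrative normalized clock.

    eta:  [B, L, d] learned projections of the sequence
    mask: [B, L]    1 for valid frames, 0 for padding
    Returns a per-channel clock of shape [B, L, d] that rises monotonically
    from 0 to 1 over the valid frames.
    """
    edge_mask = (mask[:, :-1] * mask[:, 1:]).unsqueeze(-1)       # [B, L-1, 1]
    # Nonnegative rates on the "edges" between adjacent frames (midpoint rule).
    g = F.softplus((eta[:, :-1] + eta[:, 1:]) / 2) * edge_mask   # [B, L-1, d]
    # Cumulative integral, padded so the clock starts at exactly 0.
    z0 = F.pad(g.cumsum(dim=1), (0, 0, 1, 0))                    # [B, L, d]
    total = g.sum(dim=1, keepdim=True).clamp_min(1e-8)           # [B, 1, d]
    return (z0 / total) * mask.unsqueeze(-1)                     # normalized to [0, 1]

# Toy usage: one sequence of length 5, two clock channels.
lam = normalized_clock(torch.randn(1, 5, 2), torch.ones(1, 5))
```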
Alignment between source position $j$ and target position $i$ is modeled by the meeting probability kernel of the two clocks. Assuming the projections are perturbed by zero-mean Gaussian fields, a perturbative expansion yields a Gaussian kernel in clock space:

$$A_{ij} \propto \exp\!\left( -\frac{\big(\lambda^t(i) - \lambda^s(j)\big)^2}{2\,\Sigma^2_{ij}} \right),$$

where $\Sigma^2_{ij}$ aggregates covariances from both clocks. Under stationarity and the delta method, the variance profile approximates Brownian-bridge behavior, $\Sigma^2 \propto \lambda(1-\lambda)$, vanishing at the sequence endpoints and peaking mid-sequence. For unnormalized clocks, the variance grows linearly (diffusively) with $\Lambda^t(i)$ and $\Lambda^s(j)$.

The attention score simplifies (with the normalization constant absorbed by the row-wise softmax) to

$$S_{ij} = -\frac{\big(\lambda^t(i) - \lambda^s(j)\big)^2}{2\,\Sigma^2_{ij}}.$$
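A minimal sketch of the resulting scoring rule, assuming per-position clock variances as returned by the Clock primitive in Section 4 (the ε, length scaling, and √d factor are taken from that pseudocode; tensor shapes are illustrative):

```python
import torch

def clock_diff_score(lam_q, var_q, lam_k, var_k, len_q, len_k, eps=1e-6):
    """Gaussian clock-space scores (illustrative).

    lam_q: [B, M, d] target clocks    var_q: [B, M] target clock variances
    lam_k: [B, N, d] source clocks    var_k: [B, N] source clock variances
    len_q, len_k: valid lengths used to scale the variance profiles.
    Returns unnormalized scores S of shape [B, M, N].
    """
    d = lam_q.size(-1)
    # Pairwise squared distance in clock space, summed over the d channels.
    dist2 = (lam_q.pow(2).sum(-1, keepdim=True)          # [B, M, 1]
             + lam_k.pow(2).sum(-1).unsqueeze(1)         # [B, 1, N]
             - 2 * lam_q @ lam_k.transpose(1, 2))        # [B, M, N]
    # Combined variance of the two independent clocks at each (i, j) pair.
    sigma2 = var_q.unsqueeze(-1) / len_q + var_k.unsqueeze(1) / len_k
    return -dist2 / (2 * d ** 0.5 * sigma2 + eps)

# Row-wise softmax over the source axis turns scores into attention weights:
# attn = torch.softmax(clock_diff_score(...), dim=-1)
```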
2. Probabilistic Inductive Biases: Continuity, Monotonicity, and Alignment
SCA's nonnegative rate function ensures the clocks $\lambda^s$ and $\lambda^t$ are strictly monotonic, yielding well-ordered alignment trajectories. The quadratic penalty on the clock gap $\lambda^t(i) - \lambda^s(j)$ intrinsically biases attention toward the diagonal, enforcing smooth, continuous alignments. The Brownian-bridge variance profile is lowest at the endpoints and maximal at the center, leading SCA to favor sharper (more certain) alignments at the boundaries and softer, continuous transitions mid-sequence. Causal structure is imposed in the AR regime by masking future keys, ensuring causal attention propagation.
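To make the two variance profiles concrete, the toy snippet below (positions chosen as in the Clock pseudocode of Section 4) contrasts the Brownian-bridge profile with its diffusive counterpart:

```python
import torch

L = 8
idx = torch.arange(L, dtype=torch.float32) + 0.5   # frame-center positions 0.5 .. L-0.5
pos = idx / L                                      # normalized positions in (0, 1)

bridge_var = pos * (1 - pos)   # normalized clocks: ~0 at both ends, max 0.25 mid-sequence
diffusive_var = idx            # unnormalized clocks: grows linearly with position

print(bridge_var)     # sharp alignment at sequence boundaries, softer in the middle
print(diffusive_var)  # uncertainty keeps accumulating in the AR regime
```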
3. Scoring Rule and Contrast with Scaled Dot-Product Attention
Conventional SDPA computes scores as

$$S^{\mathrm{SDPA}}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}}.$$

SCA, in contrast, defines the score by a Gaussian kernel in clock space:

$$S^{\mathrm{SCA}}_{ij} = -\frac{\big(\lambda^t(i) - \lambda^s(j)\big)^2}{2\,\Sigma^2_{ij}}.$$

This construction entirely replaces the need for positional encodings: temporal alignment is achieved via the learned clock integrals, not through hand-crafted features or sinusoids. SCA introduces few additional parameters, integrating naturally as a plug-and-play module in Transformer-style cross-attention layers.
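The difference is easy to see on random tensors: SDPA logits measure content similarity and need positional information injected into $q$ and $k$, whereas SCA logits depend only on the clock gap (a toy sketch with a single head, scalar clocks, and a fixed variance, all assumed for illustration):

```python
import torch

B, M, N, d = 1, 4, 6, 8
q, k = torch.randn(B, M, d), torch.randn(B, N, d)

# SDPA: content-based dot products, scaled by sqrt(d).
sdpa_logits = q @ k.transpose(1, 2) / d ** 0.5                  # [B, M, N]

# SCA: monotone toy clocks; the logit depends only on how far apart they are.
lam_t = torch.rand(B, M, 1).sort(dim=1).values                  # target clock
lam_s = torch.rand(B, N, 1).sort(dim=1).values                  # source clock
sigma2 = 0.05                                                   # fixed toy variance
sca_logits = -(lam_t - lam_s.transpose(1, 2)).pow(2) / (2 * sigma2)
```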
4. Decoding Regimes and Algorithmic Implementation
SCA supports two decoding paradigms:
| Regime | Clock Normalization | Use Case | Notes |
|---|---|---|---|
| Parallel | Normalized | Global length known | Clocks in $[0,1]$; full attention matrix support |
| Autoregressive | Unnormalized | Left-to-right decoding | Causal mask; increment clocks with history |
In parallel mode, the system requires a global length estimate for the target. The attention matrix is computed using normalized clocks and the closed-form score, with softmax applied row-wise.
In the AR regime, SCA operates with unnormalized clocks, incrementally accumulating the target clock up to the current frame, ensuring past-only (causal) dependencies. Future tokens are masked.
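A minimal sketch of the incremental update in the AR regime (a hypothetical helper; the midpoint/Softplus rate mirrors the Clock pseudocode below, and only past projections are touched):

```python
import torch
import torch.nn.functional as F

def ar_clock_step(clock: torch.Tensor, eta_prev: torch.Tensor,
                  eta_new: torch.Tensor) -> torch.Tensor:
    """Extend the unnormalized target clock by one frame.

    clock:    [B, d] running clock value after the previous frame
    eta_prev: [B, d] projection of the previous frame
    eta_new:  [B, d] projection of the newly generated frame
    """
    g_new = F.softplus((eta_prev + eta_new) / 2)   # nonnegative increment
    return clock + g_new                            # the clock only ever grows

# Because the clock is a cumulative sum of past rates, the attention scores for
# the current frame never depend on future target frames.
```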
Pseudocode for the SCA primitives (PyTorch-style) is:
```
function Clock(x: [B×L×d], mask: [B×L], normalize: Bool):
    g   ← φ((x[..., :-1] + x[..., 1:]) / 2) * edge_mask   // nonnegative rates on frame-to-frame edges
    z0  ← cumsum(g, dim=-1, pad=0)                        // running clock, starting at 0
    if normalize:
        z   ← z0 / sum(g, dim=-1, keepdim=true)           // normalized clock in [0, 1]
        pos ← (cumsum(mask, -1) - 0.5) / sum(mask, -1)
        var ← pos * (1 - pos)                             // Brownian-bridge variance profile
    else:
        z   ← z0                                          // unnormalized clock (AR regime)
        pos ← cumsum(mask, -1) - 0.5
        var ← pos                                         // diffusive variance profile
    return (z * mask, var)

function ClockDiffScore(η_q, η_k, q_mask, k_mask, normalize):
    (λ_q, var_q) ← Clock(η_q, q_mask, normalize)
    (λ_k, var_k) ← Clock(η_k, k_mask, normalize)
    Σ2    ← var_q / len_q + (var_k / len_k).T             // combined clock variance
    dist2 ← ‖λ_q‖² + ‖λ_k‖² - 2·λ_q·λ_k^T                 // pairwise squared clock distance
    S     ← -dist2 / (2·sqrt(d)·Σ2 + ε)
    // mask out invalid positions
    return S
```
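As a usage sketch (a hypothetical wrapper, not from the paper), the score matrix `S` returned by `ClockDiffScore` replaces the dot-product logits of a cross-attention layer; a learnable logit scale, as described in Section 5, multiplies the scores before the row-wise softmax:

```python
import torch

def sca_cross_attention(S: torch.Tensor, v: torch.Tensor,
                        logit_scale: torch.Tensor) -> torch.Tensor:
    """S: [B, M, N] clock-difference scores, v: [B, N, d_v] value vectors."""
    attn = torch.softmax(logit_scale * S, dim=-1)   # row-wise softmax over sources
    return attn @ v                                  # [B, M, d_v] attended values
```

The only learned scalar added here is `logit_scale`, consistent with the near parameter-free claim in Section 3.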
5. Practical Implementation: Architecture, Hyperparameters, and Training
SCA was implemented within a 6-layer Transformer encoder and a 4-layer Transformer decoder, both with 4 attention heads and standard feedforward blocks. Learned projection matrices map encoder and decoder states into the clock feature space. A "MaskedTimeNorm" layer applies per-timestep normalization over valid positions for stability. The rate function is Softplus, $\phi(x) = \log(1 + e^x)$.
The squared clock-space difference is divided by $2\sqrt{d}\,\Sigma^2_{ij}$ (plus a small $\epsilon$), and a learnable logit scale (initialized to 1.0) modulates the scoring. Training uses AdamW with batch size 48, optimizing a reconstruction loss on mel-spectrograms. In the parallel regime, the mel-to-phoneme ratio (MPR) is swept from $3.0$ to $10.0$; in AR, MPR is set to $7.0$.
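For reference, the reported settings can be collected into a single configuration sketch (field names are illustrative; values not restated in the text above, such as model width and learning rate, are omitted rather than guessed):

```python
# Illustrative configuration summary; only values stated in the text are filled in.
sca_tts_config = dict(
    encoder_layers=6,
    decoder_layers=4,
    attention_heads=4,
    rate_function="softplus",
    time_norm="MaskedTimeNorm",
    logit_scale_init=1.0,
    optimizer="AdamW",
    batch_size=48,
    target_features="mel-spectrogram",
    mpr_sweep_parallel=(3.0, 10.0),   # mel-to-phoneme ratio range, parallel regime
    mpr_ar=7.0,                        # fixed mel-to-phoneme ratio, AR regime
)
```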
6. Experimental Evaluation: Speech Synthesis Performance and Alignment
On the LJSpeech-1.1 corpus (13,100 utterances, 80-dim mel-spectrograms, 22.05 kHz), inference was conducted using a fixed HiFi-GAN vocoder. Evaluations used both Whisper and wav2vec2-CTC automatic speech recognition.
Parallel decoding (reference setting MPR = 6.0): SDPA's performance degrades when MPR moves outside a narrow operating range (alignment blur, over- or under-generation), while SCA maintains stable WER across the full swept range ($3.0$ to $10.0$), demonstrating robust speed/rate control.
Autoregressive Decoding:
- SDPA yields no coherent alignments under causal masks and teacher forcing.
- SCA (unnormalized): WER = 66.5%; CER = 48.5% on 1,852 evaluated ARCTIC+Harvard sentences.
Visual analysis of attention matrices reveals that SCA produces sharper, near-diagonal, and continuous attention, versus drifting or noisy patterns for SDPA. The mid-sequence "softening" observed matches the theoretical Brownian-bridge variance, suggesting a useful inductive bias against over-rigid alignment.
7. Extensions, Applications, and Limitations
SCA generalizes to multi-scale or hierarchical clocks that couple global and local rate modeling and could inform clock-guided diffusion or flow-matching decoders. Its potential extends to continuous sequence alignment tasks beyond audio, including video frame alignment, motion capture, and expressive music performance generation.
Limitations include:
- The strict monotonic alignment assumption precludes non-monotonic (large reordering) alignments.
- The normalized clock regime requires a known or accurately predicted global target length.
- For discrete text generation, global length control may be semantically risky without explicit external modeling.
SCA thus provides an inductively structured, nearly parameter-free attention mechanism that enforces monotonicity and continuity by construction, offers a closed-form Gaussian scoring rule based on learned clock integrals, and improves robustness and stability of sequence alignment compared to conventional SDPA—especially in tasks requiring precise, time-synchronous mappings (Soh et al., 18 Sep 2025).