
Stochastic Clock Attention

Updated 22 December 2025
  • Stochastic Clock Attention is a cross-attention mechanism that uses learned clock processes to generate continuous, monotonic alignments between source and target sequences.
  • It replaces standard scaled dot-product attention with a Gaussian kernel based on normalized or unnormalized clock integrals, enforcing near-diagonal and causal mappings.
  • SCA supports both parallel and autoregressive decoding regimes, showing robust performance in text-to-speech and other time-synchronous tasks.

Stochastic Clock Attention (SCA) is a cross-attention mechanism designed for sequence-to-sequence modeling where alignment between continuous, ordered sequences is central. Unlike standard scaled dot-product attention (SDPA) that relies on external positional encodings and lacks guarantees for monotonicity or continuity, SCA formulates attention as the meeting probability of two learned, nonnegative "clock" processes, each parameterizing normalized "time" for the source and target. This approach yields an explicit, probabilistic alignment model with inherent inductive biases for causal, smooth, and near-diagonal mappings—key for frame-synchronous tasks such as text-to-speech (TTS). SCA supports both normalized (parallel) and unnormalized (autoregressive) decoding regimes and acts as a nearly parameter-free drop-in replacement for conventional cross-attention modules (Soh et al., 18 Sep 2025).

1. Mathematical Formulation and Path-Integral Derivation

SCA designates the input sequences $X_s$ ($s \in [0, S]$) and $Y_t$ ($t \in [0, T]$) as source and target, respectively. Both undergo learned feature projections:

$$\eta^X_s = \mathcal{F}(X_s) \in \mathbb{R}^d, \qquad \eta^Y_t = \mathcal{G}(Y_t) \in \mathbb{R}^d$$

A nonnegative rate function $\phi: \mathbb{R}^d \to \mathbb{R}_{>0}$ (e.g., Softplus, $e^x$) transforms the projections. In the parallel regime (normalized clocks), cumulative integrals define the normalized clocks:

$$\lambda^X_s = \frac{\int_0^s \phi(\eta^X_u)\, du}{\int_0^S \phi(\eta^X_u)\, du}, \qquad \lambda^Y_t = \frac{\int_0^t \phi(\eta^Y_v)\, dv}{\int_0^T \phi(\eta^Y_v)\, dv}$$

with $\lambda^X_s, \lambda^Y_t \in [0, 1]$ providing strictly monotonic time reparameterizations. In the autoregressive regime, the unnormalized clocks

$$\tilde{\lambda}^X_s = \int_0^s \phi(\eta^X_u)\, du, \qquad \tilde{\lambda}^Y_t = \int_0^t \phi(\eta^Y_v)\, dv$$

are used.

Alignment between source and target is modeled by the meeting probability kernel:

$$K_{\text{meet}}(s, t) = \mathbb{E}\left[\delta\left(\lambda^X_s - \lambda^Y_t\right)\right]$$

Assuming the projections are perturbed by zero-mean Gaussian fields, a perturbative expansion yields a Gaussian kernel in clock space:

$$K_{\text{meet}}(s, t \mid \eta^x, \eta^y) = \frac{1}{\sqrt{2\pi \Sigma_{s,t}^2}} \exp\left(-\frac{\Delta_{s,t}^2}{2\,\Sigma_{s,t}^2}\right)$$

where $\Delta_{s,t} = \lambda^x_s - \lambda^y_t$ and $\Sigma_{s,t}^2$ aggregates covariances from both clocks. Under stationarity and the delta method, the variance profile approximates Brownian-bridge behavior:

$$\Sigma^2_{s,t} \approx \frac{K_X}{\mu_X^2}\left[\frac{s}{S}\left(1 - \frac{s}{S}\right) + \frac{t}{T}\left(1 - \frac{t}{T}\right)\right]$$

For unnormalized clocks, the variance grows linearly (diffusively) with $s$ and $t$.

The attention score simplifies (with $C$ absorbed by the row-wise softmax):

$$\mathrm{Score}(s, t) = -\frac{\left(\lambda^x_s - \lambda^y_t\right)^2}{2\,\Sigma_{s,t}^2} + C$$
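As a quick sanity check (immediate from the normalized-clock definition, though not spelled out in the source), a constant rate recovers the purely linear alignment:

$$\phi \equiv c \;\Longrightarrow\; \lambda^X_s = \frac{c\,s}{c\,S} = \frac{s}{S}, \qquad \lambda^Y_t = \frac{t}{T}, \qquad \mathrm{Score}(s, t) \propto -\left(\frac{s}{S} - \frac{t}{T}\right)^2,$$

so any learned deviation from a constant rate is what warps the alignment away from the exact diagonal.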

2. Probabilistic Inductive Biases: Continuity, Monotonicity, and Alignment

SCA's rate function $\phi \geq \varepsilon > 0$ ensures the clocks $\lambda^X_s$ and $\lambda^Y_t$ are strictly monotonic, providing aligned trajectories. The quadratic penalty on $(\lambda^x_s - \lambda^y_t)^2$ intrinsically biases attention toward the diagonal, enforcing smooth, continuous alignments. The Brownian-bridge variance profile $\Sigma_{s,t}^2$ is lowest at the endpoints and maximal at the center, leading SCA to favor sharper (more certain) alignments at the boundaries and softer, continuous transitions in the mid-sequence. Causal structure is imposed in the AR regime by masking keys with $t > t(s)$, ensuring causal attention propagation.
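A quick numerical reading of the Brownian-bridge profile (with the prefactor $K_X/\mu_X^2$ kept symbolic; the specific points below are chosen only for illustration) makes this contrast concrete:

$$\Sigma^2\big|_{s/S = t/T = 0.5} \approx \frac{K_X}{\mu_X^2}\,(0.25 + 0.25) = 0.5\,\frac{K_X}{\mu_X^2}, \qquad \Sigma^2\big|_{s/S = t/T = 0.1} \approx \frac{K_X}{\mu_X^2}\,(0.09 + 0.09) = 0.18\,\frac{K_X}{\mu_X^2},$$

so the kernel variance near the boundaries is roughly a third of its mid-sequence value, i.e., the alignment is pinned most tightly at the start and end of the utterance.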

3. Scoring Rule and Contrast with Scaled Dot-Product Attention

Conventional SDPA computes scores as

$$\text{Score}_{\text{SDPA}}(s, t) = \frac{(\eta^X_s)^\top \eta^Y_t}{\sqrt{d}}$$

SCA, in contrast, defines the score by a Gaussian kernel in clock space:

$$\text{Score}_{\text{SCA}}(s, t) = -\frac{\|\lambda^x_s - \lambda^y_t\|_2^2}{2\,\Sigma_{s,t}^2}$$

This construction entirely replaces the need for positional encodings: temporal alignment is achieved via the learned clock integrals, not through hand-crafted features or sinusoids. SCA introduces few additional parameters, integrating naturally as a plug-and-play module in Transformer-style cross-attention layers.
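A minimal NumPy sketch of the two scoring rules, using toy scalar rates in place of $\phi(\eta)$ (the shapes, the exponential rate, and the unit variance prefactor are assumptions made for this illustration, not the paper's implementation):

import numpy as np

rng = np.random.default_rng(0)
S, T, d = 6, 9, 16

# --- SDPA: content dot product; needs positional encodings to encode order ---
eta_x, eta_y = rng.normal(size=(S, d)), rng.normal(size=(T, d))
score_sdpa = eta_x @ eta_y.T / np.sqrt(d)                     # [S, T]

# --- SCA: normalized clocks from positive rates; order is built in ---
rate_x, rate_y = np.exp(rng.normal(size=S)), np.exp(rng.normal(size=T))
lam_x = np.cumsum(rate_x) / rate_x.sum()                      # lambda^X in (0, 1]
lam_y = np.cumsum(rate_y) / rate_y.sum()                      # lambda^Y in (0, 1]
pos_x, pos_y = (np.arange(S) + 0.5) / S, (np.arange(T) + 0.5) / T
sigma2 = pos_x[:, None] * (1 - pos_x[:, None]) + pos_y[None, :] * (1 - pos_y[None, :])
score_sca = -(lam_x[:, None] - lam_y[None, :]) ** 2 / (2 * sigma2 + 1e-6)   # [S, T]

The SDPA matrix reflects only feature similarity, while the SCA matrix is near-diagonal by construction, regardless of the feature content.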

4. Decoding Regimes and Algorithmic Implementation

SCA supports two decoding paradigms:

Regime           Clock normalization   Use case                      Notes
Parallel         Normalized            Global target length known    Clocks in $[0, 1]$; full attention matrix computed at once
Autoregressive   Unnormalized          Left-to-right decoding        Causal mask; clocks accumulated with decoding history

In parallel mode, the system requires a global length estimate for the target. The attention matrix is computed using normalized clocks and the closed-form score, with softmax applied row-wise.

In the AR regime, SCA operates with unnormalized clocks, incrementally accumulating $\phi(\eta)$ up to the current frame, ensuring past-only (causal) dependencies. Future tokens are masked.
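A deliberately simplified decode-time sketch of that accumulation (plain Python with assumed scalar rates and a rough diffusive-variance proxy; the real decoder derives rates from $\phi(\mathcal{G}(Y_t))$ and uses the full scoring rule above):

import numpy as np

def ar_step(clock_prev, rate_t):
    # tilde-lambda^Y_t = tilde-lambda^Y_{t-1} + phi(eta^Y_t): monotone by construction
    return clock_prev + rate_t

src_rates = np.array([0.9, 1.1, 1.0, 1.2])
src_clock = np.cumsum(src_rates)                          # unnormalized source clock
src_var = np.arange(1, len(src_rates) + 1, dtype=float)   # diffusive proxy: grows with position

tgt_clock, eps = 0.0, 1e-6
for t, rate_t in enumerate([1.0, 0.8, 1.3]):              # rates of frames decoded so far
    tgt_clock = ar_step(tgt_clock, rate_t)
    sigma2 = src_var + (t + 1.0)                          # variances of both clocks add
    logits = -(src_clock - tgt_clock) ** 2 / (2 * sigma2 + eps)
    attn = np.exp(logits - logits.max()); attn /= attn.sum()   # softmax over source frames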

A cleaned-up, runnable rendering of the SCA primitives follows (PyTorch-style; the scalar per-frame rate features, the default Softplus rate, and the padding/masking details are assumptions made to turn the paper's pseudocode into working code):

import torch
import torch.nn.functional as F

def clock(x, mask, normalize, phi=F.softplus, eps=1e-6):
    # x: [B, L, 1] per-frame rate features (assumed already projected to a scalar),
    # mask: [B, L] with 1.0 for valid frames. phi keeps rates strictly positive.
    g = phi((x[:, :-1] + x[:, 1:]) / 2).squeeze(-1)        # [B, L-1] midpoint rates on frame edges
    g = g * (mask[:, :-1] * mask[:, 1:])                   # zero out edges that touch padding
    z0 = F.pad(torch.cumsum(g, dim=-1), (1, 0))            # [B, L] cumulative clock, starts at 0
    if normalize:
        z = z0 / (g.sum(dim=-1, keepdim=True) + eps)       # normalized clock in [0, 1]
        pos = (torch.cumsum(mask, -1) - 0.5) / mask.sum(-1, keepdim=True)
        var = pos * (1 - pos)                              # Brownian-bridge variance profile
    else:
        z = z0                                             # unnormalized (autoregressive) clock
        pos = torch.cumsum(mask, -1) - 0.5
        var = pos                                          # diffusive variance profile
    return z * mask, var

def clock_diff_score(eta_q, eta_k, q_mask, k_mask, normalize, d, eps=1e-6):
    lam_q, var_q = clock(eta_q, q_mask, normalize)         # [B, Lq]
    lam_k, var_k = clock(eta_k, k_mask, normalize)         # [B, Lk]
    len_q = q_mask.sum(-1, keepdim=True)                   # [B, 1]
    len_k = k_mask.sum(-1, keepdim=True)
    sigma2 = (var_q / len_q).unsqueeze(-1) + (var_k / len_k).unsqueeze(1)   # [B, Lq, Lk]
    dist2 = (lam_q.unsqueeze(-1) - lam_k.unsqueeze(1)) ** 2                 # [B, Lq, Lk]
    scores = -dist2 / (2 * (d ** 0.5) * sigma2 + eps)
    # Mask out padded key positions before the row-wise softmax.
    return scores.masked_fill(k_mask.unsqueeze(1) == 0, float("-inf"))
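Continuing the sketch above, a minimal (assumed) call pattern inside a cross-attention layer would be:

B, Lq, Lk, d = 2, 12, 7, 256
eta_q, eta_k = torch.randn(B, Lq, 1), torch.randn(B, Lk, 1)   # scalar rate features (assumed shape)
q_mask, k_mask = torch.ones(B, Lq), torch.ones(B, Lk)
scores = clock_diff_score(eta_q, eta_k, q_mask, k_mask, normalize=True, d=d)
attn = torch.softmax(scores, dim=-1)                          # row-wise softmax over source keys
# context = attn @ V would then follow exactly as in standard cross-attention.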

5. Practical Implementation: Architecture, Hyperparameters, and Training

SCA was implemented within a 6-layer Transformer encoder and a 4-layer Transformer decoder, both operating at $d_{\text{model}} = 256$ with 4 heads and standard feedforward blocks. Projection matrices $W_q, W_k \in \mathbb{R}^{256 \times d}$ and $W_v \in \mathbb{R}^{256 \times d_v}$ are used. "MaskedTimeNorm" applies per-timestep normalization for stability, adding $\varepsilon = 10^{-5}$. The rate function is $\phi(x) = \frac{1}{2}\left[1 + x(1 + x + |x|)/(1 + |x|)\right] + \varepsilon$ with $\varepsilon = 10^{-3}$.
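The stated rate function is easy to check numerically; a small NumPy sketch (with the $\varepsilon = 10^{-3}$ floor from above) shows it behaves like a smooth, strictly positive rectifier:

import numpy as np

def rate_fn(x, eps=1e-3):
    # phi(x) = 0.5 * [1 + x * (1 + x + |x|) / (1 + |x|)] + eps
    return 0.5 * (1 + x * (1 + x + np.abs(x)) / (1 + np.abs(x))) + eps

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(rate_fn(x))   # approx [0.05, 0.25, 0.50, 1.25, 10.05]: strictly positive, ~linear for large x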

The squared difference in clock space is divided by $\sqrt{d}$, and a learnable logit scale (initialized to 1.0) modulates the scoring. Training uses AdamW ($\text{lr} = 10^{-4}$, weight decay $10^{-2}$, batch size 48) for 2,000 epochs, optimizing an $L_1$ loss on mel-spectrograms. In the parallel regime, the mel-to-phoneme ratio (MPR) is swept from 3.0 to 10.0; in the AR regime, the maximum MPR is set to 7.0.

6. Experimental Evaluation: Speech Synthesis Performance and Alignment

On the LJSpeech-1.1 corpus (13,100 utterances, 80-dim mel-spectrograms, 22.05 kHz), inference was conducted using a fixed HiFi-GAN vocoder. Evaluations used both Whisper and wav2vec2-CTC automatic speech recognition.

Parallel Decoding at MPR = 6.0:

  • SDPA: WER = 7.39% ± 0.22; CER = 3.94% ± 0.14
  • SCA (normalized): WER = 7.03% ± 0.20; CER = 3.66% ± 0.12

SDPA's performance degrades for MPR outside $[3.0, 10.0]$ (alignment blur, over- or under-generation), while SCA maintains WER below 10% over the full tested range, demonstrating robust speed/rate control.

Autoregressive Decoding:

  • SDPA yields no coherent alignments (WER and CER approach 100%) under causal masks and teacher forcing.
  • SCA (unnormalized): WER = 66.5%; CER = 48.5% on 1,852 evaluated ARCTIC+Harvard sentences.

Visual analysis of attention matrices reveals that SCA produces sharper, near-diagonal, and continuous attention, versus drifting or noisy patterns for SDPA. The mid-sequence "softening" observed matches the theoretical Brownian-bridge variance, suggesting a useful inductive bias against over-rigid alignment.

7. Extensions, Applications, and Limitations

SCA generalizes to multi-scale or hierarchical clocks that couple global and local rate modeling and could inform clock-guided diffusion or flow-matching decoders. Its potential extends to continuous sequence alignment tasks beyond audio, including video frame alignment, motion capture, and expressive music performance generation.

Limitations include:

  • The strict monotonic alignment assumption precludes non-monotonic (large reordering) alignments.
  • The normalized clock regime requires a known or accurately predicted global target length.
  • For discrete text generation, global length control may be semantically risky without explicit external modeling.

SCA thus provides an inductively structured, nearly parameter-free attention mechanism that enforces monotonicity and continuity by construction, offers a closed-form Gaussian scoring rule based on learned clock integrals, and improves robustness and stability of sequence alignment compared to conventional SDPA—especially in tasks requiring precise, time-synchronous mappings (Soh et al., 18 Sep 2025).

References

  • Soh et al., "Stochastic Clock Attention," 18 Sep 2025.
