Stochastic Clock Attention
- Stochastic Clock Attention is a cross-attention mechanism that uses learned clock processes to generate continuous, monotonic alignments between source and target sequences.
- It replaces standard scaled dot-product attention with a Gaussian kernel based on normalized or unnormalized clock integrals, enforcing near-diagonal and causal mappings.
- SCA supports both parallel and autoregressive decoding regimes, showing robust performance in text-to-speech and other time-synchronous tasks.
Stochastic Clock Attention (SCA) is a cross-attention mechanism designed for sequence-to-sequence modeling where alignment between continuous, ordered sequences is central. Unlike standard scaled dot-product attention (SDPA) that relies on external positional encodings and lacks guarantees for monotonicity or continuity, SCA formulates attention as the meeting probability of two learned, nonnegative "clock" processes, each parameterizing normalized "time" for the source and target. This approach yields an explicit, probabilistic alignment model with inherent inductive biases for causal, smooth, and near-diagonal mappings—key for frame-synchronous tasks such as text-to-speech (TTS). SCA supports both normalized (parallel) and unnormalized (autoregressive) decoding regimes and acts as a nearly parameter-free drop-in replacement for conventional cross-attention modules (Soh et al., 18 Sep 2025).
1. Mathematical Formulation and Path-Integral Derivation
SCA designates the input sequences $x_{1:N}$ and $y_{1:M}$ as source and target, respectively. Both undergo learned feature projections $\eta^s, \eta^t$, and a nonnegative rate function $\phi$ (e.g., Softplus) transforms the projections into rates $g^s_j = \phi(\eta^s_j)$ and $g^t_i = \phi(\eta^t_i)$. In the parallel regime (normalized clocks), cumulative integrals (discrete cumulative sums) of the rates define the normalized clocks

$$\lambda^s(j) = \frac{\sum_{m \le j} g^s_m}{\sum_{m \le N} g^s_m}, \qquad \lambda^t(i) = \frac{\sum_{m \le i} g^t_m}{\sum_{m \le M} g^t_m},$$

with the strictly positive rates providing strictly monotonic time reparameterizations onto $[0,1]$. In the autoregressive regime, unnormalized clocks

$$\Lambda^s(j) = \sum_{m \le j} g^s_m, \qquad \Lambda^t(i) = \sum_{m \le i} g^t_m$$

are used.
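As a concrete sketch (not the reference implementation; shapes, padding handling, and the Softplus rate follow the pseudocode in Section 4), the normalized clock can be computed as follows in PyTorch:

```python
import torch
import torch.nn.functional as F

def normalized_clock(eta: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Illustrative normalized clock.

    eta:  [B, L, d] learned projections of the sequence
    mask: [B, L]    1 for valid frames, 0 for padding
    Returns a per-channel clock of shape [B, L, d] that rises monotonically
    from 0 to 1 over the valid frames.
    """
    edge_mask = (mask[:, :-1] * mask[:, 1:]).unsqueeze(-1)       # [B, L-1, 1]
    # Nonnegative rates on the "edges" between adjacent frames (midpoint rule).
    g = F.softplus((eta[:, :-1] + eta[:, 1:]) / 2) * edge_mask   # [B, L-1, d]
    # Cumulative integral, padded so the clock starts at exactly 0.
    z0 = F.pad(g.cumsum(dim=1), (0, 0, 1, 0))                    # [B, L, d]
    total = g.sum(dim=1, keepdim=True).clamp_min(1e-8)           # [B, 1, d]
    return (z0 / total) * mask.unsqueeze(-1)                     # normalized to [0, 1]

# Toy usage: one sequence of length 5, two clock channels.
lam = normalized_clock(torch.randn(1, 5, 2), torch.ones(1, 5))
```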
Alignment between source position $j$ and target position $i$ is modeled by the meeting probability kernel of the two clocks. Assuming the projections are perturbed by zero-mean Gaussian fields, a perturbative expansion yields a Gaussian kernel in clock space:

$$A_{ij} \propto \exp\!\left( -\frac{\big(\lambda^t(i) - \lambda^s(j)\big)^2}{2\,\Sigma^2_{ij}} \right),$$

where $\Sigma^2_{ij}$ aggregates covariances from both clocks. Under stationarity and the delta method, the variance profile approximates Brownian-bridge behavior, $\Sigma^2 \propto \lambda(1-\lambda)$, vanishing at the sequence endpoints and peaking mid-sequence. For unnormalized clocks, the variance grows linearly (diffusively) with $\Lambda^t(i)$ and $\Lambda^s(j)$.

The attention score simplifies (with the normalization constant absorbed by the row-wise softmax) to

$$S_{ij} = -\frac{\big(\lambda^t(i) - \lambda^s(j)\big)^2}{2\,\Sigma^2_{ij}}.$$
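A minimal sketch of the resulting scoring rule, assuming per-position clock variances as returned by the Clock primitive in Section 4 (the ε, length scaling, and √d factor are taken from that pseudocode; tensor shapes are illustrative):

```python
import torch

def clock_diff_score(lam_q, var_q, lam_k, var_k, len_q, len_k, eps=1e-6):
    """Gaussian clock-space scores (illustrative).

    lam_q: [B, M, d] target clocks    var_q: [B, M] target clock variances
    lam_k: [B, N, d] source clocks    var_k: [B, N] source clock variances
    len_q, len_k: valid lengths used to scale the variance profiles.
    Returns unnormalized scores S of shape [B, M, N].
    """
    d = lam_q.size(-1)
    # Pairwise squared distance in clock space, summed over the d channels.
    dist2 = (lam_q.pow(2).sum(-1, keepdim=True)          # [B, M, 1]
             + lam_k.pow(2).sum(-1).unsqueeze(1)         # [B, 1, N]
             - 2 * lam_q @ lam_k.transpose(1, 2))        # [B, M, N]
    # Combined variance of the two independent clocks at each (i, j) pair.
    sigma2 = var_q.unsqueeze(-1) / len_q + var_k.unsqueeze(1) / len_k
    return -dist2 / (2 * d ** 0.5 * sigma2 + eps)

# Row-wise softmax over the source axis turns scores into attention weights:
# attn = torch.softmax(clock_diff_score(...), dim=-1)
```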
2. Probabilistic Inductive Biases: Continuity, Monotonicity, and Alignment
SCA's nonnegative rate function ensures the clocks $\lambda^s$ and $\lambda^t$ are strictly monotonic, yielding well-ordered alignment trajectories. The quadratic penalty on the clock gap $\lambda^t(i) - \lambda^s(j)$ intrinsically biases attention toward the diagonal, enforcing smooth, continuous alignments. The Brownian-bridge variance profile is lowest at the endpoints and maximal at the center, leading SCA to favor sharper (more certain) alignments at the boundaries and softer, continuous transitions mid-sequence. Causal structure is imposed in the AR regime by masking future keys, ensuring causal attention propagation.
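To make the two variance profiles concrete, the toy snippet below (positions chosen as in the Clock pseudocode of Section 4) contrasts the Brownian-bridge profile with its diffusive counterpart:

```python
import torch

L = 8
idx = torch.arange(L, dtype=torch.float32) + 0.5   # frame-center positions 0.5 .. L-0.5
pos = idx / L                                      # normalized positions in (0, 1)

bridge_var = pos * (1 - pos)   # normalized clocks: ~0 at both ends, max 0.25 mid-sequence
diffusive_var = idx            # unnormalized clocks: grows linearly with position

print(bridge_var)     # sharp alignment at sequence boundaries, softer in the middle
print(diffusive_var)  # uncertainty keeps accumulating in the AR regime
```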
3. Scoring Rule and Contrast with Scaled Dot-Product Attention
Conventional SDPA computes scores as

$$S^{\mathrm{SDPA}}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}}.$$

SCA, in contrast, defines the score by a Gaussian kernel in clock space:

$$S^{\mathrm{SCA}}_{ij} = -\frac{\big(\lambda^t(i) - \lambda^s(j)\big)^2}{2\,\Sigma^2_{ij}}.$$

This construction entirely replaces the need for positional encodings: temporal alignment is achieved via the learned clock integrals, not through hand-crafted features or sinusoids. SCA introduces few additional parameters, integrating naturally as a plug-and-play module in Transformer-style cross-attention layers.
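The difference is easy to see on random tensors: SDPA logits measure content similarity and need positional information injected into $q$ and $k$, whereas SCA logits depend only on the clock gap (a toy sketch with a single head, scalar clocks, and a fixed variance, all assumed for illustration):

```python
import torch

B, M, N, d = 1, 4, 6, 8
q, k = torch.randn(B, M, d), torch.randn(B, N, d)

# SDPA: content-based dot products, scaled by sqrt(d).
sdpa_logits = q @ k.transpose(1, 2) / d ** 0.5                  # [B, M, N]

# SCA: monotone toy clocks; the logit depends only on how far apart they are.
lam_t = torch.rand(B, M, 1).sort(dim=1).values                  # target clock
lam_s = torch.rand(B, N, 1).sort(dim=1).values                  # source clock
sigma2 = 0.05                                                   # fixed toy variance
sca_logits = -(lam_t - lam_s.transpose(1, 2)).pow(2) / (2 * sigma2)
```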
4. Decoding Regimes and Algorithmic Implementation
SCA supports two decoding paradigms:
| Regime | Clock Normalization | Use Case | Notes |
|---|---|---|---|
| Parallel | Normalized | Global length known | Clocks in $[0,1]$; full attention matrix support |
| Autoregressive | Unnormalized | Left-to-right decoding | Causal mask; increment clocks with history |
In parallel mode, the system requires a global length estimate for the target. The attention matrix is computed using normalized clocks and the closed-form score, with softmax applied row-wise.
In the AR regime, SCA operates with unnormalized clocks, incrementally accumulating the target clock up to the current frame, ensuring past-only (causal) dependencies. Future tokens are masked.
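A minimal sketch of the incremental update in the AR regime (a hypothetical helper; the midpoint/Softplus rate mirrors the Clock pseudocode below, and only past projections are touched):

```python
import torch
import torch.nn.functional as F

def ar_clock_step(clock: torch.Tensor, eta_prev: torch.Tensor,
                  eta_new: torch.Tensor) -> torch.Tensor:
    """Extend the unnormalized target clock by one frame.

    clock:    [B, d] running clock value after the previous frame
    eta_prev: [B, d] projection of the previous frame
    eta_new:  [B, d] projection of the newly generated frame
    """
    g_new = F.softplus((eta_prev + eta_new) / 2)   # nonnegative increment
    return clock + g_new                            # the clock only ever grows

# Because the clock is a cumulative sum of past rates, the attention scores for
# the current frame never depend on future target frames.
```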
Pseudocode for the SCA primitives (PyTorch-style) is:
```
function Clock(x: [B×L×d], mask: [B×L], normalize: Bool):
    g   ← φ((x[..., :-1] + x[..., 1:]) / 2) * edge_mask   // nonnegative rates on frame-to-frame edges
    z0  ← cumsum(g, dim=-1, pad=0)                        // running clock, starting at 0
    if normalize:
        z   ← z0 / sum(g, dim=-1, keepdim=true)           // normalized clock in [0, 1]
        pos ← (cumsum(mask, -1) - 0.5) / sum(mask, -1)
        var ← pos * (1 - pos)                             // Brownian-bridge variance profile
    else:
        z   ← z0                                          // unnormalized clock (AR regime)
        pos ← cumsum(mask, -1) - 0.5
        var ← pos                                         // diffusive variance profile
    return (z * mask, var)

function ClockDiffScore(η_q, η_k, q_mask, k_mask, normalize):
    (λ_q, var_q) ← Clock(η_q, q_mask, normalize)
    (λ_k, var_k) ← Clock(η_k, k_mask, normalize)
    Σ2    ← var_q / len_q + (var_k / len_k).T             // combined clock variance
    dist2 ← ‖λ_q‖² + ‖λ_k‖² - 2·λ_q·λ_k^T                 // pairwise squared clock distance
    S     ← -dist2 / (2·sqrt(d)·Σ2 + ε)
    // mask out invalid positions
    return S
```
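As a usage sketch (a hypothetical wrapper, not from the paper), the score matrix `S` returned by `ClockDiffScore` replaces the dot-product logits of a cross-attention layer; a learnable logit scale, as described in Section 5, multiplies the scores before the row-wise softmax:

```python
import torch

def sca_cross_attention(S: torch.Tensor, v: torch.Tensor,
                        logit_scale: torch.Tensor) -> torch.Tensor:
    """S: [B, M, N] clock-difference scores, v: [B, N, d_v] value vectors."""
    attn = torch.softmax(logit_scale * S, dim=-1)   # row-wise softmax over sources
    return attn @ v                                  # [B, M, d_v] attended values
```

The only learned scalar added here is `logit_scale`, consistent with the near parameter-free claim in Section 3.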
5. Practical Implementation: Architecture, Hyperparameters, and Training
SCA was implemented within a 6-layer Transformer encoder and a 4-layer Transformer decoder, both with 4 attention heads and standard feedforward blocks. Learned projection matrices map encoder and decoder states into the clock feature space. A "MaskedTimeNorm" layer applies per-timestep normalization over valid positions for stability. The rate function is Softplus, $\phi(x) = \log(1 + e^x)$.
The squared clock-space difference is divided by $2\sqrt{d}\,\Sigma^2_{ij}$ (plus a small $\epsilon$), and a learnable logit scale (initialized to 1.0) modulates the scoring. Training uses AdamW with batch size 48, optimizing a reconstruction loss on mel-spectrograms. In the parallel regime, the mel-to-phoneme ratio (MPR) is swept from $3.0$ to $10.0$; in AR, MPR is set to $7.0$.
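For reference, the reported settings can be collected into a single configuration sketch (field names are illustrative; values not restated in the text above, such as model width and learning rate, are omitted rather than guessed):

```python
# Illustrative configuration summary; only values stated in the text are filled in.
sca_tts_config = dict(
    encoder_layers=6,
    decoder_layers=4,
    attention_heads=4,
    rate_function="softplus",
    time_norm="MaskedTimeNorm",
    logit_scale_init=1.0,
    optimizer="AdamW",
    batch_size=48,
    target_features="mel-spectrogram",
    mpr_sweep_parallel=(3.0, 10.0),   # mel-to-phoneme ratio range, parallel regime
    mpr_ar=7.0,                        # fixed mel-to-phoneme ratio, AR regime
)
```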
6. Experimental Evaluation: Speech Synthesis Performance and Alignment
On the LJSpeech-1.1 corpus (13,100 utterances, 80-dim mel-spectrograms, 22.05 kHz), inference was conducted using a fixed HiFi-GAN vocoder. Evaluations used both Whisper and wav2vec2-CTC automatic speech recognition.
Parallel decoding (reference setting MPR = 6.0): SDPA's performance degrades when MPR moves outside a narrow operating range (alignment blur, over- or under-generation), while SCA maintains stable WER across the full swept range ($3.0$ to $10.0$), demonstrating robust speed/rate control.
Autoregressive Decoding:
- SDPA yields no coherent alignments under causal masks and teacher forcing.
- SCA (unnormalized): WER = 66.5%; CER = 48.5% on 1,852 evaluated ARCTIC+Harvard sentences.
Visual analysis of attention matrices reveals that SCA produces sharper, near-diagonal, and continuous attention, versus drifting or noisy patterns for SDPA. The mid-sequence "softening" observed matches the theoretical Brownian-bridge variance, suggesting a useful inductive bias against over-rigid alignment.
7. Extensions, Applications, and Limitations
SCA generalizes to multi-scale or hierarchical clocks that couple global and local rate modeling and could inform clock-guided diffusion or flow-matching decoders. Its potential extends to continuous sequence alignment tasks beyond audio, including video frame alignment, motion capture, and expressive music performance generation.
Limitations include:
- The strict monotonic alignment assumption precludes non-monotonic (large reordering) alignments.
- The normalized clock regime requires a known or accurately predicted global target length.
- For discrete text generation, global length control may be semantically risky without explicit external modeling.
SCA thus provides an inductively structured, nearly parameter-free attention mechanism that enforces monotonicity and continuity by construction, offers a closed-form Gaussian scoring rule based on learned clock integrals, and improves robustness and stability of sequence alignment compared to conventional SDPA—especially in tasks requiring precise, time-synchronous mappings (Soh et al., 18 Sep 2025).