Streaming Causal Attention Masks (SCAM)

Updated 4 July 2026

SCAM is a family of structured attention masking schemes that enforce causality with controlled lookahead for efficient, real-time streaming processing.
Variants like StreamFlow, StableMask, U2 Whisper, LiveStarPro, and S5-TTS demonstrate distinct mechanisms to shape context, adjust normalization, and maintain semantic integrity.
SCAM enables fixed or bounded computational costs, delivering improved latency and performance across modalities while managing trade-offs between future access and context.

Streaming Causal Attention Masks (SCAM) denote a family of attention-masking schemes for incremental sequence processing in which causality is enforced under streaming constraints and the visible context is shaped to control latency, alignment, extrapolation, and boundary behavior. The acronym is explicit in LiveStarPro, where SCAM is a training strategy for incremental video-language alignment, but closely related mechanisms appear under other names: block-wise guided attention masks in StreamFlow, pseudo-attention refinement of causal masking in StableMask, dynamic causal masks and diagonal causal masks in streaming Whisper, and lookahead-causal masking in S5-TTS (Yang et al., 16 Jun 2026, Guo et al., 30 Jun 2025, Yin et al., 2024, Zhou et al., 13 Jun 2025, Du et al., 20 Jun 2026).

1. Taxonomy of SCAM formulations

SCAM is not a single canonical mask. In the recent literature, it designates a design space of structured visibility constraints tailored to streaming objectives. Some variants are strictly causal, some admit bounded lookahead, some alter the normalization geometry of attention, and some encode modality- or segment-aware exclusions that go beyond ordinary lower-triangular masking.

Work	Mask structure	Primary function
StreamFlow	Block, backward, and forward masks over token blocks	Fixed-size sliding receptive field for streaming flow matching
StableMask	Pseudo-attention in masked future positions; suffix-pseudo compression at inference	Sub-stochastic causal attention, extrapolation stability, KV-cache compatibility
U2 Whisper	Dynamic chunkwise encoder masks and diagonal decoder causal mask	Streaming ASR partials and batched rescoring
LiveStarPro	Time-, modality-, clip-, and segment-aware binary mask	Incremental video-language alignment without intra-clip caption copying
S5-TTS	Prefix-plus-lookahead masks in encoder self-attention and decoder cross-attention	Word-by-word streaming TTS with limited lookahead

Two distinctions organize this taxonomy. First, SCAM may operate as an architectural locality constraint, as in StreamFlow’s block-wise receptive fields or S5-TTS’s limited word lookahead. Second, it may operate as a distributional correction, as in StableMask, where the causal mask itself is refined so that attention over real tokens is sub-stochastic rather than forced to sum to one. LiveStarPro adds a third axis: semantic masking, in which the mask suppresses shortcut paths that are causal in time but undesirable for grounding because they permit copying earlier captions from the same clip (Guo et al., 30 Jun 2025, Yin et al., 2024, Yang et al., 16 Jun 2026).

2. Mathematical forms of masking

A common backbone across these systems is masked scaled dot-product attention. StreamFlow and U2 Whisper both state the standard form

$A = \mathrm{softmax}\!\left(\frac{QK^\top + M}{\sqrt{d_k}}\right),$

with mask entries set to $0$ for allowed keys and $-\infty$ otherwise. What differentiates SCAM variants is therefore not the attention operator itself but the structure of $M$ (Guo et al., 30 Jun 2025, Zhou et al., 13 Jun 2025).
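The additive-mask convention above can be illustrated with a minimal pure-Python sketch. The helper names `masked_attention_weights` and `causal_mask` are hypothetical and do not come from any of the cited systems; the code only demonstrates that keys carrying a $-\infty$ mask entry receive exactly zero weight after the softmax.

```python
import math

NEG_INF = float("-inf")


def causal_mask(n):
    """Additive lower-triangular mask: 0 for allowed keys (j <= i), -inf otherwise."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def masked_attention_weights(q, k, mask):
    """Row-wise softmax((q k^T + mask) / sqrt(d_k)) on small list-of-lists inputs.

    Assumes every query row has at least one allowed key, so the row maximum is finite.
    """
    d_k = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        logits = [
            (sum(a * b for a, b in zip(q_i, k_j)) + mask[i][j]) / math.sqrt(d_k)
            for j, k_j in enumerate(k)
        ]
        peak = max(logits)                           # subtract the max for stability
        exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Swapping `causal_mask` for any other 0 / $-\infty$ matrix changes which keys are visible without touching the attention operator, which is the sense in which the variants differ only in the structure of $M$.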

In StreamFlow, the sequence is segmented into blocks of size $B$, with block index

$b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$

SCAM is realized through three block-wise masks. The Block Mask permits only intra-block attention; the Backward Mask permits attention to the current block and the immediately preceding block; the Forward Mask permits attention to the current block and the immediately subsequent block. These masks are assigned across DiT layers, so receptive-field growth is produced by multi-hop propagation rather than by a single wide attention layer. If $p$ layers use backward masks and $q$ layers use forward masks, the effective receptive field spans $(p+q+1)B$ tokens (Guo et al., 30 Jun 2025).
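The multi-hop receptive-field claim can be checked with a small sketch. It assumes boolean visibility matrices and a hypothetical layer schedule; the function names are illustrative and the layer order is not taken from the paper. Composing the per-layer masks as boolean matrix products gives the set of input tokens that can influence one output token.

```python
def block_visibility(n, block_size, back=0, fwd=0):
    """visible[i][j] is True when key j lies in the query's block, in one of the
    `back` preceding blocks, or in one of the `fwd` following blocks."""
    return [
        [-back <= (j // block_size) - (i // block_size) <= fwd for j in range(n)]
        for i in range(n)
    ]


def compose(outer, inner):
    """Boolean matrix product: i reaches k if it sees some j that reaches k."""
    n = len(outer)
    return [
        [any(outer[i][j] and inner[j][k] for j in range(n)) for k in range(n)]
        for i in range(n)
    ]


def receptive_field(layer_masks, n):
    """Reachability after stacking the given layers, starting from the identity."""
    reach = [[i == k for k in range(n)] for i in range(n)]
    for mask in layer_masks:
        reach = compose(mask, reach)
    return reach


if __name__ == "__main__":
    B, n = 2, 16                                        # 8 blocks of 2 tokens
    block = block_visibility(n, B)                      # Block Mask
    backward = block_visibility(n, B, back=1)           # Backward Mask
    forward = block_visibility(n, B, fwd=1)             # Forward Mask
    layers = [block, backward, backward, forward]       # p = 2, q = 1 (hypothetical)
    reach = receptive_field(layers, n)
    print(sum(reach[8]))                                # tokens visible to token 8
```

For a token far enough from the sequence edges, the count equals $(p+q+1)B$, here $(2+1+1)\cdot 2 = 8$; layers that use only the Block Mask do not widen the field.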

StableMask departs more radically from classical masking. Instead of filling the upper triangular region with $-\infty$, it inserts pseudo-attention logits into future positions before softmax and reapplies the causal mask afterward:

$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} \odot C + P\right) \odot C.$

Here $C$ is the lower-triangular indicator and $P$ fills future positions with linearly decaying pseudo logits. The effect is that some probability mass is absorbed by pseudo channels rather than being forced onto visible tokens, so the row sum over real tokens becomes strictly less than $1$. In streaming inference, the full set of pseudo channels is compressed into a single suffix-pseudo channel with logit $0$4, preserving KV-cache reuse while retaining the absorption mechanism (Yin et al., 2024).
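The absorption effect can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact parameterization: the pseudo logit at a future position is taken to be `-gamma * (j - i)`, where `gamma` is a hypothetical decay rate, and the function name is illustrative.

```python
import math


def pseudo_masked_weights(scores, gamma):
    """Softmax over real logits (j <= i) and pseudo logits (j > i), then re-mask.

    scores: square list of lists of already scaled attention logits.
    gamma:  assumed linear decay rate of the pseudo logits (hypothetical form).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        logits = [
            scores[i][j] if j <= i else -gamma * (j - i)  # pseudo logit in the future slot
            for j in range(n)
        ]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)                                  # includes pseudo channels
        # Reapply the causal mask: future positions contribute no output weight.
        weights.append([exps[j] / total if j <= i else 0.0 for j in range(n)])
    return weights


if __name__ == "__main__":
    flat = [[0.0] * 4 for _ in range(4)]
    for row in pseudo_masked_weights(flat, gamma=1.0):
        print(round(sum(row), 4))  # mass left on real tokens in each row
```

Every row that has at least one future slot keeps less than unit mass on real tokens. The last row of a finite sequence has no future slots in this sketch and therefore still sums to one.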

U2 Whisper uses chunkwise causal masking rather than block-graph composition or pseudo-attention. A general encoder mask is written as

$0$5

where $0$6 is left context and $0$7 is right lookahead. During training, chunk sizes are sampled uniformly between $0$8 and $0$9 seconds, and the encoder learns to operate under dynamic attention masks matching those chunk configurations. Decoder self-attention in the second-pass rescoring stage uses the strictly causal mask

$M^{\mathrm{dec}}_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases}$

which permits batched teacher-forced rescoring while preserving autoregressive factorization (Zhou et al., 13 Jun 2025).
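A minimal sketch of both mask types follows. The chunk-level rule used here (a query frame sees its own chunk, `left_chunks` earlier chunks, and `right_chunks` later chunks) is an assumption chosen for illustration; the paper's exact encoder mask may be defined differently, and all function and parameter names are hypothetical.

```python
NEG_INF = float("-inf")


def chunk_encoder_mask(n_frames, chunk, left_chunks, right_chunks=0):
    """Additive mask in which frame i attends to every frame whose chunk index lies
    in [chunk(i) - left_chunks, chunk(i) + right_chunks]."""
    mask = []
    for i in range(n_frames):
        ci = i // chunk
        mask.append([
            0.0 if ci - left_chunks <= (j // chunk) <= ci + right_chunks else NEG_INF
            for j in range(n_frames)
        ])
    return mask


def causal_decoder_mask(n_tokens):
    """Additive decoder mask that allows token i to attend to tokens j <= i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n_tokens)] for i in range(n_tokens)]
```

Within a chunk, attention is bidirectional, so frame 0 may see frame 1 when both fall in the same chunk; causality is enforced only across chunk boundaries. Drawing a different `chunk` value per training batch is one way to realize the dynamic masks described above.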

LiveStarPro defines SCAM over mixed multimodal token streams with metadata for modality, time, clip id, and caption segment id. The base causal rule is $-\infty$ 1 only if $-\infty$ 2. Video-to-video attention is time-causal; text-to-video attention is time-causal; video-to-text attention is blocked by default; and text-to-text attention is allowed only within the current caption segment, plus a whitelist over terminal captions of earlier clips. The core rule is streaming intra-clip caption suppression: when generating the caption for the current frame in clip $-\infty$ 3, text tokens from earlier caption segments in the same clip are masked, even though they are in the past. This prevents the model from copying prior narration inside an event segment and forces grounding in current visual evidence (Yang et al., 16 Jun 2026).
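The visibility rules above can be sketched as a per-pair predicate. Several details are assumptions made for illustration: "X-to-Y attention" is read as a query of modality X attending to a key of modality Y, the `terminal` flag marking the last caption of a clip is a hypothetical field, and the `Token` layout is not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    modality: str                   # "video" or "text"
    time: int                       # frame timestamp the token belongs to
    clip: int                       # semantic clip id
    segment: Optional[int] = None   # caption segment id (text tokens only)
    terminal: bool = False          # hypothetical: last caption of its clip


def visible(query, key, q_pos, k_pos):
    """True if the query token may attend to the key token."""
    if k_pos > q_pos or key.time > query.time:
        return False                 # base causal rule in position and time
    if key.modality == "video":
        return True                  # video and text queries both see past video
    if query.modality == "video":
        return False                 # video queries never read text keys
    if key.clip == query.clip:
        return key.segment == query.segment  # intra-clip caption suppression
    return key.clip < query.clip and key.terminal  # whitelist of earlier terminal captions


def build_mask(tokens):
    """Binary visibility matrix over an interleaved frame-caption stream."""
    return [
        [visible(q, k, i, j) for j, k in enumerate(tokens)]
        for i, q in enumerate(tokens)
    ]
```

In a stream with two captions inside one clip, the second caption sees all earlier frames but not the first caption's text, which is the shortcut the suppression rule is meant to remove.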

S5-TTS instantiates SCAM as lookahead-causal masking at the word level. If $w(i)$ maps encoder step $i$ to a word index and $a(t)$ maps decoder step $t$ to its aligned word index, the encoder self-attention mask is

$M^{\mathrm{enc}}_{ij} = \begin{cases} 0, & w(j) \le w(i) + k \\ -\infty, & \text{otherwise} \end{cases}$

and the decoder cross-attention mask is

$M^{\mathrm{dec}}_{tj} = \begin{cases} 0, & w(j) \le a(t) + k \\ -\infty, & \text{otherwise.} \end{cases}$

Both masks therefore expose the causal prefix plus at most $k$ lookahead words. This constrains future access without reverting to a fully myopic decoder (Du et al., 20 Jun 2026).
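The prefix-plus-lookahead rule can be sketched with boolean masks built from word-index maps. The names `enc_word`, `dec_word`, and `lookahead` are illustrative, and the sketch assumes the step-to-word maps are already available (in the real system they would come from the phoneme sequence and the learned alignment).

```python
def encoder_lookahead_mask(enc_word, lookahead):
    """allowed[i][j] is True when encoder step j belongs to a word no later than
    the word of step i plus `lookahead` words."""
    return [[wj <= wi + lookahead for wj in enc_word] for wi in enc_word]


def decoder_cross_mask(dec_word, enc_word, lookahead):
    """allowed[t][j] is True when encoder step j belongs to a word no later than
    the word aligned with decoder step t plus `lookahead` words."""
    return [[wj <= wt + lookahead for wj in enc_word] for wt in dec_word]


if __name__ == "__main__":
    enc_word = [0, 0, 1, 1, 2, 3]   # six phoneme steps spread over four words
    dec_word = [0, 1]               # two decoder steps aligned to words 0 and 1
    print(encoder_lookahead_mask(enc_word, lookahead=1)[0])
    print(decoder_cross_mask(dec_word, enc_word, lookahead=1)[1])
```

With a lookahead of one word, a step in word 0 sees all phonemes of words 0 and 1 but nothing from word 2 onward, so synthesis of a word can start as soon as the next word has arrived.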

3. Integration into model architectures

In StreamFlow, SCAM is integrated directly into a 22-layer DiT that parameterizes the time-dependent vector field for flow matching under Optimal Transport Conditional Flow Matching. The masks are enforced at every DiT self-attention layer during both training and inference, so the model learns and is evaluated under the same fixed-window constraints. Reported schedules include StreamFlow-SR, with forward masking at layer $M$ 1 and backward masking at layers $M$ 2 and $M$ 3, and StreamFlow-LR, which adds a second forward mask at layer $M$ 4. The decoder then runs in a sliding-window manner over chunks of two blocks, and BigVGAN upsamples mel-spectrogram chunks to waveform with similar chunked convolution (Guo et al., 30 Jun 2025).

StableMask is integrated into decoder-only Transformers as a parameter-free replacement for classical causal masking. It is explicitly designed to remain compatible with relative position encoding, KV caching, and FlashAttention. The paper describes fusion into FlashAttention kernels through on-chip addition of the pseudo-attention term and causal indicator, followed by numerically stable blockwise softmax. In this architecture, SCAM is not a streaming window policy but a refinement of normalization and positional signaling inside ordinary causal decoding (Yin et al., 2024).

In the Whisper adaptation, masking is embedded in a two-pass U2 structure. Dynamic causal masks are applied to all encoder self-attention layers during training so that the encoder becomes streaming-compatible, a CTC head produces chunkwise partial hypotheses in the first pass, and the original Whisper decoder performs second-pass rescoring with a diagonal causal self-attention mask. The hybrid loss is

$M$ 5

and training proceeds through an attention-only adaptation stage, a frozen-backbone CTC stage, and a final joint stage (Zhou et al., 13 Jun 2025).

In LiveStarPro, SCAM is explicitly a training-time alignment strategy rather than an inference-time attention mechanism. Training samples are interleaved frame-caption streams derived from semantic clip decomposition. SCAM is computed per chunk within an 8K-token window and uses ragged metadata arrays for time, clip id, and segment id. Inference instead uses Streaming Verification Decoding with a streaming KV cache. The paper therefore separates the roles of masking and caching: SCAM shapes the learned probability landscape, while SVeD and the cache exploit that landscape online (Yang et al., 16 Jun 2026).

In S5-TTS, SCAM is active in both training and inference. The encoder is a T5-like Transformer over phonemes, and the decoder is an autoregressive Transformer that predicts $M$ 6 codec tokens per step. The masks constrain encoder self-attention and decoder cross-attention to a prefix-plus- $M$ 7-lookahead view. Monotonic Alignment Learning applies a CTC loss to cross-attention weights, a Conv-based auxiliary attention produces a robust decoder-to-encoder step map for mask construction, and interleaved multi-source distillation transfers naturalness from a full-context T5-TTS teacher (Du et al., 20 Jun 2026).

4. Streaming mechanics, scaling, and latency

A central virtue of SCAM is that it often converts growing-context inference into fixed-cost or bounded-cost inference. In StreamFlow, each chunk window has length

$M$ 8

comprising the current two-block chunk plus backward and forward context blocks. Because $M$ 9 does not grow with total sequence length, per-chunk attention cost per layer is $B$ 0, memory footprint is $B$ 1 per layer, and no growing KV cache is needed. The reported first-packet latency is approximately $B$ 2 ms on NVIDIA A100 for both StreamFlow-SR and StreamFlow-LR (Guo et al., 30 Jun 2025).

StableMask preserves the asymptotic complexity of standard attention, adding essentially one extra scalar $B$ 3 per row under suffix compression. Its streaming significance lies in normalization stability rather than window truncation: suffix compression restores KV-cache compatibility, and windowed caching of recent KVs remains stable without preserving initial sink tokens, unlike StreamingLLM-style heuristics (Yin et al., 2024).

In U2 Whisper, the computational profile is governed by chunk size and maximum delay. Chunk sizes during training are sampled between $B$ 4 and $B$ 5 seconds; the default chunk size used in evaluation is $B$ 6 s; partial transcript computation is approximately $B$ 7 ms in the reported setup; and longer max-delay values improve WER at the cost of increased runtime because computation is quadratic in input length. The encoder KV cache implements left memory across chunks, and diagonal-causal rescoring avoids token-by-token autoregressive decoding in the second pass (Zhou et al., 13 Jun 2025).

LiveStarPro reports a different benefit profile. SCAM is not required at inference, but the streaming key-value cache in the full system yields a $B$ 8 inference speedup over the same model without caching. The relevance of SCAM is indirect but structural: the model is trained to make probability estimates under strictly incremental, leakage-free context, which is necessary for SVeD’s perplexity-based verification gate (Yang et al., 16 Jun 2026).

S5-TTS bounds latency through word-level lookahead. Generation begins after the first $B$ 9 words arrive; with limited lookahead, the model streams codec chunks word by word and stitches them with a 2-frame overlap and Hanning crossfade. On B200 GPU, first-chunk latency is reported as $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 0 s for S5-TTS and $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 1 s for S5-TTS+IMSD, versus $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 2 s for full-context T5-TTS. End-to-end latency in an LLM+TTS stack falls from $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 3 s to $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 4 s on UltraChat, and from $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 5 s to $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 6 s on LibriTTS (Du et al., 20 Jun 2026).
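The chunk-stitching step can be sketched as an overlap crossfade weighted by the rising half of a Hann window. The sketch assumes chunks are flat lists of numeric values and that the overlap is counted in list elements; whether the actual system crossfades codec frames or decoded samples is not specified here, and the function name is hypothetical.

```python
import math


def hann_crossfade(prev_chunk, next_chunk, overlap=2):
    """Join two chunks whose last / first `overlap` elements cover the same span.

    The outgoing chunk is faded out and the incoming chunk faded in with
    complementary weights taken from the rising half of a Hann window.
    """
    if overlap <= 0 or overlap > min(len(prev_chunk), len(next_chunk)):
        raise ValueError("overlap must be positive and fit inside both chunks")
    # Rising Hann half sampled strictly inside (0, 1), e.g. 0.25 and 0.75 for overlap=2.
    fade_in = [
        0.5 * (1.0 - math.cos(math.pi * (n + 1) / (overlap + 1)))
        for n in range(overlap)
    ]
    blended = [
        (1.0 - w) * a + w * b
        for w, a, b in zip(fade_in, prev_chunk[-overlap:], next_chunk[:overlap])
    ]
    return list(prev_chunk[:-overlap]) + blended + list(next_chunk[overlap:])
```

Because the fade-in and fade-out weights sum to one at every overlapped position, two chunks that agree on the overlap are joined without any change in level, and the output is `overlap` elements shorter than the concatenation.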

5. Empirical behavior across modalities

In speech token decoding, StreamFlow’s SCAM is evaluated against both non-streaming and strictly causal streaming baselines. On objective metrics, DiT-CVS records STOI $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 7, UTMOS $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 8, and PESQ $b(i)=\left\lfloor \frac{i}{B}\right\rfloor.$ 9, whereas StreamFlow-SR records STOI $p$ 0, UTMOS $p$ 1, and PESQ $p$ 2, and StreamFlow-LR records STOI $p$ 3, UTMOS $p$ 4, and PESQ $p$ 5. Subjective results show DiT-CVS at NMOS $p$ 6 and SMOS $p$ 7, StreamFlow-SR at NMOS $p$ 8 and SMOS $p$ 9, and StreamFlow-LR at NMOS $q$ 0 and SMOS $q$ 1. Block-size ablations further show that increasing $q$ 2 from $q$ 3 s to $q$ 4 s improves STOI from $q$ 5 to $q$ 6 and PESQ from $q$ 7 to $q$ 8 (Guo et al., 30 Jun 2025).

In decoder-only language modeling, StableMask improves perplexity across model sizes and position-encoding schemes. For a 160M RoPE model on WikiText-103, perplexity improves from $q$ 9 to $(p+q+1)B$ 0. On OpenLLaMA 1.4B with RoPE on the Pile, the reported perplexities at $(p+q+1)B$ 1B/ $(p+q+1)B$ 2B/ $(p+q+1)B$ 3B/ $(p+q+1)B$ 4B/ $(p+q+1)B$ 5B tokens are $(p+q+1)B$ 6 for the baseline and $(p+q+1)B$ 7 for StableMask. The paper also reports that windowed attention extrapolation remains stable without preserving initial tokens and that synthetic absolute-position tasks improve from approximately $(p+q+1)B$ 8– $(p+q+1)B$ 9 accuracy for RPE variants to more than $-\infty$ 0 for APE variants, with StableMask restoring positional awareness via sub-stochastic rows (Yin et al., 2024).

In streaming ASR, U2 Whisper shows the expected accuracy-latency trade-off. On the earnings dataset, chunk-size sweeps yield $-\infty$ 1 WER at $-\infty$ 2 ms with rescoring, $-\infty$ 3 at $-\infty$ 4 ms, and $-\infty$ 5 at $-\infty$ 6 ms. Max-delay sweeps on 4 vCPU Xeon 6240 with 8-bit quantization give $-\infty$ 7 WER, RTF $-\infty$ 8, and average finalize latency $-\infty$ 9 ms at $0$00 s, versus $0$01 WER, RTF $0$02, and average finalize latency $0$03 ms at $0$04 s, and $0$05 WER, RTF $0$06, and average finalize latency $0$07 ms at $0$08 s. The hybrid tokenizer improves data efficiency, for example from $0$09 to $0$10 WER at $0$11 h of training data (Zhou et al., 13 Jun 2025).

In proactive video understanding, LiveStarPro reports that the offline OmniStarPro-RNG SemCor score rises from $0$12 for the InternVideo2.5 backbone without streaming fine-tuning to $0$13 for LiveStar and $0$14 for LiveStarPro. In online OmniStarPro tasks, SCAM together with SVeD yields a $0$15 improvement in semantic correctness and an $0$16 reduction in timing error relative to prior online Video-LLMs (Yang et al., 16 Jun 2026).

In streaming TTS, S5-TTS identifies a narrow useful lookahead regime. On LibriTTS unseen, $0$17 gives CER $0$18, WER $0$19, SSIM $0$20, and UTMOS $0$21; $0$22 gives CER $0$23, WER $0$24, SSIM $0$25, and UTMOS $0$26; $0$27 degrades to CER $0$28 and WER $0$29. Ablations at $0$30 show WER $0$31 for full S5-TTS, $0$32 when both encoder and decoder masks are removed, $0$33 when only the encoder mask is removed, and $0$34 when only the decoder mask is removed. With IMSD, WER improves from $0$35 to $0$36 and UTMOS from $0$37 to $0$38 on LibriTTS (Du et al., 20 Jun 2026).

6. Trade-offs, limitations, and conceptual significance

The most persistent trade-off in SCAM design is between future access and latency. StreamFlow states that one-block forward attention improves boundary smoothness and overall quality but increases first-packet latency proportionally to the lookahead. S5-TTS likewise finds that $0$39–$0$40 is a practical sweet spot, whereas $0$41 distracts alignment and hurts intelligibility. U2 Whisper shows the same pattern at the chunk level: tighter chunking and shorter delays reduce latency but increase WER and formatting errors (Guo et al., 30 Jun 2025, Du et al., 20 Jun 2026, Zhou et al., 13 Jun 2025).

A second trade-off concerns locality versus long-range dependency modeling. StableMask warns that pseudo-attention can over-absorb probability mass if $0$42 is too large, reducing attention paid to useful long-range context; moderate $0$43 and head-wise tuning are recommended. LiveStarPro notes an analogous risk in semantic masking: if clip segmentation is too coarse, suppressing prior intra-clip captions may be overly strict when multiple events occur rapidly in the same clip. These are different mechanisms, but both illustrate that SCAM is effective only when its visibility constraints align with the latent event structure of the stream (Yin et al., 2024, Yang et al., 16 Jun 2026).

A common misconception is that streaming necessarily implies zero future context. The surveyed systems show a narrower and more technical claim: streaming requires that the model not condition on unavailable future information, but limited lookahead can remain compatible with streaming when that information is already present in the upstream pipeline or current chunk. StreamFlow explicitly states that one-block future attention does not violate streaming behavior because the subsequent semantic tokens are available from the upstream Codec-LM; S5-TTS begins speaking after the first $0$44 words; and U2 Whisper allows a small right lookahead within chunkwise encoder masks during training (Guo et al., 30 Jun 2025, Du et al., 20 Jun 2026, Zhou et al., 13 Jun 2025).

Another misconception is that masking is merely an efficiency device. StableMask shows that altering the mask can change the normalization regime and restore absolute positional signaling under RPE; LiveStarPro shows that mask design can determine whether caption probabilities are driven by current visual evidence or by textual copying; and S5-TTS ablations show that train-test consistency of the mask, especially in the encoder, is critical for intelligibility under streaming constraints (Yin et al., 2024, Yang et al., 16 Jun 2026, Du et al., 20 Jun 2026).

Taken together, these works indicate that SCAM is best understood as a structural interface between causality and task-specific inductive bias. In some settings it defines a fixed receptive field, in some it reshapes softmax normalization, in some it enforces cross-modal grounding, and in some it aligns linguistic units with bounded future context. StreamFlow further states that SCAM-style masking is broadly applicable to audio generation with vocoders, long-context diffusion text generation, and sliding-window vision or video synthesis, which suggests that the central idea is not modality-specific but architectural: streaming quality depends not only on what context is available, but on which causal paths the mask permits the model to exploit (Guo et al., 30 Jun 2025).