CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling

Published 5 Apr 2026 in cs.CL | (2604.04250v1)

Abstract: Modern LLMs rely on Transformer self-attention, which scales quadratically with sequence length. Recent linear-time alternatives, like State Space Models (SSMs), often suffer from signal degradation over extended contexts. We introduce the Continuous Acoustic Wave Network (CAWN), a fully continuous sequence-mixing architecture. Instead of discrete matrix-based attention, CAWN projects hidden states into multi-headed complex-domain phasors, achieving sequence mixing through a causal, $O(L)$ Phase Accumulation mechanism. To prevent signal degradation over ultra-long contexts, we introduce a dual-gated Selective Phase Resonance mechanism incorporating Frequency-Dependent Retention, Hard-Threshold Gating via Straight-Through Estimation, and a Temporal Syntax Cache to capture short-term local dependencies. We also replace standard dense linear projections with Depth-wise Harmonic Convolutions for optimal spatial frequency mixing, augmented by Block Attention Residuals for depth-wise state routing. Scaled to a 150M-parameter model, CAWN utilizes custom Triton kernels for hardware-efficient, true-complex phase accumulation in float32. Trained via a continuous streaming loop on a 100-Billion-token corpus, the prototype is evaluated at a 5-Billion-token milestone. Empirical evaluations via a Targeted Semantic Retrieval protocol demonstrate robust vocabulary acquisition and explicitly learned contextual denoising over extended contexts. By leveraging $O(1)$ state-passing via chunked prefill, the model retrieves targeted information across 2,000,000 tokens while strictly plateauing at 8.72 GB of Peak VRAM, empirically overcoming the $O(L^2)$ context memory wall.


Authors (2)

Summary

The paper introduces a novel Resonance Layer that encodes sequence information using continuous multi-harmonic phase interference in the complex domain.
It reports linear memory scaling during prefill, constant memory with chunked prefill, and constant per-token autoregressive generation speed, handling far longer contexts than a standard Transformer baseline.
Empirical results show lower validation perplexity than a Pythia-160M parity baseline, strong zero-shot accuracy relative to a parameter-matched Transformer, and successful retrieval over contexts of up to two million tokens.

Continuous Acoustic Wave Networks for Autoregressive Language Modeling

Motivation and Background

The Transformer architecture and its quadratic-time self-attention mechanism have historically limited context window size, creating computational and memory bottlenecks as sequence length grows. Recent work has introduced linear-time alternatives such as State Space Models (SSMs) and Fourier-based approaches. However, these models often lose signal fidelity over ultra-long contexts, either because their state representations are compressive or because of limitations in frequency-domain mixing. “CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling” (2604.04250) proposes an alternative to both explicit attention and discrete context compression: it performs sequence mixing directly through continuous-time, multi-harmonic wave interference in the complex domain.

Architectural Innovations

CAWN replaces the standard self-attention layer with a fully continuous Resonance Layer. Sequence information is encoded and propagated as streams of multi-headed phasors in the complex domain, which fundamentally changes how sequence mixing works. Key architectural components include:

Complex-Domain Wave Embedding: Hidden states are projected to magnitude/phase tuples (phasors), with grammar and semantics realized as constructive/destructive wave interference, rather than explicit matrix-based affinity computation.
Causal Phase Accumulation: An $\mathcal{O}(L)$ causal phase accumulator applies true-complex phase rotations at each timestep, encoding both historical content and relative position. The authors argue that this exact phase-based encoding avoids the context-memorization limits of SSMs and RNNs.
Selective Phase Resonance Gating: Combines frequency-dependent retention (favoring long-term storage at low harmonics), hard-threshold input gating via straight-through estimation, and a 1D-convolution-based Temporal Syntax Cache for short-term syntactic locality, together preventing both signal washout and unbounded constructive interference.
Depth-wise Harmonic Convolution with Block Attention Residuals: Depth-wise convolutions replace standard dense projections and enable cross-harmonic interaction. They are augmented with attention over depth (across blocks, not across time) via the Block Attention Residual mechanism (Team et al., 16 Mar 2026), which lets later layers recover uncorrupted signals from earlier blocks while keeping the sequence cost at $\mathcal{O}(L)$.
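The summary does not reproduce the paper's exact update equations. As a rough illustration of how causal phase accumulation, frequency-dependent retention, and hard-threshold gating could fit together, the sketch below runs a generic decaying complex linear recurrence $h_t = r_k e^{i\omega_k} h_{t-1} + g_t x_t$ for each harmonic channel $k$. The function names, the retention schedule `exp(-alpha * omega)`, and the threshold value are hypothetical choices; this is not the authors' implementation, and the Temporal Syntax Cache and Block Attention Residuals are omitted.

```python
import cmath
import math


def frequency_dependent_retention(omegas, alpha=0.01):
    # Hypothetical schedule: low-frequency harmonics decay more slowly,
    # so they act as long-term memory; high frequencies forget faster.
    return [math.exp(-alpha * w) for w in omegas]


def phase_accumulate(xs, omegas, retention, threshold=0.5):
    """Causal O(L) phase accumulation over a sequence (hypothetical sketch).

    xs:        list of timesteps, each a list with one real input per channel
    omegas:    angular frequency (radians per token) of each harmonic channel
    retention: per-channel decay factor in (0, 1]
    threshold: hard gate; inputs with |x| <= threshold are not written
    Returns the complex state of every channel after each timestep.
    """
    # Combined decay-and-rotation factor r_k * exp(i * omega_k) per channel.
    rot = [r * cmath.exp(1j * w) for r, w in zip(retention, omegas)]
    state = [0j] * len(omegas)
    outputs = []
    for x_t in xs:
        new_state = []
        for k, x in enumerate(x_t):
            # Hard-threshold gate (forward pass only). During training a
            # straight-through estimator would let gradients bypass this step.
            gate = 1.0 if abs(x) > threshold else 0.0
            # Rotating the old state by omega_k each step means a value written
            # d steps ago carries phase d * omega_k: its relative position.
            new_state.append(rot[k] * state[k] + gate * x)
        state = new_state
        outputs.append(state)
    return outputs
```

With a single channel at $\omega = \pi/2$ and no decay, an impulse written at step 0 reads back as $1, i, -1, -i$ on successive steps, so its phase records how long ago it was written.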

Methodology

The model is instantiated at a 150M-parameter scale with embedding dimension $D=896$, 16 stacked Resonance Layers, 4 acoustic heads with 64 harmonics each (4 × 64 = 256 harmonic wave channels), and a standard feed-forward expansion. Training streams continuously from a 100B-token English corpus rather than cycling over fixed epochs, which avoids epoch-level overfitting. The data is augmented with high-entropy noise blocks interleaved with targeted semantic queries, explicitly forcing the model to learn robust denoising and associative recall.
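The summary does not give the exact layout of the augmented samples. A minimal sketch, assuming a simple fact / noise / query / answer layout, shows how a uniformly random noise block forces the answer to be predicted by recall across the noise rather than from local context. `build_denoising_sample` and its arguments are hypothetical.

```python
import random


def build_denoising_sample(fact, query, answer, noise_len, vocab_size, seed=0):
    """Assemble one hypothetical augmented training sample.

    Layout: fact tokens, a block of uniformly random (maximum-entropy) noise
    tokens, a query, then the answer. Returns (tokens, answer_mask), where
    answer_mask is True at answer positions, whose prediction requires
    recalling the fact across the noise block.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    # Uniform random token ids carry no exploitable local pattern.
    noise = [rng.randrange(vocab_size) for _ in range(noise_len)]
    tokens = fact + noise + query + answer
    answer_start = len(fact) + noise_len + len(query)
    answer_mask = [i >= answer_start for i in range(len(tokens))]
    return tokens, answer_mask
```

Growing `noise_len` over training would be one way to stretch the recall distance the model must bridge.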

Numerically, all phase operations are performed in float32 for stability, with the rest of the model running in bfloat16. All phase accumulator operations are implemented using custom Triton kernels [triton], exploiting fast SRAM access and supporting strict and bounded recursive operations on complex values.
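Why keep the phase path in float32? A rough illustration, not taken from the paper: repeatedly rotating a unit phasor should leave its magnitude at exactly 1, but rounding after every step lets the magnitude drift. The sketch simulates float16 and float32 rounding with the standard `struct` module (format codes `'e'` and `'f'`). bfloat16, which `struct` does not support, has even fewer mantissa bits (7 versus float16's 10). The angle and step count are arbitrary choices.

```python
import math
import struct


def round_to(x, fmt):
    # Round a Python float (float64) to half ('e') or single ('f') precision.
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def phasor_magnitude_after(steps, omega, fmt):
    """Rotate z = 1 by omega radians `steps` times, rounding the rotation
    coefficients and the real/imaginary parts to `fmt` precision at every
    step, and return |z|. In exact arithmetic the result is 1."""
    c = round_to(math.cos(omega), fmt)
    s = round_to(math.sin(omega), fmt)
    re, im = 1.0, 0.0
    for _ in range(steps):
        re, im = round_to(re * c - im * s, fmt), round_to(re * s + im * c, fmt)
    return math.hypot(re, im)


half = phasor_magnitude_after(10_000, 0.1, "e")
single = phasor_magnitude_after(10_000, 0.1, "f")
print(f"float16 |z| = {half:.4f}, float32 |z| = {single:.6f}")
```

In this setup the float16 magnitude drifts well away from 1 after 10,000 steps while float32 stays close to it; over millions of tokens even float32 error accumulates, which fits the paper's note that extreme context lengths needed higher-precision accumulator paths.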

Empirical Results

Memory and Throughput

The paper reports clear results on avoiding the $\mathcal{O}(L^2)$ memory and compute wall of standard attention:

CAWN's VRAM usage during prefill scales linearly with context length and stays well below the 8 GB limit that the standard Transformer baseline exhausts: CAWN uses 4.91 GB at 8,192 tokens, while the Transformer runs out of memory (OOM) at 4,096 tokens.
With chunked prefill and state-passing, CAWN’s VRAM plateaus at 8.72 GB even for 2,000,000 tokens, a direct manifestation of $\mathcal{O}(1)$ memory scaling with fixed-size phase state.
Autoregressive generation speed is reported as flat with respect to context length, whereas in conventional architectures per-token decoding slows as the Key-Value cache grows.
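The constant-memory claim for chunked prefill rests on the fact that a linear recurrence can be split at any point as long as the state is handed over. A minimal sketch, assuming a single complex channel with update $h_t = a\,h_{t-1} + x_t$ as a stand-in for the full phase accumulator, checks that chunked processing reproduces the full-sequence result while holding only one chunk plus the state. `scan` and `chunked_prefill` are hypothetical names.

```python
import cmath


def scan(xs, a, h0=0j):
    """Run h_t = a * h_{t-1} + x_t from initial state h0.
    Returns the per-step states and the final state."""
    h = h0
    states = []
    for x in xs:
        h = a * h + x
        states.append(h)
    return states, h


def chunked_prefill(xs, a, chunk_size):
    """Consume xs in fixed-size chunks, carrying only the final state across
    chunk boundaries. Working memory is O(chunk_size), independent of len(xs);
    a real prefill would stream chunks instead of slicing an in-memory list."""
    h = 0j
    for start in range(0, len(xs), chunk_size):
        # Per-step states of the chunk are discarded; only h survives.
        _, h = scan(xs[start:start + chunk_size], a, h)
    return h


a = 0.99 * cmath.exp(0.05j)  # decay 0.99, rotation 0.05 rad per token
xs = [complex(i % 7, -(i % 3)) for i in range(1_000)]
full_state = scan(xs, a)[1]
assert abs(full_state - chunked_prefill(xs, a, chunk_size=64)) < 1e-9
```

Autoregressive decoding is the `chunk_size=1` case: each new token costs one state update, regardless of how many tokens came before it.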

Language Modeling and Reasoning

Validation perplexity on WikiText-103 decreases monotonically and falls below the Pythia-160M parity point (perplexity 127.85 at 2.1B tokens) at around 300k steps, reaching roughly 75 after 5.4B tokens. The authors take this as evidence that compressing context into a fixed-size phase state does not cripple grammatical or semantic modeling efficiency.
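Perplexity is the exponential of the mean per-token negative log-likelihood, so the reported values can be converted back to cross-entropy loss, which is easier to compare on a linear scale. A small sketch (helper names are hypothetical):

```python
import math


def perplexity(token_nlls):
    # exp of the mean per-token negative log-likelihood (in nats)
    return math.exp(sum(token_nlls) / len(token_nlls))


def loss_from_perplexity(ppl):
    # inverse of the above: mean cross-entropy in nats per token
    return math.log(ppl)


baseline_loss = loss_from_perplexity(127.85)  # Pythia-160M parity point
cawn_loss = loss_from_perplexity(75.0)        # CAWN after 5.4B tokens (approx.)
print(f"{baseline_loss:.3f} vs {cawn_loss:.3f} nats/token")
```

This gives about 4.851 versus 4.317 nats per token, a gap of roughly 0.53 nats; since the CAWN figure is approximate, so is the gap.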

Zero-shot accuracy on PIQA and ARC-Easy at 5B tokens (60.23% and 45.45%, respectively) is well above that of the parameter-matched Transformer trained on 2.1B tokens (55.50% and 30.64%), suggesting that reasoning ability survives in a fully continuous, wave-based compression regime. Note that the two models are compared at different token budgets (5B vs. 2.1B).

Extreme Context Extrapolation

CAWN reliably retrieves semantically marked tokens at distances of up to 1,000,000 tokens. At 2,000,000 tokens, precision degrades for certain harmonics (possibly due to phase periodicity or cancellation), but some targets are still retrieved successfully. The authors present this as a multi-order-of-magnitude extension of the usable context window relative to both SSMs and traditional architectures.
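The periodicity explanation can be made concrete: a harmonic with angular frequency $\omega$ returns to the same phase every $2\pi/\omega$ tokens, so on its own it cannot distinguish a distance $d$ from $d$ minus one period. The sketch below assumes a RoPE-style geometric frequency bank $\omega_k = \text{base}^{-k/K}$ with a hypothetical `base=10000`; CAWN's actual frequency schedule is not given in this summary.

```python
import math


def wrapped_phase(omega, distance):
    # Phase (radians, in [0, 2*pi)) a harmonic accumulates over `distance` tokens.
    return (omega * distance) % (2 * math.pi)


def harmonic_periods(num_harmonics, base=10_000.0):
    """Periods in tokens of a hypothetical geometric frequency bank
    omega_k = base ** (-k / num_harmonics), as in RoPE-style encodings."""
    return [2 * math.pi * base ** (k / num_harmonics) for k in range(num_harmonics)]


def count_wrapped(periods, distance):
    # A harmonic whose period is shorter than `distance` has wrapped at least
    # once, so on its own it cannot distinguish `distance` from
    # `distance - period`.
    return sum(p < distance for p in periods)


w = 2 * math.pi / 100  # a harmonic with a 100-token period
# Distances 37 and 137 land on the same phase for this harmonic.
assert abs(wrapped_phase(w, 137) - wrapped_phase(w, 37)) < 1e-9
print(count_wrapped(harmonic_periods(64), 2_000_000))  # prints 64
```

With this assumed bank every one of the 64 harmonics has a period far shorter than 2,000,000 tokens (the longest is about 54,000), so such distances could only be resolved from the joint phase pattern across harmonics, where cancellation between channels becomes a plausible failure mode.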

Theoretical and Practical Implications

The results suggest that explicit matrix-based attention is not a necessary condition for high-precision associative recall and semantic reasoning over long contexts, at least at the 150M-parameter scale tested. The CAWN architecture's $\mathcal{O}(L)$ prefill and $\mathcal{O}(1)$ per-token autoregressive scaling, phase-based context encoding, and explicit block-state attention routing have several important implications:

Models could operate with effectively unbounded context length without a corresponding growth in memory, enabling applications in domains where ultra-long dependencies are critical.
The system’s reliance on stable, interpretable complex-domain mechanics may offer new avenues for analysis and interpretability of underlying representations.
Future architectures may blend attention with, or entirely replace it by, physically inspired continuous mechanisms, opening new directions for efficient large-context, data-efficient, or streaming LMs.

Scaling the approach to larger parameter counts, higher data volumes, and more diverse data regimes is a direct next step. Further, rigorous ablations (including of the gating mechanisms, phase encoding strategies, and the statistical properties of the phase accumulator over time) are necessary to map the architecture's actual limits and potential degradation modes.

Limitations and Future Work

The prototype has several notable limitations:

For extreme context scaling, numerical instabilities in float16 were observed, requiring fallback to float32 or higher-precision accumulator paths.
While data augmentation for denoising contributed empirically to context retention, its isolated effect versus inherent phase dynamics remains to be established.
The architecture has not yet been scaled to the billion-scale regimes characteristic of frontier LMs, where further hardware or statistical bottlenecks may emerge.

Additional work should address hardware co-design for the complex domain and analyze the spectral dynamics of phase states under long-term operation, particularly around phase/amplitude cancellation and harmonics.

Conclusion

CAWN offers a novel architectural paradigm for language modeling, providing evidence that continuous, multi-harmonic phase interference in the complex domain can replace quadratic-time attention for both context mixing and semantic reasoning. It achieves an $\mathcal{O}(L)$ / $\mathcal{O}(1)$ scaling profile and, empirically, robust denoising and recall over contexts of up to two million tokens. The approach invites deeper investigation at scale and suggests a broader role for continuous dynamical systems in efficient, scalable sequence modeling (2604.04250).