Continuous Autoregressive Language Models
- Continuous Autoregressive Language Models (CALM) are defined as models that predict continuous latent vectors instead of discrete tokens, boosting semantic bandwidth and generation efficiency.
- They leverage high-fidelity autoencoding to compress token chunks with over 99.9% reconstruction accuracy, ensuring robust and error-tolerant latent representations.
- CALM employs likelihood-free training with an energy-score loss and BrierLM evaluation, achieving better performance-compute trade-offs than traditional autoregressive models, with analogues across modalities.
Continuous Autoregressive Language Models (CALM) define a paradigm in language modeling in which the generative process shifts from discrete, token-by-token prediction to rich, continuous next-vector prediction. This shift enables dramatic improvements in semantic bandwidth, computational efficiency, and flexibility in generative modeling. CALM encompasses architectures and methodologies for text, audio, video, and cross-modal modeling, supported by specialized training, evaluation, and sampling tools. The framework incorporates concepts from information geometry, spectral analysis, and advanced autoencoding, pairing theoretical depth with empirical performance. The following sections delineate the foundational principles, architectures, mathematical formulations, and implications of CALM.
1. From Discrete Token Prediction to Continuous Next-Vector Modeling
Traditional autoregressive LLMs are constructed as sequential predictors over a discrete vocabulary, maximizing the likelihood of one token per step. CALM reconceptualizes language as a sequence of continuous latent vectors, each encoding a chunk of $K$ tokens, thereby amplifying the semantic payload per step and reducing the length of the generative sequence by a factor of $K$ (Shao et al., 31 Oct 2025). An autoencoder compresses each chunk of $K$ tokens $x_{(i-1)K+1:iK}$ into a latent vector $z_i \in \mathbb{R}^d$, so that a sequence of $T$ tokens is represented as $z_1, \dots, z_{T/K}$. The autoregressive model then operates on the latent space:

$$p(z_1, \dots, z_{T/K}) = \prod_{i=1}^{T/K} p_\theta(z_i \mid z_{<i}).$$
The autoencoder is trained to reconstruct each $K$-token chunk with high fidelity (over 99.9% reconstruction accuracy for practical chunk sizes), using a cross-entropy loss over the decoded token sequence. Variational regularization ensures latent robustness: the encoder outputs Gaussian parameters $(\mu, \sigma)$, with a KL penalty toward the standard normal prior $\mathcal{N}(0, I)$.
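To make the latent-space pipeline concrete, the following is a minimal sketch (PyTorch) of generation over continuous vectors: an autoregressive model proposes the next latent from the prefix of previous latents, and a decoder maps each latent back to $K$ tokens. All module names (`ChunkDecoder`, `LatentARModel`) and architectural choices are illustrative placeholders, not the implementation of (Shao et al., 31 Oct 2025).

```python
# Minimal sketch of CALM-style generation over latent vectors (assumed architecture).
import torch
import torch.nn as nn

K, D_LATENT, VOCAB = 4, 128, 32000  # chunk size, latent dim, vocab size (assumed values)

class ChunkDecoder(nn.Module):
    """Maps one latent vector back to K token logits (hypothetical decoder)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(D_LATENT, K * VOCAB)

    def forward(self, z):                       # z: (B, D_LATENT)
        return self.net(z).view(-1, K, VOCAB)   # (B, K, VOCAB)

class LatentARModel(nn.Module):
    """Autoregressive prior over latents: z_i ~ p(z_i | z_<i), sampled implicitly."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D_LATENT, D_LATENT, batch_first=True)  # stand-in for a Transformer backbone
        self.head = nn.Linear(2 * D_LATENT, D_LATENT)            # generative head: (state, noise) -> sample

    def next_latent(self, z_prefix):            # z_prefix: (B, t, D_LATENT)
        h, _ = self.rnn(z_prefix)
        state = h[:, -1]                         # summary of z_<i
        noise = torch.randn_like(state)          # implicit sampling: no closed-form likelihood
        return self.head(torch.cat([state, noise], dim=-1))

@torch.no_grad()
def generate(model, decoder, z0, n_chunks):
    """z0: (B, D_LATENT), e.g. the encoded prompt chunk; returns (B, n_chunks * K) tokens."""
    zs, tokens = [z0], []
    for _ in range(n_chunks):
        z_next = model.next_latent(torch.stack(zs, dim=1))
        zs.append(z_next)
        tokens.append(decoder(z_next).argmax(-1))   # K tokens emitted per autoregressive step
    return torch.cat(tokens, dim=1)
```

Each step of the loop emits $K$ tokens at once, which is the source of the sequence-length reduction described above.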
A similar shift to continuous autoregression characterizes recent models in audio (Simon et al., 8 Sep 2025), video (Yu et al., 17 Jun 2025), and flow-based text modeling (Zhang et al., 1 Jul 2025). In each domain, modeling proceeds in continuous latent space, replacing quantization with high-fidelity compression and enabling rich, bandwidth-efficient generative steps.
2. High-Fidelity Autoencoding and Robust Latent Representations
The efficacy of CALM depends on constructing autoencoders that are both high-fidelity and robust to generative error accumulation. For text, an encoder (token embeddings, followed by an FFN and a linear projection) compresses a $K$-token chunk into a latent vector; the decoder inverts this map, reconstructing the $K$ tokens from the latent. The total loss is

$$\mathcal{L}_{\text{AE}} = \underbrace{\mathbb{E}\Big[-\textstyle\sum_{k=1}^{K} \log p_\phi(x_k \mid z)\Big]}_{\text{reconstruction cross-entropy}} \;+\; \beta\, D_{\mathrm{KL}}\!\big(q_\psi(z \mid x_{1:K}) \,\|\, \mathcal{N}(0, I)\big).$$
KL clipping and dropout (masking of latent and input components) are employed to ensure redundancy and error tolerance. In practice, moderate latent dimensionality and chunk sizes suffice for near-perfect reconstruction. Stochastic encoding of the latents (sampling $z \sim \mathcal{N}(\mu, \sigma^2 I)$) ensures robustness under noise, facilitating stable autoregressive generation in the continuous domain.
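The autoencoder objective and stochastic encoding above can be sketched as follows; the per-dimension clipping style ("free bits"), the $\beta$ weight, and the dropout rate are assumptions for illustration rather than the paper's exact recipe.

```python
# Sketch of the chunk-autoencoder training loss: cross-entropy reconstruction
# plus a clipped KL term, with latent dropout for robustness (assumed details).
import torch
import torch.nn.functional as F

def chunk_autoencoder_loss(logits, target_tokens, mu, logvar, beta=0.1, kl_clip=0.5):
    """logits: (B, K, V) decoded from z; target_tokens: (B, K); mu/logvar: (B, D)."""
    # Reconstruction: cross-entropy over the K decoded tokens.
    ce = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())

    # Variational term: KL(N(mu, sigma^2) || N(0, I)), clipped per dimension so
    # every latent dimension keeps some free capacity ("KL clipping").
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)
    kl = torch.clamp(kl_per_dim, min=kl_clip).sum(-1).mean()

    return ce + beta * kl

def sample_latent(mu, logvar, latent_dropout=0.1, training=True):
    """Reparameterized latent with dropout masking to encourage redundancy."""
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    if training:
        z = F.dropout(z, p=latent_dropout)
    return z
```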
In audio, a VAE-GAN backbone encodes the waveform as a sequence of continuous latents, optimized with joint time-domain, frequency-domain, adversarial, and KL losses (Simon et al., 8 Sep 2025). For video, continuous tokens extracted from pretrained VAEs are autoregressed with masked parallelism and curriculum learning (Yu et al., 17 Jun 2025). Across modalities, the elimination of lossy quantization is central to CALM’s fidelity and efficiency.
3. Likelihood-Free Training, Evaluation, and Sampling
Autoregressive modeling in continuous space invalidates the softmax-cross-entropy routine. CALM introduces likelihood-free training via strictly proper scoring rules, most notably the energy score:

$$\mathcal{L}_{\text{energy}}(\theta) = \mathbb{E}_{z \sim q}\,\mathbb{E}_{\hat z \sim p_\theta}\big[\|\hat z - z\|^{\beta}\big] \;-\; \tfrac{1}{2}\,\mathbb{E}_{\hat z, \hat z' \sim p_\theta}\big[\|\hat z - \hat z'\|^{\beta}\big], \qquad \beta \in (0, 2),$$

which is minimized in expectation only when the model distribution matches the data distribution.
The empirical energy loss averages distances between model samples and data samples drawn from the autoencoder posterior. Training the generative head (the Energy Transformer) to minimize this score aligns the model and data distributions in latent space without requiring explicit likelihoods.
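A minimal sketch of this empirical loss is given below, assuming $S$ samples from the generative head and $M$ samples from the autoencoder posterior per position; the exponent `beta` and the batching are illustrative choices.

```python
# Empirical energy-score loss between model samples and posterior samples (sketch).
import torch

def energy_loss(model_samples, data_samples, beta=1.0):
    """model_samples: (B, S, D) samples from the generative head;
    data_samples:  (B, M, D) samples from the autoencoder posterior q(z|x)."""
    # Cross term: average ||z_hat - z||^beta between model and data samples.
    cross = torch.cdist(model_samples, data_samples).pow(beta).mean(dim=(1, 2))
    # Self term: average ||z_hat - z_hat'||^beta among model samples (rewards spread).
    self_d = torch.cdist(model_samples, model_samples).pow(beta)
    s = model_samples.shape[1]
    self_mean = self_d.sum(dim=(1, 2)) / (s * (s - 1) + 1e-8)  # off-diagonal mean
    # Negatively oriented energy score: minimized in expectation only when the
    # model distribution matches the data distribution (strictly proper for beta in (0, 2)).
    return (cross - 0.5 * self_mean).mean()
```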
For evaluation, CALM employs BrierLM, a sample-based adaptation of the Brier score to n-gram language modeling. BrierLM is the geometric mean of the Brier-1 through Brier-4 scores computed from sampled outputs:

$$\text{BrierLM} = \Big(\prod_{n=1}^{4} \text{Brier-}n\Big)^{1/4}.$$
This metric is strongly correlated with cross-entropy (in Spearman rank correlation), validating its use as a likelihood-free counterpart to perplexity.
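A rough sample-based sketch of the Brier-$n$ estimate and its geometric-mean aggregation is shown below. It assumes the two-sample unbiased estimator $2\cdot\mathbb{1}[x_1 = y] - \mathbb{1}[x_1 = x_2]$ applied position-wise to $n$-grams; the exact matching and normalization used by BrierLM may differ.

```python
# Sketch of a sample-based Brier-n estimate and the BrierLM geometric mean.
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def brier_n(sample1, sample2, reference, n):
    """Brier-n estimate from two independent model continuations and the ground truth."""
    s1, s2, ref = ngrams(sample1, n), ngrams(sample2, n), ngrams(reference, n)
    m = min(len(s1), len(s2), len(ref))
    if m == 0:
        return 0.0
    hits = sum(s1[i] == ref[i] for i in range(m)) / m   # estimates P(model n-gram = reference)
    agree = sum(s1[i] == s2[i] for i in range(m)) / m   # estimates sum_k p_k^2 (self-agreement)
    return 2 * hits - agree                              # higher is better

def brierlm(sample1, sample2, reference, eps=1e-8):
    """Geometric mean of Brier-1..4; clamped so the log is defined (illustrative)."""
    scores = [max(brier_n(sample1, sample2, reference, n), eps) for n in range(1, 5)]
    return math.exp(sum(math.log(s) for s in scores) / 4)
```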
Temperature sampling in continuous models, where no logits are available for rescaling, is implemented with a rejection-sampling scheme: for integer $n$, drawing $n$ independent samples per step and accepting only when they all decode to the same output yields a distribution proportional to $p(x)^n$, i.e., temperature $T = 1/n$; a batch-based scheme generalizes this to fractional temperatures. This provides a controlled accuracy-diversity trade-off matching traditional temperature scaling.
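Under the $T = 1/n$ interpretation above, the acceptance rule can be sketched as follows; `sampler` is any black-box draw from the model (e.g., sample a latent and decode it to tokens), and the fallback behavior is an illustrative choice.

```python
# Sketch of likelihood-free temperature sampling by rejection (T = 1/n case).
import random

def sample_tempered(sampler, n, max_tries=10_000):
    """sampler() returns one decoded output (e.g., a token chunk) per call."""
    for _ in range(max_tries):
        draws = [sampler() for _ in range(n)]
        if all(d == draws[0] for d in draws):
            return draws[0]             # accepted: distributed proportionally to p(x)^n
    return sampler()                     # fallback if acceptance is too rare

# Toy usage with a discrete stand-in sampler (illustrative only):
probs = {"a": 0.6, "b": 0.3, "c": 0.1}
toy = lambda: random.choices(list(probs), weights=probs.values())[0]
print(sample_tempered(toy, n=2))         # sharper than toy() alone, i.e. T = 1/2
```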
4. Continuous Autoregressive Modeling Across Modalities and Architectures
CALM's principles have been generalized beyond text:
- Continuous Audio LLMs: Sequence modeling is performed in the latent space of an audio VAE, with Transformer-based long- and short-context modules and a fast consistency MLP for frame prediction. Noise injection and Gaussian temperature scaling yield efficient, high-fidelity speech and music synthesis, outperforming discrete-token baselines both in quality (e.g., lower FAD than the $1.06$/$1.43$ of MusicGen baselines) and in generation speed (Simon et al., 8 Sep 2025).
- VideoMAR: Implements video modeling via continuous tokens with temporal causal autoregression and spatial masked prediction. Next-frame diffusion loss and curriculum strategies enable efficient, high-quality video generation with spatial-temporal extrapolation capacity, supported by 3D rotary position encodings and progressive temperature scheduling during inference (Yu et al., 17 Jun 2025).
- TarFlowLM: Employs transformer-based normalizing flows for continuous latent sequences of text. Mixture-of-Gaussian coupling layers provide bi-directional context, patch-wise flexible generation, and hierarchical multi-pass editing; theoretical analysis shows discrete AR transformers are a special case of the flow-based regime (Zhang et al., 1 Jul 2025).
5. Theoretical Foundations: Information Geometry, Spectral Analysis, and Information Surplus
Recent theoretical frameworks have illuminated why autoregressive models—including CALM—learn highly structured embedding spaces and enable multi-token acceleration.
- Markov Categorical Framework: AR generation is modeled compositionally as Markov kernels, and NLL is identified with an average KL divergence over these kernels. Speculative decoding exploits the information surplus in hidden states, which carry more information about upcoming tokens than the immediate next-token distribution alone requires, justifying multi-token draft-and-verify acceleration (Zhang, 25 Jul 2025).
- Categorical Entropy: Minimizing NLL compels the model to reproduce the data's intrinsic conditional stochasticity rather than collapse to deterministic predictions (a standard form of this identity is sketched after this list).
- Spectral Contrastive Learning: NLL implicitly aligns learned representations with the eigenspectrum of the predictive similarity operator, building geometrically and spectrally structured latent spaces. The Fisher-Rao pullback metric quantifies predictive sensitivity, and the Dirichlet energy of the learned representations matches spectral objectives common to self-supervised contrastive learning.
- Implicit Continuity: Transformers trained on discrete data naturally learn representations giving rise to continuous-time and continuous-space invariance, with causal attention kernels generalizing from weighted sums to integrals over variable-duration, variable-semantic embedding sequences (Marro et al., 4 Apr 2025).
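As a concrete form of the NLL-as-average-KL statement referenced in the list above, the standard per-step decomposition below separates the data's irreducible conditional entropy from the KL term that training actually minimizes; the notation is generic and not tied to the cited paper's categorical formalism.

```latex
% Per-step NLL decomposition (standard identity, stated here for illustration):
% expected NLL = intrinsic conditional entropy of the data + average KL to the model.
\begin{aligned}
\mathbb{E}_{x_{<t},\,x_t \sim p_{\mathrm{data}}}\big[-\log p_\theta(x_t \mid x_{<t})\big]
&= \underbrace{\mathbb{E}_{x_{<t}}\big[H\big(p_{\mathrm{data}}(\cdot \mid x_{<t})\big)\big]}_{\text{irreducible stochasticity}}
\;+\; \underbrace{\mathbb{E}_{x_{<t}}\big[D_{\mathrm{KL}}\big(p_{\mathrm{data}}(\cdot \mid x_{<t})\,\|\,p_\theta(\cdot \mid x_{<t})\big)\big]}_{\text{minimized only by matching each kernel}}.
\end{aligned}
```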
6. Empirical Performance, Scaling Laws, and Efficiency Implications
CALM demonstrates superior performance-compute trade-offs:
- CALM-M (371M params, $K=4$) matches Transformer-S (281M params) in BrierLM, with 44% fewer training FLOPs and 34% fewer inference FLOPs (Shao et al., 31 Oct 2025).
- In video, VideoMAR delivers SOTA fidelity with only 9.3% of the parameters, 0.5% of the training data, and 0.2% of the GPU resources compared to Cosmos I2V (Yu et al., 17 Jun 2025).
- Audio models achieve higher objective and subjective quality than discrete-token baselines while generating substantially faster (Simon et al., 8 Sep 2025).
A plausible implication is that increasing semantic bandwidth (chunk size $K$ or latent dimensionality) provides a new axis for scaling laws, independent of model and data size, potentially unlocking further efficiency and capacity gains.
7. Design Principles, Future Directions, and Multimodal Extension
CALM principles yield several guidelines:
- Adopt high-fidelity, robust autoencoding as foundational.
- Model language, audio, and video as sequences of continuous latent vectors with autoregressive/flow-based priors.
- Employ strictly proper, likelihood-free scoring rules (energy/Brier) for both training and evaluation.
- Integrate flexible, likelihood-free sampling schemes, ensuring control over diversity/accuracy trade-offs.
- Exploit curriculum learning and progressive inference strategies for long/sequential modalities.
Potential future directions include scaling next-vector prediction to larger chunk sizes $K$ with larger models, designing more expressive autoencoders and decoders, and extending CALM architectures to further modalities (e.g., image, code). Unified multimodal generative models based on continuous autoregression are anticipated, leveraging shared architectural and theoretical insights.
Key Mathematical Formulations in CALM
| Concept | Formula / Mechanism |
|---|---|
| Autoencoder Objective | $\mathcal{L}_{\text{AE}} = \mathbb{E}\big[-\sum_{k=1}^{K}\log p_\phi(x_k \mid z)\big] + \beta\, D_{\mathrm{KL}}\big(q_\psi(z \mid x_{1:K}) \,\Vert\, \mathcal{N}(0,I)\big)$ |
| Variational KL | $D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\sigma^2 I)\,\Vert\,\mathcal{N}(0,I)\big) = \tfrac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2\big)$ |
| AR Next-Vector | $p(z_1,\dots,z_{T/K}) = \prod_{i} p_\theta(z_i \mid z_{<i})$ |
| Energy Score Loss | $\mathbb{E}_{z\sim q}\,\mathbb{E}_{\hat z\sim p_\theta}\Vert \hat z - z\Vert^{\beta} - \tfrac{1}{2}\,\mathbb{E}_{\hat z,\hat z'\sim p_\theta}\Vert \hat z - \hat z'\Vert^{\beta}$ |
| BrierLM Metric | $\big(\prod_{n=1}^{4}\text{Brier-}n\big)^{1/4}$ (geometric mean of sample-based Brier-1 to Brier-4) |
Continuous Autoregressive LLMs represent a scalable, high-bandwidth, and theoretically rich approach to generative modeling. By shifting language modeling into a continuous latent domain, leveraging robust autoencoding, and employing likelihood-free statistical tools, CALM unlocks new frontiers for LLM scaling, multimodal synthesis, and efficient deployment.