Continuous Autoregressive Language Models (2510.27688v1)

Published 31 Oct 2025 in cs.CL, cs.AI, and cs.LG

Abstract: The efficiency of LLMs is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient LLMs. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.

Summary

  • The paper introduces CALM, which compresses token sequences into continuous vectors, reducing the number of generative steps while preserving reconstruction accuracy.
  • It integrates a robust autoencoder with a Transformer-based architecture and an energy-based objective to bypass traditional likelihood bottlenecks.
  • The approach significantly improves computational efficiency, offering scalable potential for deploying language models in resource-constrained environments.

Continuous Autoregressive LLMs

Introduction

Continuous Autoregressive Language Models (CALM) shift language modeling from discrete next-token prediction to continuous next-vector prediction, with the aim of increasing computational efficiency. The paradigm uses a high-fidelity autoencoder to compress each chunk of K tokens into a single continuous vector. This compression reduces the number of generative steps required, significantly decreasing computational demands while maintaining high accuracy in token reconstruction. CALM circumvents the traditional autoregressive bottleneck by enabling models to handle larger semantic units per step.

CALM Framework

The CALM framework is built upon several key components:

  • High-Fidelity Autoencoder: The autoencoder is designed to map a chunk of K discrete tokens into a single continuous vector. The architecture employs a context-free design for efficient processing and is optimized to ensure that the compressed representation can be nearly perfectly reconstructed into the original token sequence.
  • Likelihood-Free Modeling: Shifting to continuous representations, CALM introduces a framework that eschews the need for explicit likelihood calculations, thereby overcoming the limitations posed by the softmax function in large vocabulary settings. This is achieved using an energy-based objective that governs vector generation without explicit probability distributions.
  • Robust Autoencoder Training: To ensure robustness of the latent space, the autoencoder incorporates variational techniques, employing KL divergence regularization alongside dropout. These strategies prevent the latent representation from becoming brittle, so the model can tolerate small perturbations in the predicted vectors during generation (a sketch follows this list).
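
To make the recipe above concrete, here is a minimal PyTorch-style sketch of a chunk autoencoder that combines cross-entropy reconstruction, clipped KL regularization, and latent dropout. The module names, layer shapes, and hyperparameters (K=4, a 128-dimensional latent, 0.1 dropout, the clipping threshold) are illustrative assumptions rather than the paper's exact architecture.

```python
# Hedged sketch of a CALM-style chunk autoencoder; shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkAutoencoder(nn.Module):
    def __init__(self, vocab_size, K=4, d_emb=256, d_latent=128, kl_floor=0.5):
        super().__init__()
        self.K, self.kl_floor = K, kl_floor
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.enc = nn.Sequential(nn.Linear(K * d_emb, d_emb), nn.SiLU())
        self.to_mu = nn.Linear(d_emb, d_latent)
        self.to_logvar = nn.Linear(d_emb, d_latent)
        self.latent_dropout = nn.Dropout(p=0.1)   # robustness to missing/noisy latent dimensions
        self.dec = nn.Sequential(nn.Linear(d_latent, d_emb), nn.SiLU(),
                                 nn.Linear(d_emb, K * d_emb))

    def decode(self, z):
        # z: (B, d_latent) -> (B, K, vocab) logits, reusing the input embeddings (tied weights)
        h = self.dec(self.latent_dropout(z)).view(-1, self.K, self.embed.embedding_dim)
        return h @ self.embed.weight.T

    def forward(self, tokens):                     # tokens: (B, K) integer ids
        h = self.enc(self.embed(tokens).flatten(1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization trick
        logits = self.decode(z)
        recon = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())     # per-dimension KL to N(0, I)
        kl = torch.clamp(kl, min=self.kl_floor).mean()          # clipped KL to discourage collapse
        return recon + kl, z
```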

Implementation Details

The CALM model is implemented by integrating an autoencoder into a Transformer-based architecture:

  1. Autoencoder Design:
    • Encoder: Compresses token sequences into dense vectors using a series of fully connected layers followed by a linear projection to the latent space.
    • Decoder: Maps latent vectors back into token sequences, using an architecture that mirrors the encoder and tied embeddings.
  2. Energy-Based Generative Head: Combines a Transformer-generated hidden state with a stochastic noise input to produce the next continuous vector in a single step. Training uses an energy score to encourage both fidelity and diversity in the generated representations (see the sketch after this list).
  3. Temperature Sampling: To accommodate generation variability, CALM employs a novel temperature sampling method using rejection sampling principles, allowing for controlled sampling without explicit likelihoods.
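
As a concrete reading of item 2, the sketch below shows a single-step generative head that maps a hidden state plus random noise to a latent vector, together with a Monte Carlo estimate of the energy score. The head's depth, the uniform noise source, and N=8 samples are assumptions for exposition, not the paper's verified implementation.

```python
# Hedged sketch of a single-step energy head and an energy-score training loss.
import torch
import torch.nn as nn

class EnergyHead(nn.Module):
    """Maps a Transformer hidden state plus random noise to one continuous vector."""
    def __init__(self, d_model=1024, d_noise=64, d_latent=128):
        super().__init__()
        self.d_noise = d_noise
        self.net = nn.Sequential(
            nn.Linear(d_model + d_noise, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model), nn.SiLU(),
            nn.Linear(d_model, d_latent),
        )

    def sample(self, h, n_samples=1):              # h: (B, d_model)
        eps = torch.rand(h.size(0), n_samples, self.d_noise, device=h.device)    # noise input
        h = h.unsqueeze(1).expand(-1, n_samples, -1)
        return self.net(torch.cat([h, eps], dim=-1))                             # (B, N, d_latent)

def energy_loss(samples, target):
    """Monte Carlo estimate of the energy score (lower is better); needs N >= 2 samples.
    samples: (B, N, d), target: (B, d)."""
    N = samples.size(1)
    to_target = (samples - target.unsqueeze(1)).norm(dim=-1).mean()              # accuracy term
    pairwise = torch.cdist(samples, samples).sum(dim=(1, 2)) / (N * (N - 1))     # diversity term
    return to_target - 0.5 * pairwise.mean()

# Usage: z_samples = head.sample(hidden_states, n_samples=8)
#        loss = energy_loss(z_samples, target_vectors)
```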

Performance & Trade-offs

CALM achieves a substantial improvement in computational efficiency compared to conventional token-based models. Although each step now carries denser, less redundant information, CALM maintains competitive performance, achieving similar or better scores on standard language tasks at far lower computational cost. The model operates effectively at various settings of the semantic bandwidth K, revealing a new axis for optimization beyond traditional parameter-count and dataset-size scaling.

Applications and Future Work

CALM's framework, especially its robust autoencoder, opens pathways for more efficient deployment of LLMs in resource-constrained settings. Future research can explore:

  • Scalability: Investigating the scalability of CALM models and their ability to handle increasingly larger contexts with better efficiency.
  • Integration with Other Models: Potential integration with reinforcement learning or other generative frameworks that benefit from a continuous representation strategy.
  • Advancements in Autoencoder Design: Further developing more semantically informed autoencoders for richer latent representations.

Conclusion

The Continuous Autoregressive LLMs framework offers a promising approach to overcoming inefficiencies in traditional language modeling by enhancing the semantic bandwidth per generative step. This introduces a scalable path forward for developing ultra-efficient LLMs capable of sustaining high performance while significantly reducing computational demands.

Explain it Like I'm 14

Overview: What is this paper about?

This paper introduces a new way to make LLMs faster and more efficient. Today’s LLMs write text one small piece at a time (usually one token, like a word part). That’s slow. The authors propose CALM (Continuous Autoregressive Language Models), which lets the model think and generate in bigger chunks. Instead of predicting the next tiny token, CALM predicts the next “vector” (a bundle of numbers) that represents several tokens at once. This cuts the number of steps needed and saves a lot of compute while keeping quality high.

Key Questions the paper asks

  • Can we pack several tokens into one continuous vector and still recover the original text almost perfectly?
  • If we stop using fixed vocabularies and softmax probabilities, how do we train and evaluate a model that predicts continuous vectors instead?
  • Can we control how “random” or “creative” the model is (temperature) without having direct access to probabilities?
  • Will this approach actually make LLMs cheaper and faster while keeping their performance strong?

How they did it (methods explained simply)

Think of the model like a writer with a better backpack:

  • Instead of carrying one tiny word-piece per step, CALM packs 3–4 tokens together into one compact “vector” and carries that. This shortens the trip.

Here are the main parts of the approach:

1) An autoencoder that “zips” and “unzips” token chunks

  • Autoencoder: Imagine a “zipper” for text. It compresses a small group of tokens (like 4 at a time) into one vector, and then unzips it back to the exact tokens later.
  • High accuracy: Their zipper is very accurate — over 99.9% of the original tokens come back correctly after unzipping.
  • Robustness: If the vector is slightly off, the unzipper should still recover the right tokens. To make it robust, they:
    • Add smooth noise during training (like learning to read a slightly smudged note).
    • Use a “variational” trick that keeps the vectors in a calm, well-organized space rather than a brittle, overpacked one.
    • Apply dropout (temporarily hide parts) so the model learns to handle missing or noisy bits.

2) Predicting the next vector instead of the next token

  • Traditional models pick the next token from a fixed vocabulary using a softmax layer. CALM can’t do that — there’s no fixed list to choose from, and vectors live in an infinite continuous space.
  • So, CALM uses a Transformer to build context, then a lightweight “generative head” that outputs the next vector in one shot (no slow, multi-step sampling).
  • Training without traditional probabilities: Instead of maximizing “likelihood” (which needs softmax), they use the “energy score.” Think of it like throwing darts at the true target. The model gets better scores if its samples:
    • Are close to the real target (accuracy), and
    • Aren’t all identical (diversity).
  • This “energy score” works just by drawing samples — no probability math required.

3) Evaluating the model without perplexity

  • Perplexity needs exact probabilities, which CALM doesn’t have. So they designed “BrierLM,” a fair, likelihood-free score.
  • Simple idea: Ask the model for a couple of guesses and check:
    • How often they match the ground truth, and
    • How often the two guesses match each other (to measure confidence vs. overconfidence).
  • BrierLM is built from these checks and correlates strongly with traditional cross-entropy, so it’s a trustworthy replacement (a toy version of the two-sample check is sketched below).
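
As an illustration of this two-sample idea, the toy estimator below combines a "match the truth" check with a "match each other" check. The exact scaling and n-gram aggregation in the actual BrierLM metric may differ; the function name and example distribution are purely hypothetical.

```python
# Toy, likelihood-free Brier-style estimate built only from samples (illustrative).
import random

def brier_estimate(sample_fn, truth, n_trials=1000):
    """Estimate 2*P(truth) - sum_x P(x)^2 using pairs of independent model samples.
    sample_fn() draws one token (or n-gram) from the model; truth is the reference."""
    total = 0.0
    for _ in range(n_trials):
        x1, x2 = sample_fn(), sample_fn()
        total += (x1 == truth) + (x2 == truth) - (x1 == x2)   # truth matches minus self-match
    return total / n_trials

# Sanity check against a known distribution: expected value = 2*0.6 - (0.36+0.09+0.01) = 0.74
dist = {"the": 0.6, "a": 0.3, "an": 0.1}
sample_fn = lambda: random.choices(list(dist), weights=list(dist.values()))[0]
print(brier_estimate(sample_fn, truth="the"))
```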

4) Temperature control without probabilities

  • Temperature changes how adventurous the model is: low temperature = safe and predictable; high temperature = more creative.
  • CALM doesn’t have logits to scale, so they created a sampler-only method:
    • For certain temperatures, they accept a sample only if multiple independent draws agree — this effectively “raises” probabilities to a power, like traditional temperature does.
    • They generalize this idea to any temperature using a careful accept/reject procedure.
  • Bottom line: You can still turn the “creativity knob” even with a sampler-only model (a minimal sketch of the whole-number case appears below).
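
For whole-number inverse temperatures, the agreement trick above can be sketched in a few lines: accept a value only when n independent draws coincide, which samples from a distribution proportional to P(x)^n, i.e. temperature T = 1/n. The paper's full accept/reject procedure handles arbitrary temperatures; this hedged sketch covers only that simple case.

```python
# Temperature control by agreement, for integer inverse temperatures (illustrative sketch).
import random

def sample_with_temperature(sample_fn, inv_temperature: int, max_tries=10_000):
    """Sample from a distribution proportional to P(x)**inv_temperature, using only a sampler."""
    assert inv_temperature >= 1
    for _ in range(max_tries):
        draws = [sample_fn() for _ in range(inv_temperature)]
        if all(d == draws[0] for d in draws):      # all draws agree -> accept
            return draws[0]
    raise RuntimeError("acceptance rate too low; a batch approximation would be needed here")

# Example: at T = 1/2 a 60/40 distribution sharpens to about 69/31 after renormalization.
coin = lambda: random.choices(["heads", "tails"], weights=[0.6, 0.4])[0]
samples = [sample_with_temperature(coin, inv_temperature=2) for _ in range(1000)]
print(sum(s == "heads" for s in samples) / len(samples))   # ≈ 0.69
```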

Main Findings and Why They Matter

  • Fewer steps, similar quality: Grouping K=4 tokens into one vector lets the model generate text in a quarter as many steps, while keeping performance competitive with strong standard models.
  • Strong compute trade-off: CALM matches the quality of discrete-token baselines but uses significantly less compute. That’s great for speed, cost, and energy use.
  • High-fidelity compression: The autoencoder reconstructs tokens with more than 99.9% accuracy, so you don’t lose details when packing multiple tokens into one vector.
  • Practical toolkit: Training, evaluating, and sampling all work without needing explicit probabilities. This makes continuous next-vector prediction practical at scale.

What this could mean going forward

  • Faster, cheaper LLMs: By predicting bigger chunks each step, models can generate long text more quickly and at lower cost. This could make high-quality language tools more accessible to schools, small companies, and researchers.
  • A new way to scale: Instead of only making models bigger, we can increase how much meaning each step carries. That’s like upgrading from typing letter-by-letter to typing phrase-by-phrase.
  • Broader applications: The same ideas—continuous representations and sampler-only training—could help with other data types (like audio or images) and other generative models that don’t use softmax.
  • Greener AI: Cutting compute means saving energy, which helps reduce the environmental impact of training and running large models.

In short, CALM shows that predicting continuous vectors instead of discrete tokens can be a powerful path to building faster, more efficient LLMs without sacrificing quality.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and open questions that the paper leaves unresolved and that future work could directly address.

  • Context-free autoencoder limitation: The autoencoder encodes chunks independently of surrounding context. How much would a context-aware encoder/decoder (conditioning on previous vectors or broader context) improve robustness, compression fidelity, and downstream LM performance?
  • Scaling laws for K and latent dimension l: The paper primarily reports results for K=4 and l=128. What are the empirical and theoretical scaling relationships among K, l, model size, and performance (e.g., reconstruction accuracy, BrierLM, speedup) for K ∈ {2,4,8,16,...}? What l is minimally sufficient for each K to maintain robustness and low error accumulation?
  • Chunk boundary effects: Only non-overlapping chunks are used. How do overlapping chunking, variable-length chunks, or boundary-aware segmentation (e.g., aligning with word/sentence boundaries) affect coherence, error propagation within a chunk, and long-range dependency modeling?
  • Error accumulation within chunks: Predicting K tokens at once prevents intra-chunk correction. What is the chunk-level error profile (e.g., conditional error rates for positions 1..K) and its impact on downstream tasks as sequence length grows?
  • Input modality mismatch: The formalism initially feeds z-vectors to the Transformer, but the final design “grounds” inputs in discrete tokens via an MLP compressor. What is the quantitative impact of vector-only, token-only, and hybrid inputs (e.g., concatenating compressed tokens with z) on quality and efficiency?
  • End-to-end co-training: The autoencoder is pre-trained and then used with a frozen decoder; gradients do not flow through the argmax decoding step. Can joint or alternating training (e.g., straight-through estimators, RL-style credit assignment, or differentiable relaxations) co-shape the latent space and the LM for better performance?
  • Information-theoretic analysis: The paper argues for increased “semantic bandwidth” per step but provides no bits-per-step or capacity analysis. What is the effective information rate of z (considering l, variance, and decoder redundancy), and how does it compare to discrete token logit entropy and compute costs?
  • Latent robustness beyond in-distribution: Robustness is demonstrated via VAE regularization and dropout, yet out-of-distribution behavior (domain shift, multilingual scripts, code, noisy inputs) is untested. How stable is decoding under distribution shift and adversarial or compounding generative errors?
  • Decoder constraints and controllability: The decoder deterministically argmaxes K tokens from z. How can one enforce format constraints (JSON, grammar), structured outputs, or targeted lexical control without explicit token-level probabilities? Are top-k/top-p analogs feasible in the likelihood-free regime?
  • Likelihood-free temperature sampling practicality: The exact rejection-based algorithm lacks a practical acceptance-rate analysis in the main text, especially for small T (low temperature). What are the expected sampler calls and wall-clock costs under realistic token distributions, and how effective/accurate is the proposed “highly efficient batch approximation” (details not provided)?
  • Alternative controllable sampling methods: Only temperature is addressed. Can likelihood-free analogs of nucleus sampling, top-k, classifier-free guidance, or conditional constraints be developed with comparable control fidelity and efficiency?
  • BrierLM validity and variance: BrierLM uses two samples and indicator functions. What is its variance, sample efficiency, and sensitivity to the number of samples? How does BrierLM correlate with human judgments (e.g., helpfulness, coherence) beyond its correlation with cross-entropy?
  • Metric confounding via autoencoder: For CALM, BrierLM requires decoding via the autoencoder, potentially conflating LM quality with AE reconstruction artifacts (even if rare). How sensitive is BrierLM to small AE errors or latent-space idiosyncrasies?
  • Comprehensive benchmarking: The paper claims efficiency and quality gains but omits detailed benchmarks in the main text (datasets, languages, model scales, wall-clock speedups, KV-cache/memory savings, long-context tests). A thorough, apples-to-apples evaluation is needed, including large-scale settings and diverse domains.
  • Long-context generalization: Reducing sequence length by K does not guarantee preservation of long-range dependencies. How does CALM perform on tasks requiring context windows of 100k+ tokens (summarization, long QA), and what is the effect on KV cache size and attention quality?
  • Training cost vs. runtime savings: Training uses N model samples per step (N=8) and M target draws from the posterior (M=100). What is the end-to-end training compute compared to standard cross-entropy LMs, and how does it scale with model size and K?
  • Energy head architectural choices: The energy-based head uses uniform noise and residual MLPs with specific depth/width ratios. Are there principled design guidelines (e.g., effect of noise dimension, noise distribution, head depth) and ablations showing optimal configurations?
  • Comparison to iterative continuous models: Diffusion/flow matching are dismissed for iterative sampling costs, but quantitative comparisons (quality-speed Pareto) are not provided. How does CALM compare at matched compute with state-of-the-art iterative continuous LMs?
  • Relationship to discrete compression methods: How does CALM compare to token-bundling, larger vocabularies, hierarchical LMs, VQ-VAE/discrete latents, or byte-level models in terms of throughput, controllability, and quality at scale?
  • Alignment and preference optimization: Many alignment methods (e.g., RLHF, DPO) rely on log-probabilities. How can preference optimization be performed in a likelihood-free CALM (e.g., via ranking-based or score-based objectives) without access to token-level logprobs?
  • Safety and calibration: Energy-score training encourages diversity, but its impact on factuality, hallucination rates, and calibration (under- vs over-confidence) is unclear. How does CALM behave under safety and truthfulness evaluations?
  • Tokenization edge cases: How does the approach handle subword boundaries across chunks, special tokens (BOS/EOS), and languages with complex morphology or non-Latin scripts? Are there failure modes when a semantic unit straddles chunk boundaries?
  • Quantization and deployment: z is a real-valued vector. What are the effects of quantizing z (and the energy head) for on-device or low-bandwidth deployment? How many bits are needed to preserve AE fidelity and LM quality?
  • Error analysis granularity: The claim of 99.9% reconstruction accuracy lacks a breakdown by token frequency, POS, or domain. Which token types drive the residual 0.1% errors, and how do these errors impact downstream generation?
  • Theoretical guarantees for energy training: While strictly proper scoring rules are invoked, convergence properties, optimization stability, and mode coverage of the energy loss in high-dimensional latent spaces are not theoretically characterized for language.
  • Data and reproducibility details: Key training details (corpus composition, vocabulary size, language coverage, hyperparameters at scale) and code pointers for BrierLM/temperature sampling implementations are light in the main text, limiting reproducibility and independent verification.

Practical Applications

Practical, Real-World Applications of Continuous Autoregressive LLMs (CALM)

Below, we outline actionable applications derived from the paper’s findings: continuous next-vector prediction via a robust autoencoder; a single-step energy-based generative head; the likelihood-free BrierLM evaluation metric; and likelihood-free temperature sampling. Each item lists the sector, potential tools/products/workflows, and the assumptions or dependencies on which it rests.

Immediate Applications

These applications can be piloted or deployed now, leveraging the released code and the demonstrated K=4 grouping with high-fidelity reconstruction.

  • Sector: Cloud AI/LLM Platforms
    • “CALM Inference Accelerator” module: wrap existing Transformer decoders with the CALM autoencoder + energy-based generative head.
    • GPU/TPU inference server updates to batch the energy-head sampling and AE decoding.
    • Assumptions/dependencies:
      • Quality parity holds for target domains (paper reports comparable performance to strong baselines with lower compute).
      • Integration work to ensure compatibility with serving stacks, streaming APIs, and safety layers that currently expect logits.
  • Sector: Mobile/Embedded AI (Edge)
    • Lightweight CALM variants with quantized autoencoder and compact energy head.
    • Edge SDK offering chunked generation APIs and a frozen AE decoder.
    • Assumptions/dependencies:
      • Robustness of latent vectors under quantization; availability of sufficient memory for the AE and generative head.
      • K must be tuned to balance latency vs reconstruction fidelity on constrained hardware.
  • Sector: Real-Time Operations (Customer Support, Meetings, Contact Centers)
    • Chunk-level streaming: send reconstructed K-token chunks over SSE/WebSockets.
    • Hybrid pipeline: discrete token input compression MLP + continuous prediction + AE decode.
    • Assumptions/dependencies:
      • User experience remains acceptable with chunked outputs; system latency benefits outweigh engineering complexity.
      • Robustness of the decoder under noisy continuous predictions during high-concurrency loads.
  • Sector: Enterprise Document Processing (Legal, Finance, Operations)
    • CALM-enabled batch workers for long-form documents.
    • Job schedulers updated to prioritize CALM models for large workloads.
    • Assumptions/dependencies:
      • Domain-specific prompting and evaluation show parity with token-based baselines.
      • Minimal retraining or fine-tuning required due to AE’s high fidelity.
  • Sector: Software Engineering
    • IDE plugins that swap discrete next-token with CALM next-vector generation for short spans (e.g., K=4–8).
    • Assumptions/dependencies:
      • Reconstruction accuracy must stay extremely high for code tokens (99.9% token accuracy is promising, but cumulative error over long sequences is a risk).
      • CI gates and tests remain the primary safety net; prompt/completion lengths tuned to minimize error propagation.
  • Sector: Healthcare (Clinical Documentation)
    • Local CALM deployments for PHI data, coupled with AE decoding on-prem.
    • Assumptions/dependencies:
      • Privacy constraints favor on-prem inference; model governance accommodates likelihood-free sampling and evaluation.
      • Clinical validation for accuracy on medical language.
  • Sector: Education
    • CALM-based tutoring bots configured for chunked generation, with configurable K to optimize cost and quality.
    • Assumptions/dependencies:
      • Reliability for educational correctness; alignment with safety policies; integration with LMS platforms.
  • Sector: Robotics and IoT
    • Command-only CALM models with small generative heads; short chunk sizes for predictable latency.
    • Assumptions/dependencies:
      • Robustness to noisy latent predictions; bounded output lengths; real-time constraints.
  • Sector: Research & Benchmarking (Academia and R&D Labs)
    • BrierLM evaluation toolkit that plugs into training loops; “Brier-n” metrics for sequence-level assessment.
    • Assumptions/dependencies:
      • Community acceptance of BrierLM as a principled, strictly proper alternative to perplexity; reproducible correlation with cross-entropy across tasks.
  • Sector: Model Safety, Compliance, and Content Moderation
    • “Likelihood-free temperature controller” library with batch approximations for production use.
    • Assumptions/dependencies:
      • Expected sampling cost is manageable at target temperatures; approximate (batch) variants may be required for low T.
      • Safety tooling may need to adapt from logit-level to sampler-level controls.

Long-Term Applications

These applications require further research, scaling, or ecosystem development before broad deployment.

  • Sector: Foundation Models and Model Scaling
    • Next-gen continuous LMs with robust, context-aware autoencoders; specialized training regimes for long K.
    • Assumptions/dependencies:
      • Continued advances in robust latent manifolds; avoidance of brittleness/posterior collapse at higher K; sustained reconstruction fidelity.
  • Sector: Specialized Hardware and Systems
    • ASICs/accelerators tuned for residual MLP stacks and AE operations; memory layouts optimized for chunked sequences.
    • Assumptions/dependencies:
      • Sufficient industry adoption to justify silicon; standardized “vector-chunk” interfaces.
  • Sector: Multimodal AI (Text, Audio, Vision)
    • Multimodal autoencoders mapping short spans of different modalities into composable vectors; single-step energy heads for mixed outputs.
    • Assumptions/dependencies:
      • High-fidelity multimodal reconstruction; robust cross-modal latent alignment; new training objectives extending energy score to multimodal distances.
  • Sector: Privacy-Preserving and Federated AI
    • Federated CALM pipelines where clients produce continuous vectors; servers decode and moderate.
    • Assumptions/dependencies:
      • Formal privacy guarantees for latent spaces; secure protocols; acceptable latency and fidelity under noisy channels.
  • Sector: Standardization and Interoperability
    • Open standards for AE formats, vector dimensionalities, and decoding semantics; inter-model adapters.
    • Assumptions/dependencies:
      • Community consensus; compatibility across tokenizers and vocabularies; governance by standards bodies.
  • Sector: Safety, Alignment, and RLHF for Continuous LMs
    • RLHF pipelines leveraging strictly proper scoring rules; latent-space safety classifiers and filters.
    • Assumptions/dependencies:
      • Reliable mapping from latent constraints to token-level behaviors; human feedback protocols adapted to chunk-level evaluation.
  • Sector: Sustainability and Policy
    • “Green AI” audits reporting energy savings from CALM-style architectures; standardized metrics (e.g., BrierLM + energy-per-output).
    • Assumptions/dependencies:
      • Trusted measurement frameworks; cross-vendor verification; policy-maker acceptance of likelihood-free metrics.
  • Sector: Serverless and Cost-Aware AI Platforms
    • CALM-aware orchestration with per-chunk metering; adaptive K tuning per workload.
    • Assumptions/dependencies:
      • Mature ops tooling for continuous LMs; transparent quality-cost trade-offs across K values.
  • Sector: Data Compression and Transport (Research)
    • AE variants with discrete latents or learned quantization; error-correction schemes for noisy channels.
    • Assumptions/dependencies:
      • Competitive compression ratios versus standard codecs; acceptable reconstruction fidelity; efficient integer-based latents.
  • Sector: Evaluation Ecosystem
    • Community benchmarks where BrierLM is primary or complementary to perplexity; public leaderboards supporting implicit models.
    • Assumptions/dependencies:
      • Methodological consensus; extensions for domain-specific scoring; tooling to ensure reproducibility and comparability.

In summary, CALM introduces a practical new axis for scaling LLM efficiency—semantic bandwidth per step—alongside a toolkit (robust autoencoder, energy-based single-step head, BrierLM, and likelihood-free temperature sampling) that can be adopted immediately in constrained settings and extended for transformative gains over the longer term.

Glossary

  • Argmax: An operation that returns the index of the maximum value, used to select the most likely tokens from logits. "Finally, the tokens are reconstructed by applying an argmax operation over these logits."
  • Autoencoder: A neural model that compresses inputs into a latent vector and reconstructs them, here mapping K tokens to a continuous vector. "The foundational component of our CALM framework is an autoencoder tasked with learning a bijective mapping between a chunk of K discrete tokens and a continuous vector."
  • Autoregressive: A generative modeling approach where each output is conditioned on previous outputs, producing sequences step by step. "an autoregressive generation process that operates on a sequence of discrete tokens."
  • Bernoulli Factory: A method to simulate a coin with probability f(p) using samples from a coin with unknown probability p; used to realize fractional temperature exponents. "we draw upon the theory of Bernoulli Factory \citep{10.1145/175007.175019,MENDO20194366} to construct an iterative procedure that simulates a biased coin flip with a success probability of P(x)^α."
  • Brier score: A strictly proper scoring rule that evaluates probabilistic predictions by balancing accuracy and uncertainty. "the Brier score is defined as:"
  • BrierLM: A likelihood-free language modeling metric built from Brier scores over n-grams, used to evaluate models without explicit likelihoods. "We address this by proposing BrierLM, a novel metric for language modeling based on the Brier score \citep{brier1950verification}."
  • CBOW (Continuous Bag-of-Words): A representation-learning method that predicts a token from its context; used here as an analogy for token masking during training. "Analogous to the Continuous Bag-of-Words (CBOW) method \citep{mikolov2013efficientestimationwordrepresentations},"
  • Cross-entropy loss: A standard supervised learning objective that matches predicted distributions to targets; used for autoencoder reconstruction. "optimizing the standard cross-entropy loss across all K token positions:"
  • Diffusion: An iterative generative modeling framework that refines noise into samples through many steps, often computationally heavy. "options like Diffusion \citep{NEURIPS2020_4c5bcfec,li2024autoregressive} or Flow Matching \citep{lipman2023flow} rely on an iterative sampling process,"
  • Energy loss: A likelihood-free training objective based on the Energy Score, estimated via Monte Carlo sample distances. "we can construct an unbiased Monte Carlo estimator to serve as a practical loss function, which we term the energy loss."
  • Energy Score: A strictly proper scoring rule for probabilistic forecasts that compares sample distances to the observation. "We build our training objective using the Energy Score \citep{energy}, a strictly proper scoring rule"
  • Energy Transformer: A model architecture enabling single-step generation of continuous vectors via energy-based objectives. "our framework therefore specifically adopts the Energy Transformer \citep{shao2025continuous}, a recent architecture designed for efficient, single-step generation of continuous vectors,"
  • Evidence Lower Bound (ELBO): A variational objective that lower-bounds log-likelihood; often used to evaluate implicit generative models. "often relying on the complex and sometimes loose estimation of variational lower bounds (ELBOs) to approximate Perplexity."
  • Flow Matching: A generative technique that transports distributions via learned vector fields, typically requiring iterative sampling. "options like Diffusion \citep{NEURIPS2020_4c5bcfec,li2024autoregressive} or Flow Matching \citep{lipman2023flow} rely on an iterative sampling process,"
  • Implicit generative model: A model defined by its sampling procedure rather than explicit probability densities. "The CALM framework operates as an implicit generative model, whose predictive probability distribution is defined by its sampling process."
  • KL clipping: A regularization strategy that clips per-dimension KL loss to prevent collapse of latent variables. "we adopt the KL clipping strategy from \citet{NIPS2016_ddeebdee},"
  • KL divergence: A measure of divergence between distributions, used to regularize the latent posterior toward a prior. "a KL divergence loss that penalizes the deviation of the encoded distribution from a standard normal prior,"
  • Latent manifold: The geometric structure of the latent space; smoothness improves robustness to perturbations. "Lacking any incentive to form a smooth latent manifold, the encoder learns to pack information with maximum efficiency,"
  • Latent vector: A continuous representation encoding information about a token chunk, produced by the encoder. "the latent vector z effectively perturbs the predicted mean μ with a substantial Gaussian noise σ ≈ 0.3 I."
  • Likelihood-free framework: Methods for training and evaluation that do not rely on explicit likelihoods. "we develop a comprehensive, likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain."
  • Perplexity: A conventional language-model metric derived from likelihood, unsuitable when likelihoods are intractable. "The absence of explicit likelihoods makes traditional metrics like Perplexity inapplicable."
  • Posterior collapse: A failure mode of VAEs where latent variables ignore inputs and match the prior, harming reconstruction and downstream learning. "A significant challenge in training VAEs is posterior collapse."
  • Rejection sampling: A technique that accepts proposed samples with a probability to realize a target distribution. "developing an exact algorithm, grounded in the principles of rejection sampling, that provably achieves this goal."
  • Strictly proper scoring rule: A scoring function maximized only when the predicted distribution equals the true data distribution. "a scoring rule is strictly proper if equality holds only when P = Q"
  • SwiGLU: A gated activation function variant that improves Transformer blocks’ expressivity and training. "This is followed by a SwiGLU layer \citep{shazeer2020gluvariantsimprovetransformer} with an intermediate dimension of d."
  • Teacher-forcing: An evaluation/training regime where the model is conditioned on ground-truth previous tokens. "A straightforward approach is to assess next-token prediction performance in a teacher-forcing setting."
  • Tied input embedding matrix: Sharing weights between input embeddings and output projection to vocabulary logits. "followed by a projection to vocabulary logits using the tied input embedding matrix."
  • Transformer backbone: The primary sequence model that produces hidden states conditioning the generative head. "We use a standard Transformer backbone, with modifications focused on the output-side generative head and the input-side adaptation."
  • Variational Autoencoder (VAE): A stochastic autoencoder that samples latent variables from a learned posterior regularized by a prior. "A significant challenge in training VAEs is posterior collapse."

Open Problems

We found no open problems mentioned in this paper.
