Regression Language Model (RLM) Overview
- Regression Language Models (RLMs) are neural sequence models that predict continuous values from varied inputs using digit-wise tokenization and transformer architectures.
- They unify generative and predictive tasks, reducing the need for hand-engineered features while scaling across multimodal and multitask applications.
- RLMs incorporate quantile regression and uncertainty quantification to provide robust, calibrated predictions for scientific and engineering challenges.
A Regression Language Model (RLM) is a neural sequence model, frequently Transformer-based, that is structured, trained, or deployed to predict numerical or continuous-valued outputs, either directly as regression targets or as parameters of a predicted distribution, from rich input sequences such as language, code, molecular structures, images, or other modalities. RLMs reformulate regression as a conditional sequence modeling problem, unifying generative and predictive modeling for scientific, engineering, and analytical applications. The paradigm encompasses direct digit-by-digit sequence regression, embedding-based regression, quantile regression, and multimodal regression strategies, enabling joint modeling of both continuous and discrete variables with strong scalability and multitask capabilities.
1. Fundamental Principles and Motivation
Traditional regression models rely on structured, hand-engineered features or domain-specific representations to map inputs to real-valued responses. In contrast, RLMs operationalize regression by leveraging the expressivity and contextual reasoning skills of LLMs or sequence models. This shift is motivated by several factors:
- Inductive Bias for Continuous Properties: Many scientific and engineering tasks—molecular property prediction, protein engineering, hardware/resource prediction, code profiling—require inductive biases that natively incorporate both symbolic sequence context and continuous variables (Born et al., 2022, Akhauri et al., 30 Sep 2025).
- Unified Predictive and Generative Capacity: RLMs blur the traditional boundaries between regression (predicting properties) and generation (synthesizing new sequences with target properties) by casting both as conditional sequence completion tasks (Born et al., 2022).
- Reduction of Domain-Specific Feature Engineering: By exploiting pretrained sequence models and numerical tokenization, RLMs allow direct use of natural feature spaces (text, code, sequence, images) and capture long-range dependencies and high-dimensional structure (Akhauri et al., 26 Jun 2025).
- Scalability Across Modalities and Tasks: RLM architectures and loss formulations generalize across a diverse set of input types, including language, code, tabular, and multimodal inputs (Nguyen et al., 14 Oct 2024, Jennings et al., 20 Jul 2025).
2. Core Architectural and Training Concepts
Sequence-to-Sequence Regression: Many state-of-the-art RLMs employ an encoder–decoder framework wherein input sequences are embedded and processed, and numeric outputs are predicted as sequences of digit tokens. This digit-wise tokenization approach supports unconditional and conditional generation, broad value ranges, and compositional multi-metric prediction. Numeric values are formatted with explicit tokens for sign, exponent, and mantissa, enabling normalization-free optimization (Born et al., 2022, Akhauri et al., 30 Sep 2025, Akhauri et al., 26 Jun 2025).
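As a concrete illustration, the following is a minimal sketch of sign/exponent/mantissa digit tokenization; the token vocabulary (`<+>`, `<e-3>`, `<d7>`, ...) and the fixed mantissa length are illustrative choices, not the scheme of any particular paper.

```python
import math

def tokenize_number(x: float, mantissa_digits: int = 4) -> list[str]:
    """Render a float as sign/exponent/mantissa digit tokens (illustrative scheme)."""
    sign = "<+>" if x >= 0 else "<->"
    x = abs(x)
    if x == 0.0:
        return [sign, "<e0>"] + ["<d0>"] * mantissa_digits
    exp = math.floor(math.log10(x))
    mantissa = x / 10 ** exp                                    # in [1, 10)
    digits = f"{mantissa:.{mantissa_digits - 1}f}".replace(".", "")[:mantissa_digits]
    return [sign, f"<e{exp}>"] + [f"<d{d}>" for d in digits]    # rounding carry ignored for brevity

def detokenize_number(tokens: list[str]) -> float:
    """Invert the scheme above."""
    sign = 1.0 if tokens[0] == "<+>" else -1.0
    exp = int(tokens[1][2:-1])
    digits = "".join(t[2] for t in tokens[2:])
    return sign * (int(digits) / 10 ** (len(digits) - 1)) * 10 ** exp

tokens = tokenize_number(0.00314)
print(tokens)                        # ['<+>', '<e-3>', '<d3>', '<d1>', '<d4>', '<d0>']
print(detokenize_number(tokens))     # 0.00314 (up to float rounding)
```

Because each digit occupies its own token, the cross-entropy loss acts per digit, and values spanning many orders of magnitude can be represented without normalizing the regression targets.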
Alternating and Multi-task Objectives: The multitask nature of RLMs is enabled by alternating training schemes and carefully designed objective functions (a minimal sketch follows this list):
- Property Prediction (Regression): Numeric tokens are masked and the model is trained to reconstruct them given the remaining context, formalized as maximizing $\log p_\theta(x_{\text{num}} \mid x_{\text{seq}})$ over the masked numeric tokens.
- Conditional Generation: Given the numeric (property) tokens, text/sequence tokens are masked and the model completes them, formalized as maximizing $\log p_\theta(x_{\text{seq}} \mid x_{\text{num}})$.
- Self-Consistency Loss: The generated sequence is re-evaluated to verify that its decoded property matches the property value used for conditioning.
- Sequence Tokenization and Embedding: Numerical data are decomposed into digit tokens, potentially with custom numerical encodings $\mathrm{NE}(v, p)_j$ that reflect semantic proximity and decimal position, where $v$ is the digit value, $p$ the decimal-place exponent, and $j$ the embedding dimension index (Born et al., 2022).
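A minimal sketch of the alternating objective described above, assuming a hypothetical `model.loss` API that computes cross-entropy only over masked positions; the masking helper and token bookkeeping are illustrative.

```python
PROPERTY, SEQUENCE = "property", "sequence"

def mask_tokens(tokens, is_numeric, mask_numeric, mask_token="<mask>"):
    """Mask either the numeric span (property prediction) or the non-numeric span
    (conditional generation); the model learns to reconstruct whatever is masked."""
    return [mask_token if is_num == mask_numeric else tok
            for tok, is_num in zip(tokens, is_numeric)]

def training_step(model, batch, step):
    # Alternate objectives across steps: even steps regress the masked numeric tokens
    # given the sequence; odd steps complete the sequence given the numeric tokens.
    objective = PROPERTY if step % 2 == 0 else SEQUENCE
    inputs, labels = [], []
    for tokens, is_numeric in batch:
        inputs.append(mask_tokens(tokens, is_numeric, mask_numeric=(objective == PROPERTY)))
        labels.append(tokens)
    return model.loss(inputs, labels)  # assumed API: cross-entropy over masked positions
```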
Embedding-based Regression: Pretrained LLM embeddings over string (text, JSON, code) inputs serve as feature vectors for downstream regression models (e.g., MLPs) (Tang et al., 22 Nov 2024, Nguyen et al., 14 Oct 2024). Such embeddings have been shown to maintain Lipschitz continuity with respect to input changes, which supports robust regression in high-dimensional spaces.
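A minimal embed-then-regress sketch, assuming `sentence-transformers` as the frozen embedding backbone and a scikit-learn MLP head; the model name, toy inputs, and targets are placeholders.

```python
from sentence_transformers import SentenceTransformer  # assumed embedding backbone
from sklearn.neural_network import MLPRegressor

texts = ["config: batch=32, lr=3e-4", "config: batch=128, lr=1e-3", "config: batch=64, lr=1e-4"]
targets = [0.71, 0.65, 0.69]                     # toy regression targets (e.g., accuracy)

# Frozen LLM embeddings act as fixed feature vectors for a lightweight regression head.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)                        # shape: (n_samples, embedding_dim)

head = MLPRegressor(hidden_layer_sizes=(256, 64), max_iter=2000, random_state=0)
head.fit(X, targets)
print(head.predict(encoder.encode(["config: batch=64, lr=5e-4"])))
```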
Quantile Regression Heads: RLMs may be extended to probabilistic regression by equipping the output layer with quantile regression heads, producing a vector of quantile estimates $(\hat{y}_{\tau_1}, \dots, \hat{y}_{\tau_K})$ for levels $0 < \tau_1 < \dots < \tau_K < 1$ (Vedula et al., 7 Jun 2025). The pinball (quantile) loss supports learning the full conditional distribution of the target variable.
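A minimal sketch of the pinball (quantile) loss over K quantile heads in PyTorch; the head layout and toy values are illustrative.

```python
import torch

def pinball_loss(preds, target, quantiles):
    """preds: (batch, K) quantile estimates; target: (batch,); quantiles: K levels in (0, 1).
    For each level tau the loss is max(tau * e, (tau - 1) * e) with e = y - y_hat,
    which is minimized when y_hat is the true tau-quantile of y."""
    taus = torch.as_tensor(quantiles, dtype=preds.dtype, device=preds.device)  # (K,)
    error = target.unsqueeze(-1) - preds                                       # (batch, K)
    return torch.maximum(taus * error, (taus - 1.0) * error).mean()

# Toy usage with three quantile heads at tau = 0.1, 0.5, 0.9.
preds = torch.tensor([[0.8, 1.0, 1.3], [2.1, 2.4, 2.9]])
target = torch.tensor([1.1, 2.2])
print(pinball_loss(preds, target, [0.1, 0.5, 0.9]))
```

Averaging the loss across well-spread levels encourages the heads to trace out the conditional distribution rather than collapse to a single point estimate.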
Uncertainty Quantification and Probabilistic Outputs: RLMs can be constructed to estimate both heteroscedastic (input-dependent) uncertainty and epistemic uncertainty (via model ensembling and Bayesian inference), outputting parameterized distributions (e.g., Gaussian) for robust prediction under label noise or ambiguous inputs (Hasan et al., 5 Aug 2025).
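A minimal heteroscedastic-head sketch in PyTorch, assuming a pooled hidden state as input; predicting a log-variance alongside the mean yields a Gaussian output distribution, and epistemic uncertainty can then be added by ensembling several such heads or models.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Map a pooled sequence representation to (mean, log-variance) so the model
    can express input-dependent (heteroscedastic) uncertainty."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, 1)
        self.log_var = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return self.mean(h).squeeze(-1), self.log_var(h).squeeze(-1)

def gaussian_nll(mean, log_var, target):
    # Negative log-likelihood of target under N(mean, exp(log_var)); the explicit
    # log-variance term discourages degenerate, over-confident variance estimates.
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

head = GaussianHead(hidden_dim=16)
h = torch.randn(4, 16)                     # stand-in for pooled LLM hidden states
mean, log_var = head(h)
print(gaussian_nll(mean, log_var, torch.zeros(4)))
```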
3. Performance Metrics and Empirical Evaluation
RLM performance is quantified using both regression accuracy and distributional calibration metrics, with evaluation tailored to task and domain:
Metric | Description | Application Domains |
---|---|---|
RMSE, MAE, MAPE, WAPE | Standard error metrics for regression accuracy | Molecules, code, systems (Born et al., 2022, Akhauri et al., 30 Sep 2025, Akhauri et al., 26 Jun 2025) |
Pearson / Spearman / Kendall | Correlation/rank correlation of predictions | Sequence, code, ranking (Akhauri et al., 30 Sep 2025, Born et al., 2022) |
Calibration Error (CE) | Quantifies probabilistic prediction quality | Distributional regression (Vedula et al., 7 Jun 2025, Hasan et al., 5 Aug 2025) |
CRPSS, RCIW | Evaluate cumulative distribution calibration | Price prediction, risk (Vedula et al., 7 Jun 2025) |
Explained NLL/R² | Likelihood-ratio pseudo-R² for densities | Systems, density estimation (Akhauri et al., 26 Jun 2025) |
0-Variance, Perplexity | Influence of conditioning in generation | Property-driven molecular design (Born et al., 2022) |
Empirical studies demonstrate that digit-wise, sequence-to-sequence RLMs maintain stability over wide value ranges and outperform MSE-based baselines on code-to-metric prediction, molecular property regression, and large systems resource prediction (Born et al., 2022, Akhauri et al., 30 Sep 2025, Akhauri et al., 26 Jun 2025). Embedding-based methods are particularly robust in high-dimensional settings, where traditional feature engineering fails (Tang et al., 22 Nov 2024, Nguyen et al., 14 Oct 2024). Quantile regression RLMs achieve superior calibration relative to point-estimate and embedding-based models, with strong gains on price prediction tasks (Vedula et al., 7 Jun 2025). In multimodal regression (e.g., images), regression via transformer-based classification with flexible binning and semantic prompts establishes new state-of-the-art results (Jennings et al., 20 Jul 2025).
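For concreteness, a short snippet computing a few of the tabled metrics (RMSE, Spearman rank correlation, and a simple quantile-coverage calibration check) on toy arrays; the data and the 0.9-quantile predictions are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

y_true = np.array([1.2, 0.4, 3.1, 2.2, 0.9])
y_pred = np.array([1.0, 0.6, 2.8, 2.5, 1.1])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
rho, _ = spearmanr(y_true, y_pred)

# Simple calibration check for a quantile model: the empirical coverage of the
# predicted tau-quantile should be close to tau (here tau = 0.9).
q90_pred = y_pred + 0.5                    # stand-in for predicted 0.9-quantiles
coverage_gap = abs(np.mean(y_true <= q90_pred) - 0.9)

print(f"RMSE={rmse:.3f}  Spearman={rho:.3f}  |coverage - 0.9|={coverage_gap:.3f}")
```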
4. Multimodal and Multitask Extensions
RLMs naturally generalize to complex settings involving:
- Multimodal Inputs: By pairing specialized tokenizers and encoders (e.g., vision encoder + textual decoder), RLMs extend to image-based regression (aesthetics, quality assessment), code-metric, and trajectory prediction via LLM tokenization and generation (Lukasik et al., 7 Mar 2024, Jennings et al., 20 Jul 2025, Akhauri et al., 30 Sep 2025).
- Conditional and Joint Modeling: Jointly encode sequence structure, discrete labels, and continuous values (e.g., molecule string + property, code + memory/latency, configuration/log data + resource usage) to enable both prediction and controlled generation (Born et al., 2022, Akhauri et al., 26 Jun 2025); see the sketch after this list.
- Prompt-Based Personalization: Tasks such as recommendation and user-personalized regression deploy prompt-based frameworks with gradient-based automated prompt search (potentially adding user-specific trigger tokens) without modifying the underlying LLM parameters (Li et al., 1 Feb 2024).
- Meta-task Control: Using data-specific or semantically meaningful prompts, MLLMs/RLMs can “gate” regression heads or tasks, unifying multiple regression objectives under a common language interface (Jennings et al., 20 Jul 2025).
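A minimal sketch of serializing properties and a sequence into one joint training string; the property-token format is illustrative (loosely in the spirit of property-prefixed schemes such as the Regression Transformer), not an exact reproduction of any tokenizer.

```python
def to_joint_sequence(properties, sequence):
    """Serialize continuous properties and a symbolic sequence into one training string;
    the <name>value| format is illustrative."""
    prop_part = "|".join(f"<{name}>{value:.3f}" for name, value in properties.items())
    return f"{prop_part}|{sequence}"

# Property prediction: mask the numeric spans and let the model reconstruct them.
# Conditional generation: fix the numeric spans (e.g., <qed>0.900) and complete the sequence.
print(to_joint_sequence({"qed": 0.873, "logp": 2.41}, "CC(=O)OC1=CC=CC=C1C(=O)O"))
# <qed>0.873|<logp>2.410|CC(=O)OC1=CC=CC=C1C(=O)O
```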
5. Challenges, Limitations, and Empirical Insights
Key considerations in deploying and scaling RLMs include:
- Numerical Tokenization: It is crucial to design tokenization and embedding strategies that maintain both the precision and sequential semantics of numbers, as irregular tokenization can degrade performance in regression tasks (Lukasik et al., 7 Mar 2024).
- Model Capacity and Rank Limitations: For models targeting distributions over strings (e.g., regular language models, where the abbreviation RLM is also used in the formal-language sense), theoretical and empirical evidence shows that model capacity (hidden-state size) must match or exceed the rank of the induced automaton for exact learnability (Borenstein et al., 6 Jun 2024).
- Calibration and Uncertainty: High-performing RLMs support both distributional and point estimation goals, with specialized heads and loss components penalizing degenerate uncertainty (e.g., overconfident or underconfident predictions) (Hasan et al., 5 Aug 2025).
- Scaling and Proxy-based Regression: Automated data-mixture selection for large-scale pretraining can be framed as a regression from domain proportions to held-out performance, enabling resource-efficient optimization under the rank-invariance hypothesis, i.e., mixture rankings estimated with small proxy models are assumed to carry over to larger scales (Liu et al., 1 Jul 2024). Domain interactions can be complex and non-monotonic, requiring regression-based rather than human-curated strategies for optimality; a minimal sketch follows this list.
- Adapting to New Tasks and Modalities: Encoder–decoder architectures and flexible input tokenizations enable rapid adaptation via few-shot learning, cross-modality, and transfer learning, unlike specialized tabular regression (Akhauri et al., 26 Jun 2025, Nguyen et al., 14 Oct 2024).
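A minimal sketch of proxy-based mixture regression: fit a regressor from domain proportions to proxy-model loss, then search candidate mixtures. The synthetic data, the linear response, and the Ridge regressor are stand-ins; a stronger non-linear regressor can be swapped in when domain interactions are non-monotonic.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy proxy runs: each row is a domain-mixture proportion vector (sums to 1);
# each target is the held-out loss of a small proxy model trained on that mixture.
mixtures = rng.dirichlet(alpha=np.ones(4), size=32)            # 32 runs over 4 domains
proxy_loss = mixtures @ np.array([0.9, 1.4, 1.1, 1.6]) + 0.05 * rng.normal(size=32)

# Fit a regressor from proportions to loss, then search many candidate mixtures.
reg = Ridge(alpha=1e-3).fit(mixtures, proxy_loss)
candidates = rng.dirichlet(alpha=np.ones(4), size=100_000)
best = candidates[np.argmin(reg.predict(candidates))]
print("predicted-best mixture:", np.round(best, 3))
```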
6. Applications and Future Directions
RLMs are deployed across scientific discovery, engineering, and platform optimization:
- Molecular and Protein Design: Both regression and conditional generation of chemical or protein sequences for desired properties, leveraging compositional and inductive bias (Born et al., 2022).
- Hardware/Software Performance Modeling: Code-to-metric regression for memory profiling, latency estimation, and neural architecture search, supporting both high-level languages and graph representations (e.g., ONNX) (Akhauri et al., 30 Sep 2025).
- Resource and System Outcome Prediction: Large-scale performance prediction from system logs/configurations, with substantial improvements over tabular methods in industrial settings (Akhauri et al., 26 Jun 2025).
- Personalized and Multitask Regression: Automated prompt optimization for recommendation and personalized regression tasks (Li et al., 1 Feb 2024).
- Multimodal Quality and Aesthetics Assessment: Image-based regression, enhanced by semantic textual prompts, and quality measures (Jennings et al., 20 Jul 2025).
- Symbolic Regression with Domain Constraints: LLM-driven symbolic equation discovery in physics, coupled with dimensional analysis to enforce physical consistency (Zhu et al., 17 Jun 2024).
- Probabilistic and Quantile Regression: Predictive distribution estimation, including quantile regression heads for uncertainty-aware tasks (Vedula et al., 7 Jun 2025, Hasan et al., 5 Aug 2025).
Further directions include fully universal in-context RLMs across arbitrary domains and modalities, formal learning-theoretic analysis of sequence-to-regression learnability, continual/dynamic regression task specification via natural language, and integrated frameworks unifying point estimates, distributions, and uncertainty quantification within a single model.
7. Foundational Codebases and Open-Source Resources
The proliferation and reproducibility of RLMs are supported by open-source resources and toolkits:
Resource | Description | Link |
---|---|---|
Regression Transformer (RT) | XLNet-based multitask RLM for biochemistry | github.com/IBM/regression-transformer (Born et al., 2022) |
LMTraj | Language-based trajectory prediction | github.com/inhwanbae/LMTrajectory (Lukasik et al., 7 Mar 2024) |
RegMix | Data mixture regression pretraining | github.com/sail-sg/regmix (Liu et al., 1 Jul 2024) |
PAP-REC | Automatic prompt search for RLMs | github.com/rutgerswiselab/PAP-REC (Li et al., 1 Feb 2024) |
Price Quantile Regression | LLM quantile regression for prices | github.com/vnik18/LLM-price-quantile-reg/ (Vedula et al., 7 Jun 2025) |
Embed-Then-Regress | LLM embeddings for Bayesian Optimization | github.com/google-research/optformer/tree/main/optformer/embed_then_regress (Nguyen et al., 14 Oct 2024) |
The convergence of advances in sequence modeling, numerical tokenization, multitask training, and open resources has positioned Regression Language Models as increasingly standard tools for continuous-valued prediction, generation, and calibration across scientific and engineering domains.