Text-to-Text Regression
- Text-to-Text Regression is a class of methods that maps text inputs to continuous outcomes, treating the input, and in some approaches also the output, as textual data in order to handle high-dimensional linguistic features.
- This approach is applied across diverse fields for tasks such as sentiment analysis, system performance prediction, and semantic similarity assessment.
- Frameworks like inverse regression, decoding-based models, and prompt-based methods are used, addressing the challenges of high-dimensional text data through tailored estimation and loss functions.
Text-to-text regression is a class of predictive modeling methods where both the input and the output are represented as text, with the objective of learning a mapping from high-dimensional textual data to continuous outcomes or metrics. This paradigm extends classical regression to domains where textual documents serve as predictors and, in some approaches, where the target is also expressed or interpreted as text, such as real numbers tokenized as sequences or semantic similarity scores. It encompasses a diverse set of frameworks arising from statistics, machine learning, and natural language processing, designed to address challenges inherent in high-dimensional, unstructured, or non-tabular data sources.
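To make the input/output convention concrete, the following minimal sketch shows a single (input, target) pair as a text-to-text regressor would consume it; the field names, values, and formatting are illustrative assumptions rather than the format of any cited system.

```python
# One training example for a text-to-text regressor: the predictors are a
# free-form string and the target metric is likewise serialized as a string,
# to be emitted token-by-token by a sequence model. All names and values
# below are invented for illustration.
x_text = "job: training-run-17; gpus: 8; batch_size: 256; precision: bf16; region: us-central"
y_text = "0.7342"                      # e.g. a measured utilization, written as text

# The numeric value is recovered simply by parsing the generated string.
y_value = float(y_text)
assert abs(y_value - 0.7342) < 1e-9
```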
1. Frameworks and Problem Formulations
A variety of frameworks have been developed to perform regression using text data, each tailored to overcome specific statistical and computational barriers:
- Inverse Regression and Sufficient Reduction: Multinomial Inverse Regression (MNIR) models the conditional distribution of token counts given a response variable (e.g., sentiment), enabling estimation of sufficient low-dimensional representations rich in sentiment or outcome information (1012.2098). This "sentiment-sufficient dimension reduction" allows recovery of key summary statistics from high-dimensional text, optimizing for predictive sufficiency relative to a continuous outcome.
- Direct Text Encoding Universal Regressors: The OmniPred paradigm treats regression as a mapping from parameter and context descriptions, fully encoded as text, to outcome metrics similarly tokenized as text strings, thereby learning over arbitrary (heterogeneous or hierarchical) input spaces (2402.14547).
- Decoding-based Regression: Causal sequence models are trained to predict numeric outcomes by auto-regressively generating the tokens of a number (for example, as base-10 digits), relying purely on cross-entropy loss over output strings rather than numeric losses or point estimates. This turns regression into a text generation problem, establishing theoretical parity with traditional regression while allowing the model to capture arbitrary output densities (2501.19383); see the sketch after this list.
- Classification-to-Regression via Span Extraction: Regression tasks are reformulated as span selection over a list of candidate numeric outputs appended to a text prompt, allowing unified span-extractive architectures originally designed for question answering to handle regression (1904.09286).
- Sentence-level and Semantic Regression: Sentence Semantic Regression (SSR) separates idea planning (semantic regression at the sentence level) from surface realization (generation of textual output), with the sentence vector space serving as the regression target and token sequence as the realized output (2108.02984).
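The following minimal sketch illustrates the decoding-based formulation referenced above: a real-valued target is serialized into a short token sequence and recovered by parsing the generated tokens. The specific scheme (sign, four base-10 mantissa digits, signed exponent) and the function names are illustrative assumptions, not the tokenization used in the cited papers.

```python
import math

def encode_number(y: float, mantissa_digits: int = 4) -> list[str]:
    """Serialize y as [sign, d1..dk, 'E', signed exponent] tokens (illustrative scheme)."""
    sign = "+" if y >= 0 else "-"
    y = abs(y)
    exp = math.floor(math.log10(y)) if y > 0 else 0
    mantissa = y / (10 ** exp) if y > 0 else 0.0                 # in [1, 10)
    digits = f"{mantissa:.{mantissa_digits - 1}f}".replace(".", "")
    return [sign] + list(digits[:mantissa_digits]) + ["E", f"{exp:+d}"]

def decode_number(tokens: list[str]) -> float:
    """Invert encode_number on well-formed token sequences."""
    sign = 1.0 if tokens[0] == "+" else -1.0
    e = tokens.index("E")
    digits, exp = tokens[1:e], int(tokens[e + 1])
    mantissa = int("".join(digits)) / 10 ** (len(digits) - 1)
    return sign * mantissa * 10 ** exp

y = 0.003721
tokens = encode_number(y)          # ['+', '3', '7', '2', '1', 'E', '-3']
assert abs(decode_number(tokens) - y) < 1e-9
# A causal LM is trained with ordinary cross-entropy to emit these tokens after
# the textual input; no numeric loss, target scaling, or normalization is needed.
```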
2. Estimation and Loss Functions
The high dimensionality and sparsity of textual features create estimation challenges that are addressed by tailored penalization and loss-function strategies:
- Sparse Bayesian Estimation (Gamma-Lasso): To efficiently estimate MNIR models with tens of thousands of parameters, a gamma-lasso penalty is introduced. Each regression/loading coefficient receives an independent Laplace prior with a gamma hyperprior on its scale; profiling out the scales yields the non-concave penalty $s\log(1 + |\varphi_{jk}|/r)$, with gamma shape $s$ and rate $r$. This design induces sparsity (many zero coefficients), corrects shrinkage bias for large effects, and supports parameter-specific regularization (1012.2098).
- Regression-tailored Losses for Ordinal Outputs: For tasks where labels are discrete but ordinal (e.g., semantic similarity scores), loss functions with a zero-gradient buffer zone are proposed: the Translated ReLU and Smooth K2 losses leave predictions that fall within a specified margin of the target unpenalized and penalize only those outside the correct interval (a generic sketch follows this list). This aligns the optimization more closely with downstream classification or ordinal regression goals (2406.05326).
- Cross-Entropy over Numeric Tokens: In models such as OmniPred and decoding-based regression, the loss is the cross-entropy between the predicted token sequence and the true numeric-token sequence, bypassing the need to standardize or scale outcome values across heterogeneous tasks (2402.14547, 2501.19383).
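The exact definitions of Translated ReLU and Smooth K2 are not reproduced here; the sketch below implements a generic buffer-zone loss in the same spirit, where the margin value, the quadratic smooth variant, and the function name are illustrative assumptions.

```python
import numpy as np

def buffer_zone_loss(pred: np.ndarray, target: np.ndarray,
                     margin: float = 0.25, smooth: bool = False) -> np.ndarray:
    """Zero loss (and zero gradient) while |pred - target| <= margin; only the
    excess beyond the margin is penalized, linearly (hinge-style) or quadratically."""
    excess = np.maximum(np.abs(pred - target) - margin, 0.0)
    return excess ** 2 if smooth else excess

pred = np.array([3.1, 2.4, 4.9])
target = np.array([3.0, 3.0, 4.0])
print(buffer_zone_loss(pred, target))               # ≈ [0.    0.35   0.65]
print(buffer_zone_loss(pred, target, smooth=True))  # ≈ [0.    0.1225 0.4225]
```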
3. Applications and Empirical Performance
Text-to-text regression frameworks have demonstrated broad applicability and strong performance in diverse real-world scenarios:
- Sentiment and Political Affiliation Prediction: MNIR accurately predicts sentiment or partisanship from speech or review text, outperforming lasso, SVM, principal components, PLS, and topic modelling baselines in both speed and accuracy, with strong interpretability due to sparse, sentiment-associated phrase loadings (1012.2098).
- Readability Assessment and Ordinal Judgments: Prompt-based seq2seq models such as T5 and BART can be fine-tuned to discern the more difficult text in a pair using text-to-text pairwise ranking, supporting high cross-domain generalization and achieving up to 99.6% pairwise accuracy on standard datasets. Input-output prompt formulation has substantial impact on regression/ranking performance (2302.13139).
- Semantic Similarity Regression: Regression-based approaches with tailored loss functions outperform classification-based sentence embedding models on established STS benchmarks, achieving average Spearman scores up to 83.55 across seven tasks with efficiency gains and reduced memory requirements (2406.05326).
- Performance Prediction in Large-Scale Systems: Encoder-decoder regression LLMs (RLMs) operating on raw textual configurations and logs achieve near-perfect rank correlation (up to 0.99) and 100-fold reduction in mean squared error relative to tabular methods on massive compute cluster efficiency prediction, with capacity for rapid few-shot adaptation to new deployment contexts (2506.21718).
- Universal Regression and Surrogate Modeling: Universal regressors trained on multi-task textual data (OmniPred) generalize across diverse tasks and input spaces, outperforming traditional regressors (MLP, tree, GP) especially in low-data regimes, and supporting dynamic input schemas and uncertainty quantification for automation, benchmarking, and Bayesian optimization (2402.14547).
- Test-time Adaptation: Regression-specific test-time adaptation methods such as Significant-Subspace Alignment (SSA) adapt model feature extractors to new text domains by aligning principal subspaces that are informative for prediction, yielding robust gains in settings with domain shift and no labeled target data (2410.03263); a generic sketch of subspace alignment follows this list.
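The following is a generic sketch of principal-subspace alignment as a test-time adaptation signal, not the exact SSA procedure (which additionally selects and weights the subspace dimensions most informative for prediction); the feature shapes and the choice of k are illustrative.

```python
import numpy as np

def subspace_alignment_loss(src_feats: np.ndarray, tgt_feats: np.ndarray, k: int = 8) -> float:
    """Discrepancy between the top-k principal subspaces of source and target
    features: 0 when they coincide, up to k when orthogonal. A test-time
    adaptation step would update the feature extractor to reduce this value."""
    def top_k_basis(x: np.ndarray) -> np.ndarray:
        x = x - x.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(x, full_matrices=False)   # rows of vt = principal directions
        return vt[:k]                                      # shape (k, d)

    u, v = top_k_basis(src_feats), top_k_basis(tgt_feats)
    overlap = np.linalg.norm(u @ v.T) ** 2                 # sum of squared principal-angle cosines
    return float(k - overlap)                              # = 0.5 * ||P_src - P_tgt||_F^2

rng = np.random.default_rng(0)
src = rng.normal(size=(256, 32))
tgt = src + 0.1 * rng.normal(size=(256, 32))               # mildly shifted "target domain"
print(subspace_alignment_loss(src, tgt))                   # small value: subspaces nearly coincide
```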
4. Theoretical Properties and Mathematical Foundations
The text-to-text regression paradigm is underpinned by several key theoretical results:
- Risk Bounds for Decoding-Based Regression: Sequential tokenization of numerics lets the expected mean integrated squared error (MISE) decompose into a squared-bias term governed by the bin width implied by the tokenization granularity and a variance term governed by the number of samples per bin, making decoding-based regression as universal as histogram density estimation but with much higher sample efficiency and expressivity (2501.19383); see the decomposition after this list. For any smooth conditional density, sufficient sequence length (granularity) guarantees arbitrarily accurate approximation.
- Sufficiency and Oracle Properties: The gamma-lasso MAP estimator in MNIR can achieve the strong oracle property: consistent estimation and variable selection under regularity conditions, with robustness to hyperparameter settings and data scaling (1012.2098).
- Neyman Orthogonality and Inference: Average partial effects in high-dimensional lasso logit text regression are estimated using Neyman-orthogonal score functions, allowing valid inference without the "oracle property"—that is, without requiring perfect variable/model selection (1812.09397).
- Simultaneous Inference with Bootstrap: Cluster-robust simultaneous confidence intervals for high-dimensional text regression models are constructed using multiplier cluster bootstrap methods, controlling family-wise error rates even in complex clustered text (e.g., internet forums) (1812.09397).
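The bound from the cited work is not reproduced here; for orientation, the classical MISE decomposition for a histogram density estimator with bin width $h$ and $n$ samples, which the decoding-based analysis parallels, is:

$$
\mathrm{MISE}(\hat f_h) \;=\; \mathbb{E}\!\int \bigl(\hat f_h(y) - f(y)\bigr)^2 \, dy \;\approx\; \underbrace{\frac{1}{nh}}_{\text{variance}} \;+\; \underbrace{\frac{h^{2}}{12}\int f'(y)^{2}\, dy}_{\text{squared bias}} .
$$

With $K$ output tokens in base $B$, the effective bin width shrinks geometrically as $B^{-K}$, so modest sequence lengths already yield fine output resolution.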
5. Model Design Considerations and Limitations
Designing and deploying effective text-to-text regression solutions requires attention to several trade-offs and limitations:
- Input and Output Representation: Textual encoding of parameters, categorical features, and numeric outputs enables high flexibility but places demands on tokenization schemes (e.g., digit-wise, floating point), which must preserve precision and support robust decoding (2402.14547). Subword tokenizers may lose crucial information for numeracy or arithmetic tasks (2109.04672).
- Discretization and Output Precision: Approaches that discretize regression outputs into candidate spans (for span-extractive models) are limited in numeric precision by the bucket granularity, while token-sequence-based approaches can, in principle, support arbitrary resolution (1904.09286, 2501.19383); a small illustration follows this list.
- Sample and Computational Efficiency: Many frameworks (MNIR, regression-based SSR, contrastive regression) are designed to work efficiently even in high dimensions, with tailored optimization (coordinate descent, batch size adaptation) to avoid the prohibitive computation typical of Bayesian or fully dense approaches (1012.2098, 2406.05326).
- Generalization and Extrapolation: Large text-to-text LLMs exhibit strong interpolation performance for numeracy and sequence tasks but often fail on extrapolation, indicating a need for digit-aware tokenization and specialized architectural or training modifications (2109.04672).
- Adaptability and Domain Robustness: Frameworks such as SSA (2410.03263) and RLM (2506.21718) demonstrate the need for adaptation mechanisms—whether subspace alignment or few-shot fine-tuning—to maintain accuracy under domain shift, emergent features, or task definition changes.
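To make the precision trade-off of span-extractive discretization concrete, the sketch below appends a candidate grid to the prompt and snaps the target to the nearest candidate; the prompt wording, grid, and variable names are invented for illustration.

```python
import numpy as np

# Span-extractive reformulation in miniature: candidate numeric outputs are
# appended to the input text and the model selects one of them as a "span",
# so precision is capped by the grid step (here 0.5).
low, high, step = 0.0, 5.0, 0.5
candidates = np.arange(low, high + step, step)          # 0.0, 0.5, ..., 5.0
prompt = ("review: The battery life is outstanding. "
          "rating candidates: " + " ".join(f"{c:g}" for c in candidates))

true_rating = 4.3
chosen = candidates[np.argmin(np.abs(candidates - true_rating))]
print(chosen)                                           # 4.5 -> error bounded by step / 2
```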
6. Broader Implications and Contemporary Relevance
Text-to-text regression unifies several strands of research in language modeling, statistical learning, and surrogate modeling, offering a general-purpose solution for predictive tasks in domains where traditional feature engineering is infeasible. The shift to textual representations and sequence-to-sequence architectures allows models to operate directly on heterogeneous, semi-structured, or evolving data, adapt rapidly to new problems, and serve as universal regressors or simulators for complex systems.
This framework underpins advances in sentiment analysis, system performance modeling, semantic similarity assessment, and transfer learning for prediction across scientific, industrial, and social science applications, supporting both precise estimation and uncertainty quantification. The development of optimized loss functions, scalable estimation schemes, and theoretically justified model heads positions text-to-text regression as a foundational methodology for data-centric, machine learning-driven analysis in the modern era.
Summary Table: Core Techniques in Text-to-Text Regression
| Technique/Framework | Principle | Key Attribute |
|---|---|---|
| Multinomial Inverse Regression (MNIR) (1012.2098) | Inverse multinomial modeling, sentiment-preserving reduction | Sparse gamma-lasso penalty, fast estimation |
| OmniPred (2402.14547) | Textual LM regression over arbitrary input/output | Multi-task, no normalization needed |
| Decoding-based Regression (2501.19383) | Sequential token output, flexible density estimation | Universal, sample-efficient, error correction |
| SSR (Semantic Regression) (2108.02984) | Sentence vector regression for idea planning and text generation | Mixed-granularity realization |
| Prompt-based Readability (2302.13139) | Text-to-text pairwise ranking via seq2seq prompts | Format-dependent generalization |
| SSA Test-time Adaptation (2410.03263) | PCA subspace alignment and dimension weighting | Robust domain adaptation |
| Span Extraction (1904.09286) | Reformulate regression as span selection | Unified model for classification and regression |