Text-to-Text Regression
- Text-to-Text Regression is a class of methods that maps text inputs to continuous outcomes, treating the input, and in some approaches also the output, as textual data in order to handle high-dimensional linguistic features.
- This approach is applied across diverse fields for tasks such as sentiment analysis, system performance prediction, and semantic similarity assessment.
- Frameworks like inverse regression, decoding-based models, and prompt-based methods are used, addressing the challenges of high-dimensional text data through tailored estimation and loss functions.
Text-to-text regression is a class of predictive modeling methods where both the input and the output are represented as text, with the objective of learning a mapping from high-dimensional textual data to continuous outcomes or metrics. This paradigm extends classical regression to domains where textual documents serve as predictors and, in some approaches, where the target is also expressed or interpreted as text, such as real numbers tokenized as sequences or semantic similarity scores. It encompasses a diverse set of frameworks arising from statistics, machine learning, and natural language processing, designed to address challenges inherent in high-dimensional, unstructured, or non-tabular data sources.
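To make the input/output convention concrete, the following minimal sketch shows a single (input, target) pair as a text-to-text regressor would consume it; the field names, values, and formatting are illustrative assumptions rather than the format of any cited system.

```python
# One training example for a text-to-text regressor: the predictors are a
# free-form string and the target metric is likewise serialized as a string,
# to be emitted token-by-token by a sequence model. All names and values
# below are invented for illustration.
x_text = "job: training-run-17; gpus: 8; batch_size: 256; precision: bf16; region: us-central"
y_text = "0.7342"                      # e.g. a measured utilization, written as text

# The numeric value is recovered simply by parsing the generated string.
y_value = float(y_text)
assert abs(y_value - 0.7342) < 1e-9
```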
1. Frameworks and Problem Formulations
A variety of frameworks have been developed to perform regression using text data, each tailored to overcome specific statistical and computational barriers:
- Inverse Regression and Sufficient Reduction: Multinomial Inverse Regression (MNIR) models the conditional distribution of token counts given a response variable (e.g., sentiment), enabling estimation of sufficient low-dimensional representations rich in sentiment or outcome information (1012.2098). This "sentiment-sufficient dimension reduction" allows recovery of key summary statistics from high-dimensional text, optimizing for predictive sufficiency relative to a continuous outcome.
- Direct Text Encoding Universal Regressors: The OmniPred paradigm treats regression as a mapping from parameter and context descriptions, fully encoded as text, to outcome metrics similarly tokenized as text strings, thereby learning over arbitrary (heterogeneous or hierarchical) input spaces (2402.14547).
- Decoding-based Regression: Causal sequence models are trained to predict numeric outcomes by auto-regressively generating the tokens of a number (for example, as base-10 digits), relying purely on cross-entropy loss over output strings rather than numeric losses or point estimates. This turns regression into a text generation problem, establishing theoretical parity with traditional regression while allowing the model to capture arbitrary output densities (2501.19383); see the sketch after this list.
- Classification-to-Regression via Span Extraction: Regression tasks are reformulated as span selection over a list of candidate numeric outputs appended to a text prompt, allowing unified span-extractive architectures originally designed for question answering to handle regression (1904.09286).
- Sentence-level and Semantic Regression: Sentence Semantic Regression (SSR) separates idea planning (semantic regression at the sentence level) from surface realization (generation of textual output), with the sentence vector space serving as the regression target and token sequence as the realized output (2108.02984).
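The following minimal sketch illustrates the decoding-based formulation referenced above: a real-valued target is serialized into a short token sequence and recovered by parsing the generated tokens. The specific scheme (sign, four base-10 mantissa digits, signed exponent) and the function names are illustrative assumptions, not the tokenization used in the cited papers.

```python
import math

def encode_number(y: float, mantissa_digits: int = 4) -> list[str]:
    """Serialize y as [sign, d1..dk, 'E', signed exponent] tokens (illustrative scheme)."""
    sign = "+" if y >= 0 else "-"
    y = abs(y)
    exp = math.floor(math.log10(y)) if y > 0 else 0
    mantissa = y / (10 ** exp) if y > 0 else 0.0                 # in [1, 10)
    digits = f"{mantissa:.{mantissa_digits - 1}f}".replace(".", "")
    return [sign] + list(digits[:mantissa_digits]) + ["E", f"{exp:+d}"]

def decode_number(tokens: list[str]) -> float:
    """Invert encode_number on well-formed token sequences."""
    sign = 1.0 if tokens[0] == "+" else -1.0
    e = tokens.index("E")
    digits, exp = tokens[1:e], int(tokens[e + 1])
    mantissa = int("".join(digits)) / 10 ** (len(digits) - 1)
    return sign * mantissa * 10 ** exp

y = 0.003721
tokens = encode_number(y)          # ['+', '3', '7', '2', '1', 'E', '-3']
assert abs(decode_number(tokens) - y) < 1e-9
# A causal LM is trained with ordinary cross-entropy to emit these tokens after
# the textual input; no numeric loss, target scaling, or normalization is needed.
```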
2. Estimation and Loss Functions
The high dimensionality and sparsity of textual features create estimation challenges that are addressed by tailored penalization and loss-function strategies:
- Sparse Bayesian Estimation (Gamma-Lasso): To efficiently estimate MNIR models with tens of thousands of parameters, a gamma-lasso penalty is introduced. Each regression/loading coefficient receives an independent Laplace prior with a gamma hyperprior on its scale; profiling out the scales yields the non-concave penalty $s\log(1 + |\varphi_{jk}|/r)$, with gamma shape $s$ and rate $r$. This design induces sparsity (many zero coefficients), corrects shrinkage bias for large effects, and supports parameter-specific regularization (1012.2098).
- Regression-tailored Losses for Ordinal Outputs: For tasks where labels are discrete but ordinal (e.g., semantic similarity scores), loss functions with a zero-gradient buffer zone are proposed: the Translated ReLU and Smooth K2 losses leave predictions that fall within a specified margin of the target unpenalized and penalize only those outside the correct interval (a generic sketch follows this list). This aligns the optimization more closely with downstream classification or ordinal regression goals (2406.05326).
- Cross-Entropy over Numeric Tokens: In models such as OmniPred and decoding-based regression, the loss is the cross-entropy between the predicted token sequence and the true numeric-token sequence, bypassing the need to standardize or scale outcome values across heterogeneous tasks (2402.14547, 2501.19383).
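The exact definitions of Translated ReLU and Smooth K2 are not reproduced here; the sketch below implements a generic buffer-zone loss in the same spirit, where the margin value, the quadratic smooth variant, and the function name are illustrative assumptions.

```python
import numpy as np

def buffer_zone_loss(pred: np.ndarray, target: np.ndarray,
                     margin: float = 0.25, smooth: bool = False) -> np.ndarray:
    """Zero loss (and zero gradient) while |pred - target| <= margin; only the
    excess beyond the margin is penalized, linearly (hinge-style) or quadratically."""
    excess = np.maximum(np.abs(pred - target) - margin, 0.0)
    return excess ** 2 if smooth else excess

pred = np.array([3.1, 2.4, 4.9])
target = np.array([3.0, 3.0, 4.0])
print(buffer_zone_loss(pred, target))               # ≈ [0.    0.35   0.65]
print(buffer_zone_loss(pred, target, smooth=True))  # ≈ [0.    0.1225 0.4225]
```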
3. Applications and Empirical Performance
Text-to-text regression frameworks have demonstrated broad applicability and strong performance in diverse real-world scenarios:
- Sentiment and Political Affiliation Prediction: MNIR accurately predicts sentiment or partisanship from speech or review text, outperforming lasso, SVM, principal components, PLS, and topic modelling baselines in both speed and accuracy, with strong interpretability due to sparse, sentiment-associated phrase loadings (1012.2098).
- Readability Assessment and Ordinal Judgments: Prompt-based seq2seq models such as T5 and BART can be fine-tuned to discern the more difficult text in a pair using text-to-text pairwise ranking, supporting high cross-domain generalization and achieving up to 99.6% pairwise accuracy on standard datasets. Input-output prompt formulation has substantial impact on regression/ranking performance (2302.13139).
- Semantic Similarity Regression: Regression-based approaches with tailored loss functions outperform classification-based sentence embedding models on established STS benchmarks, achieving average Spearman scores up to 83.55 across seven tasks with efficiency gains and reduced memory requirements (2406.05326).
- Performance Prediction in Large-Scale Systems: Encoder-decoder regression LLMs (RLMs) operating on raw textual configurations and logs achieve near-perfect rank correlation (up to 0.99) and 100-fold reduction in mean squared error relative to tabular methods on massive compute cluster efficiency prediction, with capacity for rapid few-shot adaptation to new deployment contexts (2506.21718).
- Universal Regression and Surrogate Modeling: Universal regressors trained on multi-task textual data (OmniPred) generalize across diverse tasks and input spaces, outperforming traditional regressors (MLP, tree, GP) especially in low-data regimes, and supporting dynamic input schemas and uncertainty quantification for automation, benchmarking, and Bayesian optimization (2402.14547).
- Test-time Adaptation: Regression-specific test-time adaptation methods such as Significant-Subspace Alignment (SSA) adapt model feature extractors to new text domains by aligning principal subspaces that are informative for prediction, yielding robust gains in settings with domain shift and no labeled target data (2410.03263); a generic sketch of subspace alignment follows this list.
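The following is a generic sketch of principal-subspace alignment as a test-time adaptation signal, not the exact SSA procedure (which additionally selects and weights the subspace dimensions most informative for prediction); the feature shapes and the choice of k are illustrative.

```python
import numpy as np

def subspace_alignment_loss(src_feats: np.ndarray, tgt_feats: np.ndarray, k: int = 8) -> float:
    """Discrepancy between the top-k principal subspaces of source and target
    features: 0 when they coincide, up to k when orthogonal. A test-time
    adaptation step would update the feature extractor to reduce this value."""
    def top_k_basis(x: np.ndarray) -> np.ndarray:
        x = x - x.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(x, full_matrices=False)   # rows of vt = principal directions
        return vt[:k]                                      # shape (k, d)

    u, v = top_k_basis(src_feats), top_k_basis(tgt_feats)
    overlap = np.linalg.norm(u @ v.T) ** 2                 # sum of squared principal-angle cosines
    return float(k - overlap)                              # = 0.5 * ||P_src - P_tgt||_F^2

rng = np.random.default_rng(0)
src = rng.normal(size=(256, 32))
tgt = src + 0.1 * rng.normal(size=(256, 32))               # mildly shifted "target domain"
print(subspace_alignment_loss(src, tgt))                   # small value: subspaces nearly coincide
```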
4. Theoretical Properties and Mathematical Foundations
The text-to-text regression paradigm is underpinned by several key theoretical results:
- Risk Bounds for Decoding-Based Regression: Sequential tokenization of numerics lets the expected mean integrated squared error (MISE) decompose into a squared-bias term governed by the bin width implied by the tokenization granularity and a variance term governed by the number of samples per bin, making decoding-based regression as universal as histogram density estimation but with much higher sample efficiency and expressivity (2501.19383); see the decomposition after this list. For any smooth conditional density, sufficient sequence length (granularity) guarantees arbitrarily accurate approximation.
- Sufficiency and Oracle Properties: The gamma-lasso MAP estimator in MNIR can achieve the strong oracle property: consistent estimation and variable selection under regularity conditions, with robustness to hyperparameter settings and data scaling (1012.2098).
- Neyman Orthogonality and Inference: Average partial effects in high-dimensional lasso logit text regression are estimated using Neyman-orthogonal score functions, allowing valid inference without the "oracle property"—that is, without requiring perfect variable/model selection (1812.09397).
- Simultaneous Inference with Bootstrap: Cluster-robust simultaneous confidence intervals for high-dimensional text regression models are constructed using multiplier cluster bootstrap methods, controlling family-wise error rates even in complex clustered text (e.g., internet forums) (1812.09397).
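The bound from the cited work is not reproduced here; for orientation, the classical MISE decomposition for a histogram density estimator with bin width $h$ and $n$ samples, which the decoding-based analysis parallels, is:

$$
\mathrm{MISE}(\hat f_h) \;=\; \mathbb{E}\!\int \bigl(\hat f_h(y) - f(y)\bigr)^2 \, dy \;\approx\; \underbrace{\frac{1}{nh}}_{\text{variance}} \;+\; \underbrace{\frac{h^{2}}{12}\int f'(y)^{2}\, dy}_{\text{squared bias}} .
$$

With $K$ output tokens in base $B$, the effective bin width shrinks geometrically as $B^{-K}$, so modest sequence lengths already yield fine output resolution.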
5. Model Design Considerations and Limitations
Designing and deploying effective text-to-text regression solutions requires attention to several trade-offs and limitations:
- Input and Output Representation: Textual encoding of parameters, categorical features, and numeric outputs enables high flexibility but places demands on tokenization schemes (e.g., digit-wise, floating point), which must preserve precision and support robust decoding (2402.14547). Subword tokenizers may lose crucial information for numeracy or arithmetic tasks (2109.04672).
- Discretization and Output Precision: Approaches that discretize regression outputs into candidate spans (for span-extractive models) are limited in numeric precision by the bucket granularity, while token-sequence-based approaches can, in principle, support arbitrary resolution (1904.09286, 2501.19383); a small illustration follows this list.
- Sample and Computational Efficiency: Many frameworks (MNIR, regression-based SSR, contrastive regression) are designed to work efficiently even in high dimensions, with tailored optimization (coordinate descent, batch size adaptation) to avoid the prohibitive computation typical of Bayesian or fully dense approaches (1012.2098, 2406.05326).
- Generalization and Extrapolation: Large text-to-text LLMs exhibit strong interpolation performance for numeracy and sequence tasks but often fail on extrapolation, indicating a need for digit-aware tokenization and specialized architectural or training modifications (2109.04672).
- Adaptability and Domain Robustness: Frameworks such as SSA (2410.03263) and RLM (2506.21718) demonstrate the need for adaptation mechanisms—whether subspace alignment or few-shot fine-tuning—to maintain accuracy under domain shift, emergent features, or task definition changes.
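To make the precision trade-off of span-extractive discretization concrete, the sketch below appends a candidate grid to the prompt and snaps the target to the nearest candidate; the prompt wording, grid, and variable names are invented for illustration.

```python
import numpy as np

# Span-extractive reformulation in miniature: candidate numeric outputs are
# appended to the input text and the model selects one of them as a "span",
# so precision is capped by the grid step (here 0.5).
low, high, step = 0.0, 5.0, 0.5
candidates = np.arange(low, high + step, step)          # 0.0, 0.5, ..., 5.0
prompt = ("review: The battery life is outstanding. "
          "rating candidates: " + " ".join(f"{c:g}" for c in candidates))

true_rating = 4.3
chosen = candidates[np.argmin(np.abs(candidates - true_rating))]
print(chosen)                                           # 4.5 -> error bounded by step / 2
```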
6. Broader Implications and Contemporary Relevance
Text-to-text regression unifies several strands of research in language modeling, statistical learning, and surrogate modeling, offering a general-purpose solution for predictive tasks in domains where traditional feature engineering is infeasible. The shift to textual representations and sequence-to-sequence architectures allows models to operate directly on heterogeneous, semi-structured, or evolving data, adapt rapidly to new problems, and serve as universal regressors or simulators for complex systems.
This framework underpins advances in sentiment analysis, system performance modeling, semantic similarity assessment, and transfer learning for prediction across scientific, industrial, and social science applications, supporting both precise estimation and uncertainty quantification. The development of optimized loss functions, scalable estimation schemes, and theoretically justified model heads positions text-to-text regression as a foundational methodology for data-centric, machine learning-driven analysis in the modern era.
Summary Table: Core Techniques in Text-to-Text Regression
| Technique/Framework | Principle | Key Attribute |
|---|---|---|
| Multinomial Inverse Regression (MNIR) (1012.2098) | Inverse multinomial modeling, sentiment-preserving reduction | Sparse gamma-lasso penalty, fast estimation |
| OmniPred (2402.14547) | Textual LM regression over arbitrary input/output | Multi-task, no normalization needed |
| Decoding-based Regression (2501.19383) | Sequential token output, flexible density estimation | Universal, sample-efficient, error correction |
| SSR (Semantic Regression) (2108.02984) | Sentence vector regression for idea planning and text generation | Mixed-granularity realization |
| Prompt-based Readability (2302.13139) | Text-to-text pairwise ranking via seq2seq prompts | Format-dependent generalization |
| SSA Test-time Adaptation (2410.03263) | PCA subspace alignment and dimension weighting | Robust domain adaptation |
| Span Extraction (1904.09286) | Reformulate regression as span selection | Unified model for classification and regression |