Performance Prediction for Large Systems via Text-to-Text Regression (2506.21718v1)

Published 26 Jun 2025 in cs.LG, cs.AI, cs.PF, cs.SE, cs.SY, and eess.SY

Abstract: In many industries, predicting metric outcomes of large systems is a fundamental problem, driven largely by traditional tabular regression. However, such methods struggle on complex systems data in the wild such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficiency on Borg, Google's massive compute cluster scheduling system, a 60M parameter encoder-decoder, trained from random initialization, achieves up to a near perfect 0.99 (0.9 average) rank correlation across the entire fleet, and 100x lower MSE than tabular approaches. The model also easily adapts to new tasks in only 500 few-shot examples and captures the densities of complex outcome distributions. Ablation studies highlight the importance of using encoders, increasing sequence length, and the model's inherent uncertainty quantification. These findings pave the way for universal simulators of real-world outcomes.

Summary

  • The paper introduces text-to-text regression using an encoder-decoder architecture to predict numeric outcomes from raw, structured system data.
  • The methodology achieves near-perfect Spearman correlations and drastically lower MSE compared to traditional tabular methods.
  • The model’s efficient training from scratch enables robust few-shot adaptation and uncertainty quantification for scalable system optimization.

Text-to-Text Regression for System Performance Prediction: A Technical Overview

The paper "Performance Prediction for Large Systems via Text-to-Text Regression" (2506.21718) presents a comprehensive paper on leveraging Regression LLMs (RLMs) for predicting numeric outcomes in complex, large-scale systems, with a focus on Google's Borg compute cluster. The work systematically demonstrates that text-to-text regression, using encoder-decoder architectures trained from scratch, can outperform traditional tabular regression methods in both accuracy and adaptability, particularly in environments where feature engineering is infeasible due to the complexity and heterogeneity of system data.

Motivation and Problem Setting

Performance prediction in large-scale systems is a longstanding challenge, especially when the input data is high-dimensional, nested, and non-tabular (e.g., system logs, configuration files). Traditional regression approaches—random forests, MLPs—require fixed-length, flat feature vectors, necessitating extensive and brittle feature engineering. Even graybox methods, which combine domain knowledge with ML, are limited by their reliance on explicit, hand-crafted relationships between features and outcomes.

The paper targets the prediction of resource efficiency (MIPS per GCU) in Borg, where the input $x$ comprises a rich, structured snapshot of the cluster state, including hardware distributions, job profiles, scheduling hyperparameters, and temporal context. The output $y$ is a floating-point metric reflecting system productivity. The complexity and dynamism of $x$ make traditional featurization approaches impractical.

Methodology

Text-to-Text Regression with RLMs

The core methodological contribution is the application of text-to-text regression: representing all input features as a single string (e.g., YAML or JSON) and training an encoder-decoder LLM to predict the numeric outcome as a sequence of tokens. Key design choices include:

  • Decoding-based Output: The model is trained via next-token cross-entropy over the tokenized representation of $y$, rather than using a regression head with MSE loss. This approach is shown to be more stable and effective for multi-task regression.
  • Encoder-Decoder Architecture: Contrary to the trend in LLMs, the paper finds that encoder-decoder models are essential for processing complex, structured inputs, outperforming decoder-only models even with similar parameter counts.
  • No Language Pretraining: The RLM is trained from random initialization, without leveraging general language pretraining. The authors argue, and empirically validate, that regression tasks do not benefit from semantic priors in natural language.
  • Custom Tokenization for $y$: Numeric outputs are tokenized using a compact, normalization-free scheme (P10), encoding sign, mantissa, and exponent, which is critical for sample efficiency and multi-task training (a sketch of such a tokenizer follows this list).
  • Context-Free Regression: The model is trained to map a single $x$ to $y$, maximizing the use of sequence length for feature observability, rather than relying on in-context learning.
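
The paper describes the P10 scheme only at the level above, so the following is a minimal, hypothetical sketch of a sign/mantissa/exponent numeric tokenizer in that spirit; the token names, the 4-digit mantissa precision, and the edge-case handling are illustrative assumptions, not the paper's exact vocabulary.

```python
# Hypothetical P10-style numeric tokenizer: y is encoded as a sign token,
# a few mantissa-digit tokens, and an exponent token, with no
# dataset-dependent normalization.
import math

MANTISSA_DIGITS = 4  # assumed precision

def encode_float(y: float) -> list[str]:
    if y == 0.0:
        return ["<+>"] + ["<0>"] * MANTISSA_DIGITS + ["<E0>"]
    sign = "<+>" if y > 0 else "<->"
    y = abs(y)
    exponent = math.floor(math.log10(y))
    mantissa = y / 10 ** exponent                      # in [1, 10)
    digits = f"{mantissa:.{MANTISSA_DIGITS - 1}f}".replace(".", "")
    if len(digits) > MANTISSA_DIGITS:                  # rounding carried mantissa to 10.0
        exponent += 1
        digits = digits[:MANTISSA_DIGITS]
    return [sign] + [f"<{d}>" for d in digits] + [f"<E{exponent}>"]

def decode_tokens(tokens: list[str]) -> float:
    sign = 1.0 if tokens[0] == "<+>" else -1.0
    digits = "".join(t.strip("<>") for t in tokens[1:-1])
    exponent = int(tokens[-1][2:-1])                   # "<E-3>" -> -3
    return sign * float(digits[0] + "." + digits[1:]) * 10 ** exponent

print(encode_float(0.004213))  # ['<+>', '<4>', '<2>', '<1>', '<3>', '<E-3>']
```

Decoding $y$ as a short token sequence lets one cross-entropy objective serve every task, regardless of each task's output scale.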

Training and Fine-Tuning

The RLM is pretrained on large pools of $(x, y)$ pairs from multiple tasks (cells and months), and can be fine-tuned on as few as 500 examples for rapid adaptation to new, unseen tasks. Fine-tuning is performed by resuming training from a checkpoint with a reduced learning rate, enabling efficient transfer learning and meta-learning capabilities.
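
The sketch below illustrates this recipe under stated assumptions: the paper's model is a custom 60M encoder-decoder, so a T5-style Hugging Face model stands in for it, and the checkpoint path and hyperparameters are placeholders rather than the paper's settings.

```python
# Sketch of few-shot adaptation: resume from a pretrained checkpoint and
# continue training on ~500 (x, y) pairs from the new task at a reduced
# learning rate, keeping the same next-token objective as pretraining.
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("path/to/rlm_checkpoint")  # assumed path
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # reduced vs. pretraining

def finetune(model, loader, epochs: int = 10):
    """loader yields (x_ids, y_ids): tokenized input string and tokenized target number."""
    model.train()
    for _ in range(epochs):
        for x_ids, y_ids in loader:
            # Next-token cross-entropy over the tokenized target y.
            loss = model(input_ids=x_ids, labels=y_ids).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```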

Uncertainty Quantification

The model's probabilistic output $p_\theta(y|x)$ allows for both pointwise prediction (mean or median of samples) and density estimation, capturing aleatoric and epistemic uncertainty. The paper provides a theoretical and empirical analysis of the lower bounds on achievable MSE due to irreducible noise and partial observability.
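
As a concrete illustration of using this probabilistic output, the sketch below draws repeated samples from $p_\theta(y|x)$ and summarizes them; `sample_y` is an assumed helper that samples and decodes one numeric output from the model.

```python
# Sampling-based prediction and uncertainty: draw many decoded samples of y,
# report the median as the point estimate and the empirical spread as an
# uncertainty signal.
import numpy as np

def predict_with_uncertainty(model, x_text, sample_y, n_samples=64):
    ys = np.array([sample_y(model, x_text) for _ in range(n_samples)])
    return {
        "point": float(np.median(ys)),   # robust pointwise prediction
        "std": float(ys.std()),          # spread, usable as an uncertainty proxy
        "quantiles": np.quantile(ys, [0.05, 0.5, 0.95]).tolist(),  # density summary
    }
```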

Experimental Results

The empirical evaluation is extensive, covering both in-distribution and out-of-distribution tasks across 40 anonymized Borg cells. The main findings are:

  • Predictive Performance: The 60M parameter RLM achieves up to 0.99 Spearman rank correlation (0.9 average) and 100x lower MSE than tabular baselines, even when trained from scratch.
  • Few-Shot Adaptation: Fine-tuning on 500 examples from a new task yields performance comparable to in-distribution tasks, demonstrating strong transfer and meta-learning properties.
  • Feature Observability: Maximizing the amount of observed features in $x$ is critical; ablations show that sequence length and inclusion of temporal features significantly impact performance.
  • Model Size and Architecture: Increasing model size beyond 100M parameters yields diminishing returns; encoder-decoder architectures are consistently superior to decoder-only models for this task.
  • Uncertainty Calibration: The variance of $p_\theta(y|x)$ correlates with residual error, and the model can capture multi-modal outcome distributions, which is essential for downstream applications like Bayesian optimization.
  • Density Estimation: The RLM provides accurate density estimates, as measured by McFadden's pseudo-$R^2$, even in tasks with high aleatoric noise (sketches of these metrics follow this list).
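
For concreteness, the evaluation quantities referenced above can be computed as in the sketch below; the arrays of predictions and targets, and the log-likelihoods for the pseudo-$R^2$, are assumed inputs.

```python
# Evaluation metrics cited above, given arrays of model predictions and
# ground-truth targets for one task. McFadden's pseudo-R^2 additionally
# needs log-likelihoods under the model and under a null baseline density
# (e.g., the marginal distribution of y).
import numpy as np
from scipy.stats import spearmanr

def evaluate(y_pred: np.ndarray, y_true: np.ndarray) -> dict:
    rho, _ = spearmanr(y_pred, y_true)            # rank correlation
    mse = float(np.mean((y_pred - y_true) ** 2))  # mean squared error
    return {"spearman": float(rho), "mse": mse}

def mcfadden_pseudo_r2(ll_model: float, ll_null: float) -> float:
    # 1 - LL(model) / LL(null); both log-likelihoods are negative, so a
    # better model pushes the ratio down and the score toward 1.
    return 1.0 - ll_model / ll_null
```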

Implementation Considerations

  • Computational Requirements: The approach is computationally efficient; the default model (60M parameters, 2K sequence length) can be trained and fine-tuned on a single GPU.
  • Data Representation: All features are serialized as text, obviating the need for manual featurization or normalization. This enables rapid adaptation to new data sources and evolving system schemas (see the serialization sketch after this list).
  • Deployment: The model can be integrated into system optimization pipelines (e.g., Google Vizier) to provide fast, accurate performance predictions, replacing or augmenting traditional GP regressors.
  • Scalability: The method scales with data and feature complexity, not with model size, making it suitable for large, heterogeneous system datasets.
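
To make the data-representation point concrete, here is an illustrative sketch of serializing a nested system-state record straight to text; the field names and values are invented and do not reflect Borg's actual schema.

```python
# Serialize a nested system-state dict to YAML and feed the string to the
# model as-is: no flattening, feature selection, or normalization.
import yaml

state = {
    "cell": "cell-anon-17",
    "time_window": {"start": "2025-01-01T00:00", "hours": 24},
    "hardware": {"platforms": {"gen-a": 0.62, "gen-b": 0.38}},
    "jobs": [{"priority": 200, "cpus_requested": 1.5, "tier": "batch"}],
    "scheduler": {"bin_packing": "best_fit", "overcommit": 1.1},
}

x_text = yaml.safe_dump(state, sort_keys=False)  # the model's entire input
```

Because the model consumes the raw string, schema changes such as new or renamed fields require no pipeline changes beyond retraining or fine-tuning.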

Implications and Future Directions

The paper's results have several important implications:

  • Universal Simulators: RLMs trained on raw system data can serve as universal simulators for complex environments, enabling rapid what-if analysis and optimization without expensive digital twin simulations.
  • Automated Feature Engineering: The text-to-text paradigm eliminates the need for domain-specific feature engineering, reducing maintenance overhead and increasing robustness to system changes.
  • Reward Modeling and RL: Accurate, uncertainty-aware regression from raw logs opens avenues for real-world reward modeling and reinforcement learning in operational systems.
  • Generalization Beyond Systems: The methodology is applicable to any domain where structured, non-tabular data must be mapped to numeric outcomes, including scientific simulations, finance, and healthcare.

Conclusion

This work establishes text-to-text regression with encoder-decoder RLMs as a practical, scalable, and highly effective approach for performance prediction in large, complex systems. The empirical evidence supports the claim that, with sufficient feature observability and data, relatively small models trained from scratch can achieve near-optimal predictive accuracy and uncertainty calibration, outperforming traditional methods by a substantial margin. The approach is poised to become a foundational tool for data-driven system optimization and simulation.
