Regression Language Models for Code (2509.26476v1)

Published 30 Sep 2025 in cs.CL, cs.AI, cs.LG, cs.PF, and cs.SE

Abstract: We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) can simultaneously predict directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM initialized from T5Gemma, obtains > 0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves > 0.5 average Spearman-rank across 17 separate languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.

Summary

  • The paper presents a novel unified model that predicts numeric code metrics directly from raw code using an autoregressive numeric output, bypassing manual feature engineering.
  • It employs a pretrained T5Gemma encoder combined with a decoder that outputs digitized numeric tokens, achieving high rank correlations on diverse datasets across multiple languages.
  • The approach supports multi-task predictions including memory, latency, and accuracy, offering robust and transferable insights for code optimization and hardware-software co-design.

Regression Language Models for Code: Unified Code-to-Metric Prediction

Motivation and Problem Statement

The paper addresses the challenge of code-to-metric regression, i.e., predicting numeric outcomes (such as memory usage, latency, or accuracy) directly from code or computational graph representations. Traditional approaches rely heavily on domain-specific feature engineering, which is brittle and non-transferable across languages, compilers, or hardware platforms. The authors propose a unified Regression Language Model (RLM) that leverages pretrained language-model encoders (specifically T5Gemma) and a decoder-based numeric output scheme to predict metrics from raw code or graph text, spanning multiple programming languages and representation levels.

Methodology

Model Architecture

The RLM is structured as an encoder-decoder transformer, initialized from T5Gemma. The encoder ingests code or graph representations as plain text, while the decoder autoregressively generates numeric tokens representing the target metric(s). Numeric outputs are tokenized digit-by-digit, including sign and exponent tokens, enabling normalization-free regression and robust handling of wide-ranging target values. Constrained decoding ensures valid numeric outputs and supports both pointwise prediction and density estimation for uncertainty quantification.
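
As a concrete illustration, the sketch below shows one way such a sign/exponent/mantissa tokenization could work. The token names and the fixed four-digit mantissa are assumptions made for illustration, not the paper's actual vocabulary, and rounding edge cases are ignored.

```python
# Illustrative sketch of sign/exponent/mantissa numeric tokenization.
# Token names and the fixed 4-digit mantissa are assumptions, not the
# paper's exact vocabulary; rounding edge cases are ignored.
import math

MANTISSA_DIGITS = 4  # assumed fixed mantissa length


def encode_number(y: float) -> list:
    """Tokenize a float as <sign> <exponent> <digit> ... tokens."""
    if y == 0.0:
        return ["<+>", "<E0>"] + ["<0>"] * MANTISSA_DIGITS
    sign = "<+>" if y > 0 else "<->"
    exponent = math.floor(math.log10(abs(y)))
    mantissa = abs(y) / 10 ** exponent                  # in [1, 10)
    digits = f"{mantissa:.{MANTISSA_DIGITS - 1}f}".replace(".", "")
    return [sign, f"<E{exponent}>"] + [f"<{d}>" for d in digits[:MANTISSA_DIGITS]]


def decode_number(tokens: list) -> float:
    """Invert encode_number back to a float."""
    sign = 1.0 if tokens[0] == "<+>" else -1.0
    exponent = int(tokens[1][2:-1])
    digits = "".join(t[1:-1] for t in tokens[2:])
    return sign * (int(digits) / 10 ** (MANTISSA_DIGITS - 1)) * 10 ** exponent


print(encode_number(0.000318))                 # ['<+>', '<E-4>', '<3>', '<1>', '<8>', '<0>']
print(decode_number(encode_number(0.000318)))  # 0.000318 (up to float rounding)
```

Because every target is spelled out in the same digit vocabulary, values spanning many orders of magnitude can be handled without dataset-specific normalization.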

Multi-Task and Multi-Objective Regression

The model is trained on a mixture of regression tasks, including:

  • Memory footprint prediction for code in Python, C++, and other languages
  • Latency estimation for Triton GPU kernels
  • Accuracy and speed prediction for neural networks represented in ONNX IR

The autoregressive decoder enables conditional modeling of multiple objectives (e.g., predicting latency and then accuracy), capturing inter-metric dependencies that parallel regression heads cannot.
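
To make the conditional factorization concrete, here is a toy sketch (not the paper's model) in which a stand-in decoder emits latency first and then accuracy conditioned on that latency. The metric order, the stand-in functions, and the numbers are illustrative assumptions.

```python
# Toy sketch of multi-objective autoregressive decoding: metrics are emitted
# one after another, so each prediction can condition on the ones before it.
# The stand-in "decoder" below is illustrative, not the paper's model.
import random

METRIC_ORDER = ["latency_ms", "accuracy"]   # assumed decoding order


def decode_one_metric(code_text: str, decoded_so_far: dict) -> float:
    """Stand-in for one decoding pass; a real RLM would emit numeric tokens."""
    if not decoded_so_far:                              # first metric: input only
        return 5.0 + 0.01 * len(code_text) + random.gauss(0.0, 0.5)
    # later metrics also condition on earlier predictions, e.g. very fast
    # (small) architectures tend to be less accurate
    latency = decoded_so_far["latency_ms"]
    return min(1.0, max(0.0, 0.60 + 0.02 * latency + random.gauss(0.0, 0.02)))


def decode_metrics(code_text: str) -> dict:
    decoded = {}
    for name in METRIC_ORDER:                           # autoregressive over metrics
        decoded[name] = decode_one_metric(code_text, dict(decoded))
    return decoded


print(decode_metrics("def relu(x): return max(0, x)"))
```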

Data Sources

The RLM is trained and evaluated on diverse datasets:

  • CodeNet: 7.3M code samples across 17 languages, with memory usage labels
  • APPS Leetcode: 98.9K Python solutions, with peak memory usage labels
  • KernelBook: 12.6K Triton GPU kernels, with latency measurements
  • NASBench, FBNet, OFA, SNAS, etc.: 520K neural architectures in ONNX IR, with accuracy, FLOPs, parameter count, and hardware-specific latencies
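
For the ONNX-based tasks, the network graph has to be available as text for the encoder. The sketch below shows one plausible way to produce such a textual dump using the public `torch` and `onnx` packages; the exact serialization format used in the paper may differ.

```python
# Sketch: export a tiny network to ONNX and render its graph as plain text,
# the kind of flat textual input the encoder can consume. The serialization
# the paper actually uses may differ from printable_graph's format.
import torch
import torch.nn as nn
import onnx

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
torch.onnx.export(model, torch.randn(1, 16), "tiny_mlp.onnx")

graph = onnx.load("tiny_mlp.onnx").graph
graph_text = onnx.helper.printable_graph(graph)   # node/edge listing as a string
print(graph_text[:500])                           # this text becomes the model input
```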

Experimental Results

High-Level Code Regression

  • On APPS Leetcode, the RLM achieves a Spearman rank correlation ρ > 0.9 for memory prediction.
  • On CodeNet, a single model achieves an average ρ > 0.5 across 17 languages, with the best results on C++ (ρ = 0.748) and competitive performance on niche languages (Lua, Haskell, etc.).
  • The RLM can rank candidate solutions within a problem, outperforming random selection in identifying the most memory-efficient code.
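
This ranking-style evaluation can be reproduced with standard tools. The sketch below uses `scipy.stats.spearmanr` on made-up predicted and measured memory values for four hypothetical candidate solutions.

```python
# Sketch of the ranking-style evaluation: correlate predicted vs. measured
# peak memory and pick the predicted most memory-efficient candidate.
# The file names and numbers are made up for illustration.
from scipy.stats import spearmanr

candidates   = ["sol_a.py", "sol_b.py", "sol_c.py", "sol_d.py"]
measured_mb  = [128.0, 96.0, 512.0, 64.0]    # ground-truth peak memory (MB)
predicted_mb = [140.0, 90.0, 480.0, 70.0]    # hypothetical RLM predictions (MB)

rho, _ = spearmanr(predicted_mb, measured_mb)
best = min(zip(predicted_mb, candidates))[1]      # predicted most memory-efficient
print(f"Spearman rho = {rho:.2f}, pick {best}")   # rho = 1.00, pick sol_d.py
```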

Kernel and Graph Regression

  • For Triton kernel latency, the RLM achieves ρ = 0.516.
  • On NAS benchmarks, the RLM matches or exceeds state-of-the-art GNN-based methods (e.g., FLAN, Arch2Vec, CATE) in Kendall-Tau rank correlation, with an average τ = 0.46 across five NAS design spaces.
  • The RLM supports multi-metric prediction, accurately modeling Pareto frontiers of accuracy vs. latency across multiple hardware platforms.
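
The sketch below illustrates both evaluation ideas with made-up numbers: Kendall-Tau rank correlation between predicted and true accuracies, and a simple Pareto-front filter over predicted (accuracy, latency) pairs.

```python
# Sketch: Kendall-Tau for NAS ranking quality, plus a simple Pareto-front
# filter over predicted (accuracy, latency) pairs. All values are made up.
from scipy.stats import kendalltau

true_acc = [71.2, 73.5, 69.8, 74.1, 72.0]
pred_acc = [70.9, 73.0, 71.5, 74.5, 71.0]
tau, _ = kendalltau(pred_acc, true_acc)
print(f"Kendall tau = {tau:.2f}")                 # 0.60 for these toy values

# Keep architectures that no other one beats on both accuracy and latency.
preds = [("arch_a", 71.2, 12.0), ("arch_b", 73.5, 20.0),
         ("arch_c", 69.8, 9.0), ("arch_d", 74.1, 35.0), ("arch_e", 70.0, 30.0)]
pareto = [p for p in preds
          if not any(q[1] > p[1] and q[2] < p[2] for q in preds)]
print([name for name, _, _ in pareto])            # arch_e is dominated by arch_b
```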

Ablation Studies

  • Pretraining: Initializing from a language-pretrained encoder (T5Gemma) accelerates convergence and improves final regression metrics. Regression pretraining on synthetic tasks (e.g., FLOPs) further boosts performance.
  • Decoder vs. Regression Head: Decoder-based numeric output (cross-entropy loss) outperforms explicit regression heads (MSE loss), especially in normalization-free settings and with wide-ranging target values (a minimal sketch of such a head appears after this list).
  • Scaling: Larger encoder models (600M vs. 300M parameters) yield higher rank correlations, but require careful hyperparameter tuning and more compute.
  • Tokenization and Sequence Length: Custom tokenizers (e.g., ONNX-aware) and longer input sequences improve performance, especially for large graphs.
  • Fine-Tuning: Fine-tuning on specific languages does not consistently improve performance if the pretraining corpus is sufficiently rich, but is beneficial for low-resource NAS tasks.
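
For reference, the sketch below shows what an explicit regression-head baseline of the kind compared against above typically looks like: an MLP over mean-pooled encoder states trained with MSE. The dimensions, pooling choice, and random tensors standing in for encoder outputs are illustrative assumptions, not the paper's exact baseline.

```python
# Sketch of the explicit regression-head baseline: an MLP over mean-pooled
# encoder states trained with MSE. Dimensions, pooling, and the random
# tensors standing in for encoder outputs are illustrative assumptions.
import torch
import torch.nn as nn

hidden = 768                                     # assumed encoder width
encoder_states = torch.randn(8, 512, hidden)     # (batch, seq_len, hidden) stand-in
pooled = encoder_states.mean(dim=1)              # mean-pool over the sequence

regression_head = nn.Sequential(
    nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 1)
)
pred = regression_head(pooled).squeeze(-1)
target = torch.randn(8)                          # MSE typically needs normalized targets
loss = nn.MSELoss()(pred, target)
loss.backward()
print(float(loss))
```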

Implementation Considerations

  • Training: The RLM is trained with Adafactor, using a linear warmup and cosine decay schedule. Input sequences are cropped to a maximum length (2048 tokens), with custom tokenization for code and ONNX graphs.
  • Inference: Median aggregation over multiple decoder samples (default: 64) is used for pointwise prediction; increasing the sample count can improve accuracy (see the sketch after this list).
  • Resource Requirements: The default model (300M parameters) is tractable for modern hardware, but scaling to larger models or longer sequences may require distributed training and memory optimization.
  • Deployment: The normalization-free decoder and text-based input format enable straightforward deployment across diverse codebases and hardware platforms, without the need for feature engineering or dataset-specific normalization.
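
A minimal sketch of the median-aggregation inference step described above; `sample_once` is a hypothetical stand-in for one constrained numeric decode from the RLM.

```python
# Sketch of pointwise prediction via median aggregation over decoder samples.
# `sample_once` is a hypothetical stand-in for one constrained numeric decode.
import random
import statistics


def sample_once(code_text: str) -> float:
    """Stand-in: a real RLM would sample numeric tokens and parse the result."""
    return 100.0 + 0.05 * len(code_text) + random.gauss(0.0, 5.0)


def predict(code_text: str, num_samples: int = 64) -> float:
    samples = [sample_once(code_text) for _ in range(num_samples)]
    return statistics.median(samples)            # robust point estimate


print(round(predict("for i in range(n): total += i * i"), 1))
```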

Implications and Future Directions

The RLM paradigm demonstrates that unified, text-based regression models can subsume traditional feature engineering and specialized graph models for code-to-metric prediction. This approach aligns with the language modeling paradigm, treating regression as next-token prediction and leveraging transfer learning from language data. The ability to predict multiple metrics from raw code or graph text has direct applications in program search, hardware-software co-design, compiler optimization, and automated machine learning.

Open questions remain regarding the generalization of RLMs to entirely new code domains, the limits of scaling, and the potential for predicting experimental outcomes from raw code. Further research may explore larger models, more diverse pretraining corpora, and integration with program synthesis and optimization pipelines.

Conclusion

The paper establishes Regression Language Models (RLMs) as effective, normalization-free, and transferable predictors for code-to-metric regression across languages, compilers, and hardware. By leveraging pretrained language-model encoders and autoregressive numeric decoders, RLMs outperform or match specialized GNNs and regression heads, while simplifying the modeling pipeline and enabling multi-task, multi-objective prediction. This work provides a foundation for future research in unified code analysis, performance prediction, and automated optimization.


Explain it Like I'm 14

Overview

This paper is about teaching a language model to look at code (or a description of a neural network) and predict useful numbers about it, like how much memory it will use, how fast it will run, or how accurate a model might be, without actually running the code. The authors call this kind of system a Regression Language Model (RLM).

Think of it like this: you show the model a “recipe” (the code), and it learns to estimate the cooking time, the number of dishes it makes, and the calories, just by reading the recipe.

What questions did the paper ask?

The authors wanted to know:

  • Can one single model, reading only text, predict different kinds of numbers for many kinds of code?
  • Can it handle multiple programming languages (like Python, C++, Haskell), GPU kernels (special programs for graphics cards), and neural networks (described in a standard format called ONNX)?
  • Can it match or beat specialized models that were carefully engineered for these tasks?
  • Can it predict several related numbers at once (for example, speed on different devices and accuracy), and understand how they relate?

How did they do it?

They built a unified Regression Language Model (RLM) that treats “predicting a number” like “writing the next tokens.” Here’s the approach in everyday terms:

  • The model has two parts: a “reader” (encoder) that reads the input text (like source code or a network graph in ONNX), and a “writer” (decoder) that outputs numbers as text, one digit at a time.
  • Instead of designing custom features for each kind of code (which is time-consuming), they feed the raw text directly to the model. This avoids “feature engineering.”
  • The model was initialized from a pretrained LLM called T5Gemma, so it already “understands” a lot of programming words and patterns.
  • When outputting numbers, the model writes them digit-by-digit using special number tokens (like signs and exponents). This helps it handle tiny numbers (like 0.01) and huge numbers (like 1,000,000) without tricky rescaling.
  • They trained the same model on many tasks at once (multi-task), such as:
    • Memory use for solutions to coding problems (from APPS and CodeNet datasets).
    • Runtime (latency) of GPU kernels written in Triton.
    • Accuracy and latency of neural networks from several architecture search spaces, represented in ONNX text.
  • They also let the model predict several metrics in sequence (multi-objective). For example, it can first predict accuracy, then predict latency on different devices. Predicting in sequence helps it learn relationships, like “very low latency might mean the model is too small and less accurate.”

A quick translation of technical terms:

  • ONNX: a common “blueprint” format for neural networks, like a wiring diagram that shows all the layers and how data flows.
  • Latency: how long something takes to run.
  • Memory footprint: how much memory it uses while running.
  • Rank correlation (Spearman/Kendall): a score showing how well the model orders items from best to worst (even if its exact number guesses aren’t perfect). High scores mean it’s good at ranking.

What did they find?

The model worked surprisingly well across very different tasks, using only text as input. Highlights include:

  • Multiple programming languages: The model could rank programs by memory usage across many languages (like C++, Python, Haskell, Lua). On a coding dataset (APPS), it achieved a very high ranking score (Spearman ≈ 0.93) for memory prediction, meaning it was excellent at ordering solutions by how memory-hungry they were.
  • GPU kernels: It could predict how long Triton GPU kernels would take to run, using just the text of the kernel.
  • Neural networks (NAS): Reading ONNX “blueprints,” the model matched or slightly beat state-of-the-art methods designed specifically for predicting neural network performance (average Kendall-Tau ≈ 0.46), even without extra hints those methods often need. It also predicted latencies on different hardware and captured the trade-offs between accuracy and speed.
  • One model for many tasks: A single, relatively small model (about 300 million parameters) handled all of this at once. Training on multiple tasks didn’t hurt performance—sometimes it helped.
  • Helpful design choices:
    • Pretraining matters: Starting from a pretrained language model (T5Gemma) and also pretraining on simple synthetic tasks sped up learning and improved accuracy.
    • Digit-by-digit number outputs were better than attaching a typical “regression head” (a separate number-prediction layer), and they didn’t need any number normalization tricks.
    • Bigger encoders helped (within compute limits).
    • Custom tokenization and longer input length (so the model can read more of a large graph) boosted results.

Why these results are important:

  • The model can rank code or architectures well, which is often what you need to choose a better option quickly (e.g., which solution uses the least memory).
  • It removes the need for complicated, hand-crafted features for each new language or graph structure.

Why does this matter?

This work suggests a simpler future for performance prediction:

  • Instead of building a new, specialized predictor for every coding language, hardware device, or neural network format, we can train one text-based model to handle many of them.
  • Developers could quickly estimate the memory or speed of code before running it, helping them select better solutions (e.g., choosing the most memory-efficient program among many submissions).
  • AI researchers and engineers could use it to shortlist faster, more accurate neural network designs, saving time and compute.
  • Compiler and hardware designers could try out more ideas virtually, using predictions to guide optimizations and co-design (software + hardware tuned together).

In short, the paper shows that a language model can be a strong, general “reader of code” that predicts useful numbers across many settings, making performance prediction more flexible, faster to set up, and easier to reuse.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, formulated to guide future research.

  • Zero-shot generalization: Evaluate the RLM’s ability to predict metrics for entirely new problems and languages not seen during training (e.g., CodeNet-style tasks with new questions and inputs), rather than relying on per-question few-shot overlap between train and test.
  • Input-dependent metrics: Memory and latency depend strongly on runtime inputs; incorporate explicit program inputs into the representation and quantify generalization across varying input distributions and sizes.
  • Hardware coverage for kernels: Triton latency was measured only on an NVIDIA A6000; expand to diverse GPUs (architectures, drivers, clocks) and assess cross-hardware transferability.
  • Kernel diversity and harness bias: Current Triton data is largely TorchInductor-generated and filtered by an automated harness; include hand-written kernels and alternative toolchains, and quantify selection bias caused by harness failures.
  • Long-context limitations: Many ONNX graphs (e.g., Inception with ~23K tokens median) exceed the tested encoder context (≤4K). Characterize truncation strategies, their impact on accuracy, and investigate long-sequence or hierarchical encoders to handle full graphs.
  • Structural inductive bias: Text-only encoding may underutilize graph structure. Compare against or integrate graph-aware models (graph transformers, hybrid text+graph encoders) to improve precision on large, structured computation graphs.
  • Uncertainty calibration: The decoder produces predictive densities, but calibration is not quantified. Evaluate calibration quality (e.g., NLL, CRPS, coverage of prediction intervals) and compare to calibrated baselines.
  • Absolute-error reporting: Beyond rank correlations (Spearman/Kendall), report MAE/RMSE and practical error bounds (e.g., % error) to assess decision utility for compilers and hardware design.
  • Multi-objective ordering effects: Investigate how the order of autoregressively decoded metrics (e.g., accuracy before latencies) affects learned dependencies and performance; compare to jointly modeled or order-invariant approaches.
  • Few-shot adaptation in practice: Demonstrate and measure “pretrain then fine-tune” few-shot adaptation to truly novel tasks (new languages, IRs, hardware) with limited labels, including ablations on task weighting and sampling.
  • Out-of-distribution operations: Test zero-shot performance on ONNX operators, kernels, and language constructs withheld during training to assess robustness to evolving toolchains and languages.
  • CodeNet inputs gap: Create or adopt compiled-language datasets with explicit inputs to enable realistic memory/latency prediction and true generalization across problems, addressing the current inability to evaluate zero-shot.
  • Broader baselines for code tasks: For code memory/latency prediction, include strong domain baselines (static analysis/analytical models, feature-engineered regressors) to assess competitiveness beyond NAS-focused comparisons.
  • Measurement reproducibility: Standardize and report APPS execution environment (Python/OS/allocator/sandbox), quantify measurement noise, and assess its impact on training/evaluation and on comparative metrics.
  • Hardware descriptors as inputs: Evaluate whether adding structured hardware metadata (e.g., microarchitecture, cache sizes, runtime backend) improves multi-hardware latency prediction beyond decoding separate labels.
  • Latency ground truth provenance: Clarify how NAS latencies are obtained (real measurements vs simulators), the runtime/backend used, and test robustness across different backends and operator libraries.
  • Numeric tokenization variants: Explore alternative numeric encodings (e.g., logarithmic scaling, different bases, mixed-precision tokens) for extreme values and assess effects on convergence, bias, and stability.
  • Scaling laws and efficiency: Provide scaling curves (params vs performance), hyperparameter recipes for larger models (beyond 600M), and measure inference throughput/latency and memory footprint to gauge deployability.
  • Interpretability and attribution: Develop methods to attribute predictions to code/graph regions (e.g., operator-level saliency, counterfactual edits) to support compiler/hardware optimization workflows.
  • Fine-grained predictions: Extend beyond scalar metrics to per-node/per-block latency/memory estimates; validate against profilers to enable actionable optimization.
  • Robustness to superficial changes: Assess sensitivity to formatting, renaming, comments, dead code, and obfuscation; design invariance-promoting preprocessing or training objectives where appropriate.
  • Dataset biases and selection effects: Quantify biases introduced by acceptance filtering (CodeNet) and harness-induced failures (KernelBook), and construct balanced datasets to mitigate skewed learning.
  • Cross-language transfer: Measure transfer between languages (e.g., training on C++/Python, testing on Haskell/Lua), and compare T5Gemma against code-pretrained encoders (e.g., CodeT5) for improved multilingual code regression.
  • Compiler flags and optimization levels: Incorporate compiler configuration as inputs and study their impact on predicted metrics; evaluate whether the RLM can model such effects.
  • Safe labeling pipelines: Evaluate the risks of executing arbitrary code for label collection; propose secure sandboxing protocols or static surrogate labeling methods to scale dataset creation safely.
  • Distribution-shift stress tests: Construct adversarial/synthetic programs designed to confound superficial text patterns and report robustness under controlled distribution shifts.
  • End-to-end optimization gains: Demonstrate practical benefits (speedups, solution quality) when using RLM predictions inside program search, compiler autotuning, or NAS loops; quantify improvements versus established pipelines.
  • Standardized benchmarks: Release public, reproducible code-to-metric benchmarks including inputs, environments, and hardware metadata to enable fair comparison and longitudinal tracking of progress.

Practical Applications

Overview

Below are actionable, real-world applications that follow directly from the paper’s findings and methods on Regression LLMs (RLMs) for predicting numeric outcomes (memory, latency, accuracy) from code and computation-graph text. Each item specifies sector(s), concrete use cases, potential tools/products/workflows, and key assumptions/dependencies that affect feasibility.

Immediate Applications

  • Sector: Software/Compilers, ML Systems (TVM, TorchInductor, XLA, Triton)
    • Use case: Static kernel selection and autotuning. Use RLM to rank Triton GPU kernel variants and compiler schedules by predicted latency before execution.
    • Tools/workflows: “RLM Autotuner” pass integrated into TorchInductor/TVM; pre-compile ranking service; CI hooks to detect performance regressions.
    • Assumptions/dependencies: Availability of representative training data on target hardware; ONNX/Triton textual IRs; modest fine-tuning for new devices; careful tokenizer/sequence-length settings for long graphs.
  • Sector: MLOps/Inference Platforms, Edge/Embedded
    • Use case: Architecture selection under latency/accuracy constraints across devices (mobile, CPU, GPU, ASIC). Use the RLM’s multi-objective decoding to surface Pareto-optimal candidate models.
    • Tools/workflows: “Multi-Target Latency Estimator” microservice; deployment planner that filters model zoo by device-specific latency targets; capacity planning dashboards.
    • Assumptions/dependencies: Access to ONNX exports; calibration per hardware SKU; acceptance of rank-based decisions (RLM excels at ranking via Spearman/Kendall, not exact values).
  • Sector: Developer Tooling/IDEs, Code Review
    • Use case: Memory-usage warnings and suggestions during coding. Flag high-memory Python/C++ patterns; rank alternative solutions by predicted peak memory (e.g., APPS-like coding tasks).
    • Tools/workflows: “RLM Performance Advisor” IDE plugin; PR bot that annotates proposed changes with predicted memory/latency impacts and safer patterns (e.g., hash-based vs O(1)-memory loops).
    • Assumptions/dependencies: Tokenization that handles language-specific constructs; availability of typical code snippets for fine-tuning; model latency acceptable for interactive IDE use (300M parameters is practical on modern workstations).
  • Sector: Cloud/DevOps/FinOps
    • Use case: Cost-aware scheduling and instance-type recommendations by predicting job latency/memory prior to deployment; gating merges on predicted resource regressions.
    • Tools/workflows: CI/CD “Performance Gate” (block merges that push beyond SLOs); job scheduler enriched with predicted runtime/memory; cost forecasting using predicted latency across instance families.
    • Assumptions/dependencies: Mapping latency to cost per cloud SKU; routine re-calibration for hardware/driver updates; reliability of ONNX/code export in pipelines.
  • Sector: NAS Research/Engineering
    • Use case: Faster NAS via unified surrogate models that predict accuracy and device-specific latency from ONNX, removing bespoke features/zero-cost proxies.
    • Tools/workflows: “RLM Surrogate API” for black-box/search algorithms; Pareto-front exploration UI powered by autoregressive multi-metric predictions.
    • Assumptions/dependencies: ONNX availability for candidate architectures; synthetic pretraining (e.g., FLOPs) to accelerate convergence; out-of-distribution checks when exploring new search spaces.
  • Sector: HPC/Scientific Computing
    • Use case: Quick performance triage of scientific kernels from source without full runs; ranking candidate implementations by predicted memory and runtime.
    • Tools/workflows: Pre-commit hook for resource checks; batch ranking of code variants produced by autotuners or DSLs (e.g., Halide/MLIR pipelines).
    • Assumptions/dependencies: Sufficient domain coverage in training data or brief fine-tuning; correlation between code-level text and actual runtime on target cluster.
  • Sector: Education
    • Use case: Feedback on resource complexity in programming assignments; demonstrating tradeoffs between time and memory with immediate predictions.
    • Tools/workflows: LMS plugin that annotates student submissions with predicted memory/latency; interactive labs comparing algorithmic patterns.
    • Assumptions/dependencies: Sandboxed or static (non-executing) evaluation; curated examples for niche languages to avoid tokenization gaps.
  • Sector: Open-source/Benchmarking
    • Use case: Standardized evaluation and reproduction using the provided code and dataset; baseline for code-to-metric regression across languages.
    • Tools/workflows: Adoption of the released GitHub/HF assets to build internal benchmarks; continuous data augmentation with synthetic tasks (FLOPs) to improve domain coverage.
    • Assumptions/dependencies: License compatibility; data sufficiency for the target domains; reproducible export to text (ONNX, kernel IRs).

Long-Term Applications

  • Sector: Software/Compilers (LLVM/MLIR), Auto-scheduling
    • Use case: Universal learned performance oracle in compilers. Replace heuristic cost models with a single RLM that reads text IR across passes and devices, guiding end-to-end optimization.
    • Tools/products: “RLM Cost Model” drop-in for LLVM/MLIR/Tensor compilers; learned pass ordering and parameter selection.
    • Assumptions/dependencies: Robust generalization to unseen ops/kernels; standardized textual IRs; fallback strategies for rare cases; monitoring and continual learning loops.
  • Sector: Hardware–Software Co-Design
    • Use case: Joint exploration of algorithm, architecture, and hardware configurations using RLM predictions across target devices to shrink design cycles.
    • Tools/products: Co-design studio integrating RTL/µarch configs and NN graphs as text; active learning to collect targeted measurements for new silicon.
    • Assumptions/dependencies: Access to early hardware specs or simulators for initial labels; domain shift handling across process nodes and toolchains.
  • Sector: Energy/Sustainability, Policy/Compliance
    • Use case: “Green coding” labels and compliance gates that estimate energy or carbon proxies via latency/memory predictions; procurement policies favoring energy-efficient code paths.
    • Tools/products: CI “Eco-Gate” with predicted energy budgets per feature; public efficiency scorecards; procurement checklists requiring RLM-based estimates.
    • Assumptions/dependencies: Validated mappings from latency/memory to energy per device; regulator acceptance of learned estimators; standardized reporting formats.
  • Sector: Autonomous Code Optimization Agents
    • Use case: Agents that propose code rewrites and verify improvements via RLM feedback loops (optimize for memory, latency, or device-specific constraints).
    • Tools/products: “RLM-in-the-loop” refactoring assistant; automated PRs with predicted resource deltas and uncertainty estimates.
    • Assumptions/dependencies: Reliable uncertainty quantification; guardrails for correctness/security; evaluation harnesses where ground truth is occasionally measured to prevent drift.
  • Sector: Robotics, AR/VR, Real-time Systems
    • Use case: On-device runtime selection of models and algorithms based on predicted latency under changing conditions (e.g., battery, thermal throttling).
    • Tools/products: Runtime policy engine that queries an on-device RLM to select models; adaptive quality-of-service controllers.
    • Assumptions/dependencies: Compact, efficient RLM variants or distillation; adaptation to dynamic hardware states; safety constraints.
  • Sector: Healthcare/Clinical ML
    • Use case: Pre-deployment checks for model variants to meet latency/SLA in clinical workflows without compromising accuracy; planning for edge clinical devices.
    • Tools/products: Model card extensions with predicted latency/accuracy on target hardware; hospital IT planning tools.
    • Assumptions/dependencies: High assurance, rigorous validation; domain-specific ONNX graphs; governance for learned predictions.
  • Sector: Finance/Trading/Batch Analytics
    • Use case: Cost and SLA forecasting for analytical pipelines from code graphs; pre-commit checks to control compute spend.
    • Tools/products: Budgeting and alerting dashboards that use RLM predictions to flag expensive changes; portfolio-of-pipelines Pareto planning.
    • Assumptions/dependencies: Stable mapping from predicted latency to cost; secure handling of proprietary code graphs; periodic re-labeling for drift.
  • Sector: Experiment Outcome Prediction (AutoML/Science)
    • Use case: Predict numeric outcomes of entire experiments from code/configs (beyond models), e.g., training time, validation metrics under specified hyperparameters.
    • Tools/products: “Experiment Forecaster” for scheduling and resource allocation; early stopping policies triggered by predicted poor outcomes.
    • Assumptions/dependencies: Rich meta-data/text for experiments; multi-domain pretraining beyond code; careful modeling of confounders and data-dependent effects.
  • Sector: Standards and Interoperability
    • Use case: Community standards for text-first performance modeling (e.g., ONNX-based performance descriptors) enabling plug-and-play predictors across ecosystems.
    • Tools/products: Schema for performance-related text IR; certification benchmarks for learned predictors.
    • Assumptions/dependencies: Industry consensus; governance for versioning of ops and kernels; compatibility with existing MLIR/ONNX efforts.

Cross-cutting assumptions and dependencies

  • Domain adaptation: Best results require fine-tuning on target tasks/hardware; synthetic pretraining (e.g., FLOPs) accelerates convergence.
  • Ranking vs absolute prediction: RLMs are strongest as rankers (high Spearman/Kendall). Workflows should prefer selection/ranking over hard thresholds when possible.
  • Textual availability: Must export code/graphs to textual forms (e.g., ONNX dumps, kernel IRs) with sufficient context length and appropriate tokenization.
  • Continual calibration: Hardware, drivers, and toolchains evolve; active learning and periodic ground-truth measurements will be needed to maintain accuracy.
  • Compute and latency: 300M-parameter models are deployable but require resource planning (possibly distillation or server-side inference for IDEs and CI).
  • Safety and governance: For policy/regulated sectors, predictions need uncertainty estimates, audit logs, and fallbacks to hand-crafted analyses for critical decisions.

Glossary

  • APPS: A benchmark dataset of programming problems and solutions used to evaluate code understanding and performance prediction. "APPS Leetcode: Hendrycks et al. (2021) contains 10K Python problems, with 232.4K ground-truth solutions and 131.7K test cases."
  • Arch2Vec: A graph-based architecture embedding method that uses a graph autoencoder to represent neural network architectures. "Arch2Vec (Graph Enc.)"
  • ASIC: Application-Specific Integrated Circuit; specialized hardware optimized for particular tasks. "Pixel3 (Mobile), Eyeriss (ASIC), Intel CPU and Nvidia GPU"
  • Auto-differentiation graph: The computation graph that captures operations and gradients used for training neural networks. "contains full information about the auto-differentiation graph used"
  • Autoregressive: A modeling approach that generates outputs sequentially, conditioning on previous outputs. "Due to the decoder's autoregressive nature, consecutively decoding more numbers also allows conditionally modeling multiple objectives"
  • Bayesian Optimization: A sample-efficient optimization framework using probabilistic surrogate models to guide search. "for the use in Gaussian Processes (Kandasamy et al., 2018; Ru et al., 2021) for Bayesian Optimization"
  • CATE: A transformer-based architecture encoding method that derives token sequences from adjacency matrices to model global graph structure. "CATE (Transformer Enc.)"
  • CodeNet: A large-scale multi-language code dataset used for tasks such as memory and latency prediction. "CodeNet: (Puri et al., 2021) introduces a large-scale dataset consisting of 14M code samples over 37 languages."
  • Constrained decoding: A decoding procedure that restricts token generation to satisfy validity constraints (e.g., produce a well-formed number). "At inference, constrained decoding is performed to ensure a valid number is always sampled"
  • Decoder head: Using the LLM’s decoder to directly generate numeric outputs as tokens rather than predicting with a separate regression module. "yet the decoder head remains best (Spearman's ρ = 0.800)"
  • Density estimation with uncertainty quantification: Predicting a distribution over outputs (not just a point estimate) and expressing uncertainty in the prediction. "density estimation with uncertainty quantification (Song & Bahri, 2025)."
  • Digit-by-digit numeric tokenization: Representing numbers as sequences of specialized tokens for sign, exponent, and mantissa to improve numeric regression. "use explicit digit-by-digit numeric tokenizations - similar to (Song & Bahri, 2025), we represent y using special sign, exponent, and mantissa tokens"
  • Encoder-decoder: A sequence-to-sequence architecture where an encoder processes inputs and a decoder generates outputs. "The RLM is best structured as an encoder-decoder, which allows input representations of x to be purely in text"
  • Few-shot: Adapting a model to a new task with very limited training examples. "few-shot adapt to a new regression task via fine-tuning."
  • FLOPS: Floating Point Operations Per Second; a measure of computational throughput often used as a performance proxy. "to predict floating point operations per second (FLOPS) for each architecture."
  • Gaussian Processes: Nonparametric probabilistic models often used as surrogates in Bayesian optimization. "graph kernels for the use in Gaussian Processes (Kandasamy et al., 2018; Ru et al., 2021)"
  • GNN: Graph Neural Network; a neural architecture that operates on graph-structured data. "state-of-the-art graph neural network (GNN)-based regression methods"
  • Graph autoencoder: A neural network that learns compressed representations of graphs by reconstructing them from embeddings. "which uses a graph autoencoder"
  • Graph kernels: Functions that compute similarities between graphs to enable kernel methods like Gaussian processes. "creating graph kernels for the use in Gaussian Processes"
  • Hardware embeddings: Vector representations of hardware characteristics used to improve cross-device performance prediction. "additional techniques include hardware embeddings (Akhauri & Abdelfattah, 2024a, 2023)"
  • Inductive bias: The set of assumptions a model makes to generalize beyond observed data. "presumably due to questions around their inductive bias"
  • Intermediate Representation (IR): A standardized, hardware-agnostic format for representing computation graphs. "intermediate representation (IR) (ONNX Community, 2017)"
  • Kendall-Tau: A rank correlation statistic measuring ordinal association between two rankings. "highest average Kendall-Tau of 0.46"
  • KernelBook: A dataset pairing PyTorch programs with Triton kernels for profiling and latency estimation. "KernelBook (Paliskara & Saroufim, 2025) pairs PyTorch programs with Triton kernels"
  • Mantissa tokens: Tokens representing the significant digits of a number in digit-by-digit numeric tokenization. "we represent y using special sign, exponent, and mantissa tokens"
  • Multi-objective modeling: Jointly modeling multiple dependent target metrics (e.g., accuracy and latency) in a conditional sequence. "3.2. Multi-Objective Modeling"
  • NAS (Neural Architecture Search): Automated methods for designing neural network architectures, often requiring performance predictors. "standard neural architecture search (NAS) benchmarks."
  • No-free-lunch theorem: The principle that no single model performs best across all possible tasks without task-specific assumptions or data. "This can be more broadly seen as a consequence of the "no-free-lunch" theorem"
  • ONNX: Open Neural Network Exchange; a standardized format for interoperable representation of neural networks. "Open Neural Network Exchange (ONNX)"
  • Pareto frontier: The set of non-dominated solutions trading off multiple objectives, where improving one worsens another. "predicted Pareto-frontier"
  • Pareto-optimal: A solution for which no objective can be improved without degrading another. "predicted Pareto-optimal points x*."
  • Path encodings: Representations of architectures by enumerating paths through their computation graphs. "through the use of path encodings (White et al., 2021a)"
  • Probits: A statistical transform mapping probabilities to the quantiles of a standard normal distribution. "axes are scaled by percentile (probits)"
  • Regression head: An auxiliary prediction layer (often an MLP) trained with losses like MSE on top of encoder embeddings. "explicit regression head (e.g., an MLP on pooled encoder states)"
  • Regression Language Model (RLM): A language model trained to perform regression by generating numeric outputs as text. "Regression Language Model (RLM)"
  • SentencePiece: A subword tokenization method that learns a vocabulary directly from raw text. "using SentencePiece (Kudo & Richardson, 2018) tokenization"
  • Spearman rank correlation: A nonparametric measure of rank correlation assessing monotonic relationships. "obtains > 0.9 Spearman-rank on competitive programming submissions from APPS"
  • Static analysis: Techniques for analyzing program properties without executing the code. "with varying names such as performance prediction and static analysis."
  • T5Gemma: A pretrained T5-family encoder-decoder model variant used to initialize the RLM. "initialized from T5Gemma"
  • TorchInductor: A PyTorch compiler backend that generates optimized code (e.g., Triton kernels) for accelerators. "produced by TorchInductor."
  • Triton: A language and compiler for writing efficient GPU kernels at a higher level of abstraction. "the latency of Triton GPU kernels"
  • Triton kernel: A GPU compute kernel authored/compiled via Triton, typically optimized for tensor operations. "We profile each Triton kernel's latency"
  • XLA: Accelerated Linear Algebra; a compiler for optimizing linear algebra computations used by ML frameworks. "such as XLA."
  • Zero-cost proxies: Cheap-to-compute indicators that correlate with final model performance, used to speed up NAS. "zero-cost proxies (Abdelfattah et al., 2021)"
  • Zero-shot: Evaluating on tasks or instances without seeing any task-specific training examples. "making it impossible to predict the memory zero-shot (i.e. new question, new submission)"