Response Length Predictor (RLP)

Updated 29 December 2025
  • RLP is a predictive model that estimates the total or remaining output tokens of an LLM based on prompts and partial outputs.
  • It employs various architectures—including encoder-based regression, token-classification, and statistical distributional models—to balance accuracy and inference speed.
  • Accurate token count predictions drive efficient LLM scheduling and batching, reducing latency and improving resource allocation in high-demand scenarios.

A Response Length Predictor (RLP) is a predictive model designed to estimate the number of output tokens that an LLM will generate for a given prompt, often conditioning on the prompt, model configuration, and partial generations. RLPs are critical system modules in LLM serving and inference frameworks, where response length estimates drive scheduling, batching, and resource allocation strategies under strict performance constraints or for cost efficiency. Their construction, deployment, and evaluation draw on methodologies spanning regression, classification, probabilistic modeling, and resource-aware optimization.

1. Formal Task Definition and Problem Scope

The core function of the Response Length Predictor is to estimate, given a prompt $p$ (and optionally the partial output $s_t$ at decoding step $t$), the eventual or remaining number of output tokens $L$ that an auto-regressive LLM will produce. There are multiple formalizations depending on the setting.

Formally, for a prompt $p$ and (optionally) a partially decoded output $s_t$, an RLP implements

$f_\theta(p, s_t) \approx \mathbb{E}[R_t^*],$

where $R_t^*$ is the true number of remaining tokens beyond step $t$. In one-shot settings,

$f_\theta(p) \approx \mathbb{E}[L],$

with $L$ the total eventual response length.
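
These two formalizations correspond to two prediction entry points. The following is a minimal interface sketch; the class and method names are illustrative and not taken from any of the cited systems:

```python
from typing import Protocol

class ResponseLengthPredictor(Protocol):
    """Interface sketch for an RLP (illustrative names).

    One-shot mode estimates the total response length L from the prompt
    alone; iterative mode estimates the remaining tokens R_t given the
    prompt plus the partial output at decoding step t.
    """

    def predict_total(self, prompt: str) -> float:
        ...

    def predict_remaining(self, prompt: str, partial_output: str) -> float:
        ...
```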

2. Model Architectures and Input Representations

RLP architectures fall into two primary classes:

a) Encoder-based Regression Models

  • ELIS/ISRTF RLP: Uses the pre-trained BGE encoder to process the concatenated prompt $p$ and the latest generation window $s_t$ into token embeddings $E \in \mathbb{R}^{L \times d}$. Mean pooling ($h^0$) produces a fixed-dimensional vector, followed by an 8-layer MLP predicting the remaining token count $\hat{R}$. Encoder parameters are frozen; only the MLP head is trained, minimizing inference overhead (Choi et al., 14 May 2025).
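
A minimal sketch of this design, assuming the encoder is loaded through Hugging Face transformers; the checkpoint name, hidden width, and layer count below are illustrative, not the published ELIS configuration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EncoderRegressionRLP(nn.Module):
    """Frozen text encoder + masked mean pooling + MLP regression head
    predicting the number of remaining output tokens (illustrative sketch)."""

    def __init__(self, encoder_name="BAAI/bge-base-en-v1.5", hidden=256, mlp_layers=8):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        for p in self.encoder.parameters():        # freeze the backbone
            p.requires_grad = False
        d = self.encoder.config.hidden_size
        dims = [d] + [hidden] * (mlp_layers - 1)
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        layers.append(nn.Linear(dims[-1], 1))       # scalar: remaining tokens
        self.head = nn.Sequential(*layers)

    def forward(self, prompts, partial_outputs):
        # concatenate prompt and latest generation window, as in the ELIS setup
        texts = [p + " " + s for p, s in zip(prompts, partial_outputs)]
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():                       # encoder stays frozen
            hidden_states = self.encoder(**batch).last_hidden_state   # (B, L, d)
        mask = batch["attention_mask"].unsqueeze(-1)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1)          # masked mean pooling
        return self.head(pooled).squeeze(-1)        # predicted remaining tokens
```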

b) Token-Classification/Length-Binned Models

  • TimeBill RLP: Employs a small auto-regressive transformer (SLM) to encode the prompt, with a classification head that predicts the length bin (bucketed by size $B$; best accuracy at $B=16$) (Fan et al., 26 Dec 2025).
  • Vicuna-7B RLP (Sequence Scheduling): Places a small head (e.g., mean pooling + bin classifier) atop a LLaMA-based model, instruction-tuned for length estimation. Only lightweight adapters are trained (LoRA on Q/K) (Zheng et al., 2023).
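
A hedged sketch of a length-bin classification head in the spirit of these systems; the bucket size, maximum length, and all names are illustrative assumptions rather than the published configurations:

```python
import torch
import torch.nn as nn

class LengthBinHead(nn.Module):
    """Classification head mapping a pooled prompt representation to a length bin."""

    def __init__(self, hidden_size, bucket_size=16, max_tokens=8192):
        super().__init__()
        self.bucket_size = bucket_size
        self.num_bins = max_tokens // bucket_size          # e.g., 512 bins at B=16
        self.classifier = nn.Linear(hidden_size, self.num_bins)

    def length_to_bin(self, lengths):
        # map true token counts to bin indices for the cross-entropy target
        return torch.clamp(lengths // self.bucket_size, max=self.num_bins - 1)

    def forward(self, pooled):                              # pooled: (B, hidden_size)
        return self.classifier(pooled)                      # logits over length bins

    def predict_length(self, pooled):
        bins = self.forward(pooled).argmax(dim=-1)
        # report the bin midpoint as the point estimate of the output length
        return bins * self.bucket_size + self.bucket_size // 2
```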

c) Statistical/Distributional Models

  • CASTILLO Blueprint: Models the length distribution for each ⟨prompt, model⟩ pair as Gaussian, mixture, or empirically via ECDF, using features such as prompt embeddings, model ID, and decoding configuration (Perez-Ramirez et al., 22 May 2025).
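
An illustrative sketch of this labeling idea: computing per-⟨prompt, model⟩ summary statistics and an empirical CDF from repeated samples. Field names and the example data are assumptions, not the CASTILLO schema:

```python
import numpy as np

def summarize_lengths(sampled_lengths):
    """Per-<prompt, model> length statistics from repeated sampling (illustrative)."""
    x = np.asarray(sampled_lengths, dtype=float)
    return {
        "mean": x.mean(),
        "std": x.std(ddof=1),
        "quantiles": np.percentile(x, [10, 50, 90]),
        # empirical CDF: P(L <= t) for any token-count threshold t
        "ecdf": lambda t, xs=np.sort(x): np.searchsorted(xs, t, side="right") / len(xs),
    }

stats = summarize_lengths([412, 388, 455, 401, 520, 397, 430, 409])
print(stats["mean"], stats["quantiles"], stats["ecdf"](450))
```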

The following summarizes representative model architectures:

| System | Encoder | Output Head | Prediction Type |
| --- | --- | --- | --- |
| ELIS | BGE Transformer | 8-layer MLP | Regression (tokens remaining) |
| TimeBill | Auto-regressive SLM | Linear classifier | Fine-grained bucketed class |
| Vicuna-7B | LLaMA/Vicuna-7B | Pool + bin classifier | Length bin (50 words/bin) |
| CASTILLO | DistilBERT/MLP | Regressor/Distribution | μ, σ, distribution |

3. Datasets, Labeling, and Training Procedures

RLP development depends on the availability of large-scale, diverse corpora of prompt-response pairs. Approaches include:

  • Sampling multiple completions per prompt under fixed generation settings, then computing statistical summaries (mean, std, quantiles) for label construction (Perez-Ramirez et al., 22 May 2025).
  • Deterministic single-completion datasets (e.g., vLLM runs of 13 LLMs on 11,000 prompts, totaling 143,000 samples in ELIS) (Choi et al., 14 May 2025).
  • Bucketed labeling by grouping observed lengths into bins (e.g., in TimeBill and LLM-Empowered pipelines) (Fan et al., 26 Dec 2025, Zheng et al., 2023).
  • Iterative slicing: dividing full generations into windows of fixed size ($K=50$) for per-step regression targets (remaining length after each chunk) (Choi et al., 14 May 2025).
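
The iterative-slicing step can be sketched as follows; the function and field names are illustrative, not taken from the cited implementation:

```python
def sliced_training_targets(prompt_tokens, response_tokens, window=50):
    """Turn one (prompt, response) pair into per-step regression examples:
    after each fixed-size chunk of the response, the target is the number
    of tokens still to be generated (illustrative sketch)."""
    examples = []
    total = len(response_tokens)
    for t in range(0, total, window):
        partial = response_tokens[:t]          # output revealed so far
        remaining = total - t                  # regression target R_t
        examples.append({
            "prompt": prompt_tokens,
            "partial_output": partial,
            "remaining_tokens": remaining,
        })
    return examples
```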

Training regimes typically employ:

  • Losses: Mean squared error (MSE) for regression, cross-entropy for classification, or negative log-likelihood for full-distribution predictions.
  • Splitting: Conventional 60/20/20 or 70/20/10 train/val/test splits, with stratification where appropriate for prompt/task type (Perez-Ramirez et al., 22 May 2025, Choi et al., 14 May 2025).
  • Parameter freezing: In resource-constrained/inference-optimized settings, only top-level (MLP or classifier) parameters are updated, freezing the heavy encoder backbone (Choi et al., 14 May 2025).
  • Label computation scripts and pseudocode are published, e.g., CASTILLO’s Algorithm 1 for per-prompt statistics (Perez-Ramirez et al., 22 May 2025).
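
A minimal training-step sketch consistent with these choices, assuming the EncoderRegressionRLP module from the sketch in Section 2 and a hypothetical `train_loader` that yields dicts of prompts, partial outputs, and remaining-token targets:

```python
import torch

# Optimize only the MLP head; the frozen encoder provides features but
# receives no gradient updates. EncoderRegressionRLP and train_loader are
# assumptions carried over from the earlier illustrative sketch.
model = EncoderRegressionRLP()
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

for batch in train_loader:
    preds = model(batch["prompt"], batch["partial_output"])
    loss = loss_fn(preds, batch["remaining_tokens"].float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```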

4. Evaluation Metrics and Empirical Performance

RLPs are evaluated using metrics standard to regression and classification:

  • Mean Absolute Error (MAE): Average absolute difference between predicted and true lengths.
  • Root Mean Squared Error (RMSE): Euclidean error in token (or word) counts.
  • $R^2$ score: Proportion of variance explained.
  • Calibration and distributional metrics: KL divergence and Earth Mover’s Distance when full distributions are predicted (Perez-Ramirez et al., 22 May 2025).
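
The point-prediction metrics above can be computed directly; a straightforward NumPy implementation:

```python
import numpy as np

def rlp_metrics(y_true, y_pred):
    """MAE, RMSE, and R^2 for RLP point predictions (token counts)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}
```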

Key results include:

| RLP Approach (Dataset/Setting) | MAE ↓ | RMSE ↓ | $R^2$ ↑ |
| --- | --- | --- | --- |
| ELIS/ISRTF RLP (test, vLLM) | 19.9 | 34.3 | 0.852 |
| ELIS/ISRTF RLP (LMSYS, OOD) | 71.5 | 101.3 | 0.480 |
| TimeBill (512 buckets) | 42.7 | 78.1 | 0.723 |
| BERT-based ProxyModel (5-way) | 105.7 | 136.8 | 0.152 |
| DistilBERT-based S$^3$ (10-way) | 109.0 | 148.9 | −0.004 |
| CASTILLO Statistical Baseline | task-dependent; see (Perez-Ramirez et al., 22 May 2025) | | |

Notably, iterative refinement (feeding partial generations back into the model) improves accuracy as more of the output is revealed (Choi et al., 14 May 2025). Finer bucketing also improves class-based methods, up to hundreds of bins; in TimeBill, a bucket size of $B=16$ with 512 bins offers the best trade-off (Fan et al., 26 Dec 2025).

5. Integration with LLM Serving and Scheduling

RLP outputs fundamentally drive system-level LLM optimizations:

  • ISRTF Scheduling (ELIS): RLP provides per-job remaining-length estimates at each decoding window, allowing the ISRTF (Iterative Shortest Remaining-Time First) scheduler to prioritize jobs with the fewest predicted remaining tokens, minimizing head-of-line blocking and reducing average job completion time by up to 19.6% with under 0.2% latency overhead (Choi et al., 14 May 2025); a minimal ordering sketch follows this list.
  • Sequence Scheduling with Batch Grouping: Estimated response lengths are used to bin and sort incoming requests, creating micro-batches of similar lengths for efficient computation. Variable batch sizes and a failure collection/recompute scheme handle inaccurate predictions, boosting throughput by up to 86% on A100 hardware (Zheng et al., 2023).
  • Time-Budgeted Decoding (TimeBill): RLP predicts total output length for a prompt pre-generation, enabling downstream time estimators to choose optimal key-value cache eviction ratios, thus satisfying given time budgets in latency-sensitive operational settings (Fan et al., 26 Dec 2025).
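
The core ordering idea behind ISRTF-style scheduling, reduced to a sketch; the job and predictor interfaces here are hypothetical, and a real serving stack would re-predict and re-order at each decoding window:

```python
import heapq

def srtf_order(jobs, predict_remaining):
    """Order pending jobs by predicted remaining output tokens, smallest first.
    `jobs` is any iterable of request objects and `predict_remaining` is a
    callable wrapping the RLP (both hypothetical interfaces)."""
    heap = [(predict_remaining(job), i, job) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, job = heapq.heappop(heap)   # shortest predicted remaining time first
        order.append(job)
    return order
```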

These applications rely on maintaining low RLP inference overhead (e.g., ~11 ms for BGE-based encoders (Choi et al., 14 May 2025)) and tight integration with GPU scheduling stacks or cloud-native schedulers (e.g., Kubernetes).

6. Statistical Characterization and Modeling Considerations

Accurate RLPs must contend with substantial inter- and intra-model response length variability, nontrivial dependence on prompt semantics, and occasional partial degenerations during text generation (Perez-Ramirez et al., 22 May 2025):

  • Inter-Model Variability: Different LLMs produce widely varying mean lengths for identical prompts, with disparity frequently exceeding hundreds of tokens.
  • Intra-Model Variability: For a fixed ⟨prompt, model⟩ pair, the standard deviation is routinely 20–100 tokens (up to 45% of the mean), necessitating distributional (not just point) prediction.
  • Prompt and decoding configuration dependency: Length is sensitive to prompt structure, dataset origin, and decoding hyper-parameters (temperature $T$, top-$k$, top-$p$).
  • Degeneration events: CASTILLO identifies and isolates cases where partial outputs saturate max-length constraints or where sample-to-sample variability is extreme.

Modeling approaches range from linear regression and quantile regression to count models (Poisson and negative-binomial) and mixture or nonparametric methods. Distributional outputs (e.g., predicting full empirical CDF) are increasingly necessary for robust system-level decision making (Perez-Ramirez et al., 22 May 2025).
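
As a minimal example of one such count model, the following sketch fits a negative-binomial length distribution by moment matching; it is illustrative only, not a method from the cited papers, and assumes the sample variance exceeds the mean (overdispersion):

```python
import numpy as np
from scipy import stats

def fit_negative_binomial(lengths):
    """Moment-matched negative-binomial fit to sampled response lengths."""
    x = np.asarray(lengths, float)
    mu, var = x.mean(), x.var(ddof=1)
    p = mu / var                      # success probability (requires var > mu)
    r = mu * p / (1.0 - p)            # dispersion parameter
    return stats.nbinom(r, p)

# hypothetical per-prompt length samples
dist = fit_negative_binomial([412, 388, 455, 401, 520, 397, 430, 409])
print(dist.mean(), dist.ppf(0.9))     # mean and 90th-percentile length
```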

7. System Implications, Limitations, and Prospective Directions

Response Length Predictors are purely software-level modules that reorder or batch LLM requests without modifying the LLM’s internal attention kernels or weights, rendering them orthogonal to lower-level optimizations such as quantization or custom attention mechanisms (Zheng et al., 2023). Notably, the gains from RLP-driven scheduling are additive to those from such methods.

Current best practices recommend:

  • Iterative refinement where possible, for maximal prediction sharpness.
  • Fine-grained binning or regression heads tailored to the target LLM and dataset.
  • Architecture choices (frozen versus fine-tuned encoders) guided by operational latency constraints.
  • Distributional prediction if full risk-aware scheduling is required.

Open research directions include robust handling of degeneration cases, adaptation to non-stationary or adversarial prompt distributions, and joint RLP-ETE optimization for closed-loop system constraints (Fan et al., 26 Dec 2025, Perez-Ramirez et al., 22 May 2025). CASTILLO's empirical benchmarking framework provides standardized procedures and metric reporting for future RLP advances.


For in-depth methodology and implementation specifics, see (Choi et al., 14 May 2025, Fan et al., 26 Dec 2025, Perez-Ramirez et al., 22 May 2025), and (Zheng et al., 2023).
