
Progress Ratio Embeddings (PRE)

Updated 14 December 2025
  • Progress Ratio Embeddings (PRE) are continuous, trigonometric embeddings that encode the progress ratio of generated text for precise length control in Transformer decoders.
  • PRE replace discrete countdown signals with a smooth impatience signal, improving stability and generalization across various output lengths in tasks like summarization and question generation.
  • By integrating seamlessly into existing architectures with minimal modifications, PRE maintain high output quality and low error rates even on out-of-distribution target lengths.

Progress Ratio Embeddings (PRE) are continuous, trigonometric embeddings designed to provide robust and generalizable length control for neural text generation models, specifically those employing Transformer-based architectures. PRE operate by introducing a smoothly varying impatience signal tied to a normalized progress ratio $r_t = t/\ell$ at each decoding step, where $t$ is the current token position and $\ell$ the user-specified target length. This approach replaces previous techniques relying on discrete countdown signals, offering improved stability, length fidelity, and generalization to unseen output lengths in sequence-to-sequence tasks such as abstractive summarization and question generation. PRE can be injected with minimal architectural modification and have demonstrated effective control over text length without degrading output quality under standard evaluation metrics (Botcazou et al., 7 Dec 2025).

1. Motivation and Definition

PRE address the problem of explicit length planning in neural sequence generation. Traditional autoregressive decoders for tasks like summarization, question generation, and dialog typically lack mechanisms to precisely satisfy a user-specified output length $\ell$, instead relying on stochastic EOS token prediction. Reverse Positional Embeddings (RPE) attempted to remedy this by injecting a fixed countdown signal $(\ell - t)$ at each decoding position, but exhibited poor generalization when the target length fell outside the training distribution. PRE propose a continuous signal: the progress ratio $r_t = t/\ell \in [0, 1]$. This ratio is used to generate a smoothly evolving impatience signal embedded into the decoder, indicating the fraction of output generated and promoting more reliable adherence to desired lengths.

2. Mathematical Formulation of PRE

For a decoding step $t$ (with $0 \leq t \leq \ell$), the PRE mechanism is instantiated as follows:

  • Progress ratio: $r_t = t/\ell$.
  • Decoder input embedding:

$$X_t = E_t + P_t + \xi(r_t)$$

Here, $E_t$ is the token embedding, $P_t$ is the standard positional embedding, and $\xi(r_t) \in \mathbb{R}^{d_\text{model}}$ denotes the PRE vector.

  • PRE vector construction: Defining $\omega_r = M \cdot r$ with $M = d_\text{model}/2$, for $j = 1, \dots, d_\text{model}$,

$$\xi(r)_j = \begin{cases} \cos\left(2 \omega_r \lfloor j/2 \rfloor / d_\text{model}\right), & \text{if } j \text{ is even} \\ \sin\left(2 \omega_r \lfloor j/2 \rfloor / d_\text{model}\right), & \text{if } j \text{ is odd} \end{cases}$$

Each consecutive (cos, sin) pair encodes a sinusoid whose frequency grows linearly in $r_t$, producing a dense, continuous signature of generation progress.
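
As a concrete illustration, the construction above can be sketched in a few lines of NumPy. The function name `pre_embedding` is not from the paper; the snippet simply follows the 1-based component index $j$ used in the formula.

```python
import numpy as np

def pre_embedding(r: float, d_model: int) -> np.ndarray:
    """Progress Ratio Embedding xi(r) for a progress ratio r = t / ell in [0, 1]."""
    omega_r = (d_model / 2.0) * r                      # omega_r = M * r with M = d_model / 2
    j = np.arange(1, d_model + 1)                      # 1-based component index
    phase = 2.0 * omega_r * np.floor(j / 2) / d_model  # argument of each sinusoid
    return np.where(j % 2 == 0, np.cos(phase), np.sin(phase))

# Example: embedding for the 10th token of a 40-token target (r = 0.25)
xi = pre_embedding(10 / 40, d_model=1024)
```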

3. Integration into Transformer Architectures

PRE are incorporated into standard encoder–decoder Transformer models by injecting $\xi(r_t)$ as part of the input embedding at every decoding step for every decoder layer. The core self-attention, cross-attention, feed-forward blocks, and output head remain unchanged. At inference, at each decoding step $t$, the model computes $r_t$, forms $\xi(r_t)$, and sums it with the existing embeddings before decoding the next token. Decoding continues until EOS is predicted or the ratio saturates at $1$, discouraging generation beyond the requested length.
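
The greedy-decoding loop below illustrates this integration under stated assumptions: `decoder_next_token`, `token_emb`, and `pos_emb` are hypothetical stand-ins for the model's decoder forward pass and embedding lookups, `slack` is an arbitrary allowance past the target length, and `pre_embedding` is the sketch from Section 2.

```python
import numpy as np  # pre_embedding: see the Section 2 sketch

def generate_with_pre(decoder_next_token, token_emb, pos_emb,
                      bos_id, eos_id, target_len, d_model, slack=10):
    """Greedy decoding with PRE added to every decoder input embedding (sketch)."""
    tokens, embeddings = [bos_id], []
    for t in range(target_len + slack):               # allow a little room past ell
        r_t = min(t / target_len, 1.0)                # progress ratio, saturates at 1
        x_t = token_emb(tokens[t]) + pos_emb(t) + pre_embedding(r_t, d_model)
        embeddings.append(x_t)
        next_id = decoder_next_token(np.stack(embeddings))  # hypothetical forward pass
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                                 # generated tokens, without BOS
```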

4. Training Objective and Ratio Noise Regularization

Models employing PRE are fine-tuned under teacher forcing to maximize conditional probabilities over reference sequences of target length $\ell$, using the cross-entropy objective:

$$\mathcal{L}_\mathcal{B}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{\ell_i} \log P_\theta\!\left(S^i_t \mid S^i_{<t}, A^i, \Xi^i_{\leq t}\right)$$

To promote smooth interpolation and prevent overfitting to discrete $r_t$ values, Gaussian noise is injected into each ratio before embedding:

$$r \leftarrow \mathrm{Clip}\!\left(r + \frac{2\delta}{d_\text{model}},\, 0,\, 1\right), \quad \delta \sim \mathcal{N}(0, 1)$$

This procedure exposes the model to a spectrum of $r$ values, enhancing generalization for arbitrary output lengths.
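
A minimal sketch of the ratio-noise step might look as follows; the name `noisy_ratio` is illustrative rather than taken from the paper.

```python
import numpy as np

def noisy_ratio(r: float, d_model: int, rng: np.random.Generator) -> float:
    """Gaussian ratio noise: r <- clip(r + 2*delta/d_model, 0, 1), delta ~ N(0, 1)."""
    delta = rng.standard_normal()
    return float(np.clip(r + 2.0 * delta / d_model, 0.0, 1.0))

# During teacher-forced training, perturb each position's ratio before embedding it
rng = np.random.default_rng(0)
ratios = [noisy_ratio(t / 40, d_model=1024, rng=rng) for t in range(41)]
```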

5. Comparative Analysis: PRE vs Reverse Positional Embeddings

RPE encode the countdown as:

$$\mathrm{RPE}(i, 2k) = \sin\!\left(\frac{\ell - i}{10000^{2k/d_\text{model}}}\right), \quad \mathrm{RPE}(i, 2k+1) = \cos\!\left(\frac{\ell - i}{10000^{2k/d_\text{model}}}\right)$$

This discrete representation leads to instability for out-of-training-distribution length requests: mean absolute error (MAE) spikes, and the number of large-error outliers rises significantly. In contrast, PRE's continuous embedding avoids discretization artifacts, complies with the Nyquist–Shannon criterion ($F_s = d_\text{model}/2 \geq 2F_\text{max}$), and maintains stable behavior for all $\ell$ within model capacity.

| Approach | Embedding structure | Generalization (out-of-distribution $\ell$) |
|----------|---------------------|---------------------------------------------|
| RPE | Discrete countdown | Poor (outliers, MAE spikes) |
| PRE | Continuous impatience signal | Robust (low error, few outliers) |

A plausible implication is that PRE’s mathematical structure inherently supports interpolation and generalization across arbitrary lengths, whereas RPE is constrained by the granularity of its countdown basis.
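
For concreteness, the RPE countdown encoding defined above can be sketched as follows (the name `rpe_embedding` is illustrative, and an even $d_\text{model}$ is assumed); note that its argument depends on the absolute countdown $\ell - i$, whereas `pre_embedding` in Section 2 depends only on the bounded ratio $r$.

```python
import numpy as np

def rpe_embedding(i: int, ell: int, d_model: int) -> np.ndarray:
    """Reverse Positional Embedding: sinusoidal encoding of the countdown (ell - i)."""
    k = np.arange(d_model // 2)
    angle = (ell - i) / np.power(10000.0, 2.0 * k / d_model)
    emb = np.empty(d_model)
    emb[0::2] = np.sin(angle)   # dimensions 2k: sine of the countdown
    emb[1::2] = np.cos(angle)   # dimensions 2k + 1: cosine of the countdown
    return emb
```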

6. Empirical Validation and Results

Experiments on BART-L (400M parameters, $d_\text{model}=1024$) and T5-Large (770M parameters, $d_\text{model}=512$) were conducted on CNN/DailyMail and XSum summarization, as well as on SQuAD question generation.

  • Length Fidelity (MAE ± SD):
    • CNN/DM: No-control 19.2±17; RPE 1.6±3.6; PRE 0.5±0.3.
    • XSum: No-control 5.8±5; RPE 0.7±1.1; PRE 0.1±0.2.
  • Content Quality (ROUGE/BERTScore):
    • CNN/DM: PRE 45.3/21.9/42.2/69.8 vs RPE 44.5/21.2/41.3/69.4.
    • XSum: PRE 45.2/21.3/36.4/72.7 vs RPE 44.5/20.8/35.6/72.3.
  • Out-of-Distribution Target Lengths:
    • For $\ell > 300$ on CNN/DM, the RPE outlier rate (>20-token error) exceeds 50%, while PRE remains below 10% for $\ell$ up to 1000.
  • SQuAD Question Generation:
    • MAE: PRE 0.0±0.1; RPE 0.8±3.6; baseline 3.12±3.3.

Ablation shows that Gaussian ratio noise is essential for smooth interpolation. The reported statistical significance of PRE's MAE improvement over baselines is $p \ll 10^{-30}$.

7. Limitations and Prospective Developments

Current PRE research targets encoder–decoder architectures exclusively; application to large decoder-only LLMs remains an open question, and their efficacy beyond summarization and question generation, for tasks such as dialog or code synthesis, is untested. Integrating PRE into chain-of-thought reasoning to control inference depth may reduce hallucinations and computational cost. This suggests potential extensions into reasoning-intensive generation domains, contingent on future empirical validation.

In summary, Progress Ratio Embeddings (PRE) constitute a continuous, trigonometric impatience signal for robust sequence length control, generalizing across broad length distributions while preserving or enhancing text generation metrics and requiring minimal architectural modification (Botcazou et al., 7 Dec 2025).
