
Progress Ratio Embeddings (PRE)

Updated 14 December 2025
  • Progress Ratio Embeddings (PRE) are continuous, trigonometric embeddings that encode the progress ratio of generated text for precise length control in Transformer decoders.
  • PRE replace discrete countdown signals with a smooth impatience signal, improving stability and generalization across various output lengths in tasks like summarization and question generation.
  • By integrating seamlessly into existing architectures with minimal modifications, PRE maintain high output quality and low error rates even on out-of-distribution target lengths.

Progress Ratio Embeddings (PRE) are continuous, trigonometric embeddings designed to provide robust and generalizable length control for neural text generation models, specifically those employing Transformer-based architectures. PRE operate by introducing a smoothly varying impatience signal tied to a normalized progress ratio $r_t = t/\ell$ at each decoding step, where $t$ is the current token position and $\ell$ the user-specified target length. This approach replaces previous techniques relying on discrete countdown signals, offering improved stability, length fidelity, and generalization to unseen output lengths in sequence-to-sequence tasks such as abstractive summarization and question generation. PRE can be injected with minimal architectural modification and have demonstrated effective control over text length without degrading output quality under standard evaluation metrics (Botcazou et al., 7 Dec 2025).

1. Motivation and Definition

PRE address the problem of explicit length planning in neural sequence generation. Traditional autoregressive decoders for tasks like summarization, question generation, and dialog typically lack mechanisms to precisely satisfy a user-specified output length $\ell$, instead relying on stochastic EOS token prediction. Reverse Positional Embeddings (RPE) attempted to remedy this by injecting a fixed countdown signal $(\ell - t)$ at each decoding position, but exhibited poor generalization when the target length fell outside the training distribution. PRE propose a continuous signal: the progress ratio $r_t = t/\ell \in [0, 1]$. This ratio is used to generate a smoothly evolving impatience signal embedded into the decoder, indicating the fraction of output generated and promoting more reliable adherence to desired lengths.

2. Mathematical Formulation of PRE

For a decoding step $t$ (with $0 \leq t \leq \ell$), the PRE mechanism is instantiated as follows:

  • Progress ratio: $r_t = t/\ell$.
  • Decoder input embedding:

$$X_t = E_t + P_t + \xi(r_t)$$

Here, $E_t$ is the token embedding, $P_t$ is the standard positional embedding, and $\xi(r_t) \in \mathbb{R}^{d_\text{model}}$ denotes the PRE vector.

  • PRE vector construction: Defining $\omega_r = M \cdot r$ with $M = d_\text{model}/2$, for $j = 1, \dots, d_\text{model}$,

$$\xi(r)_j = \begin{cases} \cos\left(2 \omega_r \lfloor j/2 \rfloor / d_\text{model}\right), & \text{if } j \text{ is even} \\ \sin\left(2 \omega_r \lfloor j/2 \rfloor / d_\text{model}\right), & \text{if } j \text{ is odd} \end{cases}$$

Each consecutive (cos, sin) pair encodes a sinusoid whose frequency grows linearly in $r_t$, producing a dense, continuous signature of generation progress.
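
As a concrete illustration, the construction above can be sketched in a few lines of NumPy. The function name `pre_embedding` is not from the paper; the snippet simply follows the 1-based component index $j$ used in the formula.

```python
import numpy as np

def pre_embedding(r: float, d_model: int) -> np.ndarray:
    """Progress Ratio Embedding xi(r) for a progress ratio r = t / ell in [0, 1]."""
    omega_r = (d_model / 2.0) * r                      # omega_r = M * r with M = d_model / 2
    j = np.arange(1, d_model + 1)                      # 1-based component index
    phase = 2.0 * omega_r * np.floor(j / 2) / d_model  # argument of each sinusoid
    return np.where(j % 2 == 0, np.cos(phase), np.sin(phase))

# Example: embedding for the 10th token of a 40-token target (r = 0.25)
xi = pre_embedding(10 / 40, d_model=1024)
```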

3. Integration into Transformer Architectures

PRE are incorporated into standard encoder–decoder Transformer models by injecting $\xi(r_t)$ as part of the input embedding at every decoding step for every decoder layer. The core self-attention, cross-attention, feed-forward blocks, and output head remain unchanged. At inference, at each decoding step $t$, the model computes $r_t$, forms $\xi(r_t)$, and sums it with the existing embeddings before decoding the next token. Decoding continues until EOS is predicted or the ratio saturates at $1$, discouraging generation beyond the requested length.
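
The greedy-decoding loop below illustrates this integration under stated assumptions: `decoder_next_token`, `token_emb`, and `pos_emb` are hypothetical stand-ins for the model's decoder forward pass and embedding lookups, `slack` is an arbitrary allowance past the target length, and `pre_embedding` is the sketch from Section 2.

```python
import numpy as np  # pre_embedding: see the Section 2 sketch

def generate_with_pre(decoder_next_token, token_emb, pos_emb,
                      bos_id, eos_id, target_len, d_model, slack=10):
    """Greedy decoding with PRE added to every decoder input embedding (sketch)."""
    tokens, embeddings = [bos_id], []
    for t in range(target_len + slack):               # allow a little room past ell
        r_t = min(t / target_len, 1.0)                # progress ratio, saturates at 1
        x_t = token_emb(tokens[t]) + pos_emb(t) + pre_embedding(r_t, d_model)
        embeddings.append(x_t)
        next_id = decoder_next_token(np.stack(embeddings))  # hypothetical forward pass
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                                 # generated tokens, without BOS
```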

4. Training Objective and Ratio Noise Regularization

Models employing PRE are fine-tuned under teacher forcing to maximize conditional probabilities over reference sequences of target length $\ell$, using the cross-entropy objective:

$$\mathcal{L}_\mathcal{B}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{\ell_i} \log P_\theta\!\left(S^i_t \mid S^i_{<t}, A^i, \Xi^i_{\leq t}\right)$$

To promote smooth interpolation and prevent overfitting to discrete $r_t$ values, Gaussian noise is injected into each ratio before embedding:

$$r \leftarrow \mathrm{Clip}\!\left(r + \frac{2\delta}{d_\text{model}},\, 0,\, 1\right), \quad \delta \sim \mathcal{N}(0, 1)$$

This procedure exposes the model to a spectrum of $r$ values, enhancing generalization for arbitrary output lengths.
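
A minimal sketch of the ratio-noise step might look as follows; the name `noisy_ratio` is illustrative rather than taken from the paper.

```python
import numpy as np

def noisy_ratio(r: float, d_model: int, rng: np.random.Generator) -> float:
    """Gaussian ratio noise: r <- clip(r + 2*delta/d_model, 0, 1), delta ~ N(0, 1)."""
    delta = rng.standard_normal()
    return float(np.clip(r + 2.0 * delta / d_model, 0.0, 1.0))

# During teacher-forced training, perturb each position's ratio before embedding it
rng = np.random.default_rng(0)
ratios = [noisy_ratio(t / 40, d_model=1024, rng=rng) for t in range(41)]
```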

5. Comparative Analysis: PRE vs Reverse Positional Embeddings

RPE encode the countdown as:

$$\mathrm{RPE}(i, 2k) = \sin\!\left(\frac{\ell - i}{10000^{2k/d_\text{model}}}\right), \quad \mathrm{RPE}(i, 2k+1) = \cos\!\left(\frac{\ell - i}{10000^{2k/d_\text{model}}}\right)$$

This discrete representation leads to instability for out-of-training-distribution length requests: mean absolute error (MAE) spikes, and the number of large-error outliers rises significantly. In contrast, PRE's continuous embedding avoids discretization artifacts, complies with the Nyquist–Shannon criterion ($F_s = d_\text{model}/2 \geq 2F_\text{max}$), and maintains stable behavior for all $\ell$ within model capacity.

| Approach | Embedding structure | Generalization (out-of-distribution $\ell$) |
|----------|---------------------|---------------------------------------------|
| RPE | Discrete countdown | Poor (outliers, MAE spikes) |
| PRE | Continuous impatience signal | Robust (low error, few outliers) |

A plausible implication is that PRE’s mathematical structure inherently supports interpolation and generalization across arbitrary lengths, whereas RPE is constrained by the granularity of its countdown basis.
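
For concreteness, the RPE countdown encoding defined above can be sketched as follows (the name `rpe_embedding` is illustrative, and an even $d_\text{model}$ is assumed); note that its argument depends on the absolute countdown $\ell - i$, whereas `pre_embedding` in Section 2 depends only on the bounded ratio $r$.

```python
import numpy as np

def rpe_embedding(i: int, ell: int, d_model: int) -> np.ndarray:
    """Reverse Positional Embedding: sinusoidal encoding of the countdown (ell - i)."""
    k = np.arange(d_model // 2)
    angle = (ell - i) / np.power(10000.0, 2.0 * k / d_model)
    emb = np.empty(d_model)
    emb[0::2] = np.sin(angle)   # dimensions 2k: sine of the countdown
    emb[1::2] = np.cos(angle)   # dimensions 2k + 1: cosine of the countdown
    return emb
```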

6. Empirical Validation and Results

Experiments on BART-L (400M parameters, $d_\text{model}=1024$) and T5-Large (770M parameters, $d_\text{model}=512$) were conducted on CNN/DailyMail and XSum summarization, as well as on SQuAD question generation.

  • Length Fidelity (MAE ± SD):
    • CNN/DM: No-control 19.2±17; RPE 1.6±3.6; PRE 0.5±0.3.
    • XSum: No-control 5.8±5; RPE 0.7±1.1; PRE 0.1±0.2.
  • Content Quality (ROUGE/BERTScore):
    • CNN/DM: PRE 45.3/21.9/42.2/69.8 vs RPE 44.5/21.2/41.3/69.4.
    • XSum: PRE 45.2/21.3/36.4/72.7 vs RPE 44.5/20.8/35.6/72.3.
  • Out-of-Distribution Target Lengths:
    • For $\ell > 300$ on CNN/DM, the RPE outlier rate (>20-token error) exceeds 50%, while PRE remains below 10% for $\ell$ up to 1000.
  • SQuAD Question Generation:
    • MAE: PRE 0.0±0.1; RPE 0.8±3.6; baseline 3.12±3.3.

Ablation shows that Gaussian ratio noise is essential for smooth interpolation. The reported statistical significance of PRE's MAE improvement over baselines is $p \ll 10^{-30}$.

7. Limitations and Prospective Developments

Current PRE research targets encoder–decoder architectures exclusively; application to large decoder-only LLMs remains an open question, and their efficacy beyond summarization and question generation, for tasks such as dialog or code synthesis, is untested. Integrating PRE into chain-of-thought reasoning to control inference depth may reduce hallucinations and computational cost. This suggests potential extensions into reasoning-intensive generation domains, contingent on future empirical validation.

In summary, Progress Ratio Embeddings (PRE) constitute a continuous, trigonometric impatience signal for robust sequence length control, generalizing across broad length distributions while preserving or enhancing text generation metrics and requiring minimal architectural modification (Botcazou et al., 7 Dec 2025).
