
Reward-Augmented Maximum Likelihood (RAML)

Updated 25 February 2026
  • RAML is a structured prediction paradigm that integrates task-specific rewards into likelihood estimation via a temperature-controlled exponential weighting scheme.
  • It generalizes maximum likelihood by distributing probability mass over high-reward alternatives, thereby smoothing the traditional point-mass target.
  • RAML has shown empirical success in tasks like machine translation and speech recognition, motivating extensions such as VAML and entropy-regularized methods.

Reward-Augmented Maximum Likelihood (RAML) is a learning paradigm for structured prediction in which task-specific rewards are directly integrated into the statistical estimation process. RAML generalizes maximum likelihood estimation by defining a reward-smoothed target distribution over outputs, encouraging the model to place probability mass not only on the ground-truth label but also on other high-reward candidates. The technique is computationally efficient and empirically effective across structured prediction tasks, including neural sequence generation, where the end goal is to maximize a task-defined reward rather than the mere likelihood of reference structures. Theoretical advances have linked RAML to softmax Q-distribution estimation and Bayesian decision theory, providing foundations for its empirical success and connections to entropy-regularized reinforcement learning (Norouzi et al., 2016, Ma et al., 2017, Dai et al., 2018).

1. RAML Objective and Formal Definition

Standard maximum likelihood (ML) for structured prediction minimizes

$$\mathcal{L}_{\rm ML}(\theta) = -\sum_{i=1}^n \log P_\theta(y_i \mid x_i) = \sum_{i=1}^n \mathrm{KL}\big(\tilde P(\cdot \mid x_i) \,\|\, P_\theta(\cdot \mid x_i)\big),$$

where $\tilde P$ is the empirical distribution concentrated on $y_i$. RAML, in contrast, disperses this mass according to the task reward $r(y, y^*)$, defining the exponentiated payoff distribution

$$q_\tau(y \mid y^*) = \frac{\exp\big(r(y, y^*)/\tau\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(r(y', y^*)/\tau\big)}.$$

The RAML loss is then

$$\mathcal{L}_{\rm RAML}(\theta) = -\sum_{i=1}^n \sum_{y \in \mathcal{Y}} q_\tau(y \mid y_i) \log P_\theta(y \mid x_i) = \sum_{i=1}^n \mathrm{KL}\big(q_\tau(\cdot \mid y_i) \,\|\, P_\theta(\cdot \mid x_i)\big).$$

Here, $\tau > 0$ is a temperature parameter governing the spread of mass from $y^*$ to other outputs according to their reward. As $\tau \to 0$, $q_\tau$ collapses to a point mass on $y^*$, recovering classical ML; larger $\tau$ distributes probability more widely across high-reward alternatives (Norouzi et al., 2016, Ma et al., 2017, Dai et al., 2018).
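As a concrete illustration, the exponentiated payoff distribution $q_\tau$ can be computed directly for a small finite candidate set. The candidate set and reward values below are toy assumptions (a negative-edit-distance-style reward), not taken from the cited papers; the sketch only shows the temperature behavior described above.

```python
import math

def payoff_distribution(rewards, tau):
    """Exponentiated payoff distribution q_tau(. | y*): a softmax over
    the task rewards r(y, y*) of a finite candidate set, at temperature tau."""
    logits = [r / tau for r in rewards]
    m = max(logits)  # subtract the max before exponentiating, for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy candidate set: index 0 is the ground truth y* (maximal reward 0
# under a negative-edit-distance reward); the rest are perturbations.
rewards = [0.0, -1.0, -2.0]

q_warm = payoff_distribution(rewards, tau=1.0)   # mass spreads to alternatives
q_cold = payoff_distribution(rewards, tau=0.01)  # collapses toward y* (the ML target)
```

At $\tau = 1$ the alternatives retain noticeable mass, while at $\tau = 0.01$ essentially all mass sits on the ground truth, matching the $\tau \to 0$ limit stated above.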

2. Theoretical Foundations: Bayes Decision Theory and Softmax Q-Distributions

Bayes-optimal structured prediction seeks a mapping $h^*(x)$ that maximizes the expected reward:

$$h^*(x) = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{P(Y \mid X = x)}\big[r(y, Y)\big].$$

The softmax Q-distribution provides a smooth approximation to the Bayes decision boundary:

$$Q_\tau(Y = y \mid X = x) = \frac{\exp\big(\mathbb{E}_{P(Y \mid X = x)}[r(y, Y)]/\tau\big)}{\sum_{y'} \exp\big(\mathbb{E}_{P(Y \mid X = x)}[r(y', Y)]/\tau\big)}.$$

Decoding from $Q_\tau(y \mid x)$ recovers the Bayes decision rule for the supervised reward maximization problem. RAML is equivalent to maximizing the likelihood of samples from an empirical softmax $Q$-distribution, which smooths out label noise and bridges the gap between ML and Bayes-optimal risk minimization (Ma et al., 2017).
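The softmax Q-distribution can likewise be sketched for a finite label space. The 0/1 equality reward and the toy conditional $P(Y \mid x)$ below are illustrative assumptions; the point is only that the mode of $Q_\tau$ coincides with the Bayes decision rule $h^*(x)$.

```python
import math

def softmax_q(candidates, label_probs, reward, tau):
    """Softmax Q-distribution: a softmax, at temperature tau, over the
    expected rewards E_{Y ~ P(Y|x)}[r(y, Y)] for each candidate y."""
    expected = [sum(p * reward(y, Y) for Y, p in label_probs.items())
                for y in candidates]
    m = max(expected)  # stabilize the exponentials
    w = [math.exp((e - m) / tau) for e in expected]
    z = sum(w)
    return [x / z for x in w]

# Toy setup: three labels, a noisy conditional P(Y | x), 0/1 equality reward.
labels = ["a", "b", "c"]
p_y_given_x = {"a": 0.6, "b": 0.3, "c": 0.1}
eq_reward = lambda y, Y: 1.0 if y == Y else 0.0

q = softmax_q(labels, p_y_given_x, eq_reward, tau=0.1)
bayes_choice = labels[q.index(max(q))]  # mode of Q_tau = Bayes rule: "a"
```

With the equality reward, the expected reward of each candidate is just its conditional probability, so the mode of $Q_\tau$ picks the most probable label regardless of $\tau$.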

3. RAML as Approximate Softmax Q-Distribution Estimation

Directly estimating $Q_\tau$ is intractable because of the expectation inside the exponent. RAML uses a tractable surrogate:

$$Q'_\tau(y \mid x) = \mathbb{E}_{P(Y \mid x)}\left[\frac{\exp\big(r(y, Y)/\tau\big)}{\sum_{y'} \exp\big(r(y', Y)/\tau\big)}\right].$$

The empirical version, with the true conditional distribution replaced by the empirical data distribution, reduces to the RAML payoff distribution $q_\tau(y \mid y^*)$. The minimizer of the KL divergence between $Q'_\tau$ and the model distribution $P_\theta$ coincides with the RAML estimator up to a controlled approximation error. Specifically, the KL divergence between $Q_\tau$ and $Q'_\tau$ can be bounded in terms of the maximum reward $R_{\max}$ and $\tau$, and additional conditions allow this bound to shrink for small $\tau$ (Ma et al., 2017).
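Because the surrogate moves the expectation outside the softmax, $Q'_\tau$ can be estimated by averaging $q_\tau(\cdot \mid Y)$ over samples of $Y$. The sketch below, with a hypothetical candidate set and a negative Hamming reward as stand-ins, also checks the reduction noted above: when $P(Y \mid x)$ is a point mass on a single reference $y^*$, the surrogate is exactly the RAML payoff $q_\tau(\cdot \mid y^*)$.

```python
import math

def payoff(cands, y_ref, reward, tau):
    """q_tau(. | y_ref): softmax over r(y, y_ref)/tau for candidates y."""
    logits = [reward(y, y_ref) / tau for y in cands]
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    z = sum(e)
    return [x / z for x in e]

def surrogate_q(cands, ref_samples, reward, tau):
    """Q'_tau(y | x) estimated as the average of q_tau(y | Y) over
    reference samples Y drawn from P(Y | x)."""
    dists = [payoff(cands, Y, reward, tau) for Y in ref_samples]
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(cands))]

# With the empirical distribution concentrated on one reference "aa",
# Q'_tau is exactly the RAML payoff q_tau(. | "aa").
cands = ["aa", "ab", "bb"]
ham = lambda y, Y: -sum(u != v for u, v in zip(y, Y))  # negative Hamming reward
assert surrogate_q(cands, ["aa"], ham, 1.0) == payoff(cands, "aa", ham, 1.0)
```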

4. Relation to Entropy-Regularized Reinforcement Learning

RAML shares a formal connection to entropy-regularized RL, particularly in the case of sequence prediction. The constructed exponentiated payoff distribution is analogous to the policy in a maximum-entropy Markov Decision Process (MDP), where the entropy penalty encourages distributional smoothness over possible outputs. Recent work has established that the token-level RAML target can be equivalently derived from the optimal soft Bellman recursion for a reward-MDP with entropy regularization, linking RAML to the broader RL landscape and suggesting improved variants such as Value-Augmented Maximum Likelihood (VAML) (Dai et al., 2018).

5. Algorithmic Implementation and Practical Aspects

A typical RAML optimization loop proceeds as follows:

  1. For each training pair $(x, y^*)$, sample candidate outputs $y \sim q_\tau(\cdot \mid y^*)$.
  2. Compute the cross-entropy loss $-\log P_\theta(y \mid x)$ at the sampled $y$.
  3. Update parameters $\theta$ with the gradient averaged over the batch.

Sampling from $q_\tau$ may use edit-distance-based proposals (for speech or translation), perturbing $y^*$ at various edit distances in proportion to their exponentiated reward. The procedure can be implemented efficiently, with low-variance Monte Carlo estimates of the gradient. The temperature $\tau$ is a crucial hyperparameter; the best performance in practice is usually achieved with $\tau \in [0.7, 1.0]$ (Norouzi et al., 2016). RAML preserves the fast training characteristics of ML, with minimal overhead and no need for online inference as in policy-gradient RL.
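The sampling step can be sketched for the simplified case of fixed-length outputs and substitution-only edits (Hamming distance), where the number of outputs at each distance is known in closed form. This is an illustrative simplification of the edit-distance proposals of Norouzi et al. (2016), assuming a negative-Hamming-distance reward; the function name and toy vocabulary are hypothetical.

```python
import math
import random

def sample_raml_target(y_star, vocab, tau, rng=random):
    """Sample y ~ q_tau(. | y*) by stratifying over Hamming distance:
    draw a distance d with weight C(n, d) * (V-1)^d * exp(-d / tau)
    (the count of outputs at distance d times their exponentiated
    reward), then substitute d randomly chosen positions of y*."""
    n, V = len(y_star), len(vocab)
    weights = [math.comb(n, d) * (V - 1) ** d * math.exp(-d / tau)
               for d in range(n + 1)]
    d = rng.choices(range(n + 1), weights=weights)[0]
    y = list(y_star)
    for i in rng.sample(range(n), d):
        y[i] = rng.choice([v for v in vocab if v != y[i]])
    return y

# At very low temperature the sampler collapses to the ground truth,
# mirroring the tau -> 0 limit of q_tau; at moderate tau it returns
# length-preserving perturbations of y*.
y = sample_raml_target(list("abcb"), vocab="abcd", tau=1e-6)
```

The stratification makes each draw exact (no rejection step) and cheap: the per-distance weights depend only on the sequence length, vocabulary size, and $\tau$, consistent with the low-overhead training loop described above.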

6. Empirical Performance and Comparative Analysis

Empirically, RAML demonstrates consistent improvements over standard ML across a range of structured prediction tasks. On speech recognition (TIMIT), RAML with $\tau = 0.9$ reduced the phone error rate from 22.18% (ML) to 19.89%. On WMT'14 English→French translation, BLEU improved from 36.87 (ML) to 37.23 (RAML, $\tau = 0.85$). In multi-reference captioning (MSCOCO), both RAML and softmax Q-distribution methods outperformed ML on BLEU metrics. Across named entity recognition, dependency parsing, and machine translation, RAML improves performance on relevant reward metrics while retaining ML's computational advantages (Norouzi et al., 2016, Ma et al., 2017, Dai et al., 2018).

A summary of empirical BLEU scores on machine translation and image captioning (test set, ± std) from (Dai et al., 2018):

System   WMT'14 DE→EN (input-feeding)   COCO Captioning
MLE      28.06 ± 0.15                   29.54 ± 0.21
RAML     28.56 ± 0.15                   29.84 ± 0.21
VAML     28.84 ± 0.10                   29.93 ± 0.22
ERAC     29.31 ± 0.04                   31.44 ± 0.22

Note: VAML and ERAC are successor algorithms leveraging further RL-inspired constructs.

7. Extensions, Limitations, and Future Directions

RAML has inspired a family of related algorithms including token-level RAML, Value-Augmented Maximum Likelihood (VAML), and entropy-regularized actor-critic (ERAC). While RAML uses sequence-level reward, VAML propagates reward to token-level supervision using a learned Q-function, improving sample efficiency. ERAC introduces systematic entropy regularization into the actor-critic framework for sequence modeling, consistently outperforming previous baselines (Dai et al., 2018).

A limitation of the RAML framework is the approximation error arising from empirical surrogates to the ideal softmax Q-distribution. The effectiveness of downstream algorithms may depend on the quality of value estimation and proposal distributions used for candidate sampling. Future directions may include leveraging stronger off-policy RL methods and improved value function approximators to better align learning with the true Bayes-optimal rule, as well as generalizing RAML to additional structure classes and reward functions (Ma et al., 2017, Dai et al., 2018).
