RUDDER: Residual-Update Decoding Regulation

Updated 20 November 2025
  • RUDDER is a framework using residual update redistribution to improve temporal credit assignment in reinforcement learning, enabling efficient learning in sparse or delayed reward scenarios.
  • It applies directed decoding regulation in large vision-language models to mitigate hallucination by adaptively steering token outputs with minimal computational overhead.
  • The approach provides theoretical convergence guarantees and leverages demonstration-based profiles to optimize hyperparameters for robust, practical applications.

Residual-Update Directed DEcoding Regulation (RUDDER) is a family of algorithmic frameworks that use residual-update analysis and directed decoding mechanisms to steer learning or generation in sequential decision-making and generative models. RUDDER was first established in reinforcement learning for reward redistribution and deep credit assignment, enabling sample-efficient learning in environments with sparse or delayed rewards. It has since been adapted for token-level intervention in autoregressive large vision-language models (LVLMs) to mitigate object hallucination with minimal computational overhead.

1. Motivation and Problem Definition

RUDDER addresses the challenge of temporally misaligned signals in sequence-based tasks: in particular, the problem of sparse or delayed rewards in reinforcement learning, and the weak visual grounding of language generation in vision-language models.

  • Sparse/Delayed Reward RL: In many RL environments, the original Markov reward signals $\tilde R_{t}$ are only non-zero at episode termination. Standard actor-critic or value-based updates propagate reward information slowly, impeding credit assignment across long horizons.
  • LVLM Hallucination: In autoregressive generation settings, strong text priors can cause large vision-language models to reference objects that are inconsistent with the visual input. Direct output or hidden-state steering typically requires costly interventions across multiple forward passes.

RUDDER introduces residual-based reward redistribution and regulation techniques to address these issues in a computationally efficient manner (Patil et al., 2020, Holzleitner et al., 2020, Zou et al., 13 Nov 2025).

2. Core Algorithmic Principles

2.1. RL: Reward Redistribution via Return Decomposition

In the original RL formulation, RUDDER replaces standard value estimation with a return decomposition model. Rather than estimating value functions $V(s)$ or $Q(s,a)$ directly, it trains a sequence model $g\left((s,a)_{0:t}\right)$ to predict the expected return $\tilde G_0 = \sum_{t=0}^T \tilde R_{t+1}$, given full or partial trajectories. The redistributed reward is

$$R_{t+1} = g\left((s,a)_{0:t}\right) - g\left((s,a)_{0:t-1}\right).$$

This residual captures the increments in expected return, which correspond to critical subtask accomplishments. By construction, $\sum_{t=0}^T R_{t+1} = \tilde G_0$; the redistributed reward neither alters the optimal policy nor the total return, but drastically improves temporal credit assignment for downstream RL updates (Patil et al., 2020, Holzleitner et al., 2020).
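To make the redistribution step concrete, the following minimal sketch computes per-step rewards from a trained return predictor; `return_model`, its call signature, and the zero-initialized prefix convention are illustrative assumptions rather than the authors' implementation.

```python
import torch

def redistribute_rewards(return_model, states, actions):
    """Sketch of RUDDER's return decomposition (not the authors' code).

    `return_model(states[:t+1], actions[:t+1])` is assumed to return a scalar
    prediction g((s,a)_{0:t}) of the episode return, e.g. an LSTM read out at
    the last time step of the prefix.
    """
    T = states.shape[0]
    g = torch.stack([return_model(states[: t + 1], actions[: t + 1])
                     for t in range(T)])
    g_prev = torch.cat([torch.zeros(1), g[:-1]])   # convention: g(empty prefix) = 0
    rewards = g - g_prev                           # R_{t+1} = g_{0:t} - g_{0:t-1}
    # Telescoping: rewards.sum() equals g[-1], the predicted episode return,
    # so the total return is preserved up to the model's prediction error.
    return rewards
```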

2.2. LVLMs: Contextual Residual Steering

In vision-language models, RUDDER (as introduced in (Zou et al., 13 Nov 2025)) applies directed regulation at each decoding step, based on the residual updates in transformer layers. A Contextual Activation Residual Direction (CARD) vector is computed during a single prefill pass:

  • For a chosen layer $\ell^*$, aggregate the residual updates $\Delta_i^{\ell^*} = \mathbf{A}_i^{\ell^*}$ over the image and prompt tokens and normalize via pooling, yielding $\mathbf{v}_{\mathrm{CARD}}$.

Token-wise, a steering signal is injected adaptively, with its strength modulated by the alignment between the current hidden state and $\mathbf{v}_{\mathrm{CARD}}$.
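A minimal sketch of the pooling step is shown below; the per-token residual updates are assumed to be available as a single tensor from the prefill pass, and mean pooling with L2 normalization stands in for whatever exact aggregation the paper uses.

```python
import torch

def card_vector(residual_updates: torch.Tensor) -> torch.Tensor:
    """Sketch of the CARD construction (illustrative, not the paper's code).

    `residual_updates` holds the residual-stream updates A_i at the chosen
    layer l* for the image and prompt tokens of the single prefill pass,
    with assumed shape (num_tokens, hidden_dim).
    """
    pooled = residual_updates.mean(dim=0)             # aggregate over tokens
    return pooled / pooled.norm().clamp_min(1e-8)     # normalize to unit length
```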

3. Formal Frameworks and Update Equations

3.1. RL: Actor-Critic with Decomposed Return Critic

The RUDDER objective comprises two losses:

  • Actor (residual-update) loss:

$$L_h(\theta, \omega, z) = \mathbb{E}_{\tau \sim \breve{\pi}} \left[ \frac{1}{2} \sum_{t=0}^T \left(R_{t+1}(\tau; \theta) - \hat{q}_\theta(s_t, a_t)\right)^2 \right],$$

where $\hat{q}_\theta(s_t, a_t)$ is the Q-estimate and $R_{t+1}$ is redistributed via $g(\cdot)$.

  • Critic (return-decomposition) loss:

$$L_g(\theta, \omega, z) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \frac{1}{2} \left(\sum_{t=0}^{T}\tilde{R}_{t+1} - g(\tau;\omega)\right)^{2} \right].$$

Parameters $(\theta, \omega)$ are updated on separate time scales (slow for the actor, fast for the critic), ensuring convergence to a local fixed point $(\theta^*, \omega^*(\theta^*))$ under standard technical assumptions (Holzleitner et al., 2020).
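A schematic update under these two losses is sketched below; the model interfaces, the optimizers, and the use of the redistributed rewards as regression targets for $\hat q_\theta$ are assumptions for illustration, with the two optimizers' step sizes encoding the slow-actor / fast-critic time scales.

```python
import torch

def rudder_actor_critic_step(actor_q, critic_g, states, actions, episode_return,
                             opt_actor, opt_critic):
    """One schematic RUDDER update (illustrative, not the authors' code).

    actor_q(s, a)             -> scalar Q-estimate q_hat_theta(s, a)
    critic_g(states, actions) -> scalar return prediction g(.; omega) for a prefix
    opt_actor uses a smaller step size than opt_critic (two time scales).
    """
    # Critic: regress the return-decomposition model on the observed return.
    critic_loss = 0.5 * (episode_return - critic_g(states, actions)) ** 2
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # Actor: fit Q-estimates to the redistributed rewards R_{t+1}.
    with torch.no_grad():
        g = torch.stack([critic_g(states[: t + 1], actions[: t + 1])
                         for t in range(states.shape[0])])
        rewards = g - torch.cat([torch.zeros(1), g[:-1]])
    q = torch.stack([actor_q(states[t], actions[t]) for t in range(states.shape[0])])
    actor_loss = 0.5 * ((rewards - q) ** 2).sum()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
    return float(actor_loss), float(critic_loss)
```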

3.2. LVLMs: Adaptive Residual Update Injection

  • Given $\mathbf{v}_{\mathrm{CARD}}$, at generation step $t$, compute

$$s_t = \cos\left(\mathbf{h}_{\ell^*, t}, \mathbf{v}_{\mathrm{CARD}}\right)$$

where $\mathbf{h}_{\ell^*,t}$ is the hidden state at layer $\ell^*$.

  • Form a Beta-Bernoulli gate:

$$\alpha_t = \mathrm{softplus}(k\,s_t + c), \quad \beta_t = \mathrm{softplus}(-k\,s_t + c).$$

$$g_t = \frac{\alpha_t}{\alpha_t + \beta_t}, \quad g_t = \mathrm{clip}(g_t;\, g_{\min}, g_{\max}),$$

$$\mathbf{h}_{\ell^*, t}^{\mathrm{new}} = \left(\mathbf{h}_{\ell^*, t} + \mathrm{SA}(\mathbf{h}_{\ell^*, t})\right) + \mathbf{1}[t \in \mathcal{T}_{\mathrm{ans}}]\, (\alpha_{\max}\, g_t)\, \mathbf{v}_{\mathrm{CARD}}.$$

This mechanism preserves fluency when the LLM is already grounded and only exerts a corrective influence during periods of visual-textual misalignment (Zou et al., 13 Nov 2025).
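The gate and injection above can be sketched as follows; the hyperparameter values are placeholders, and the hidden state, self-attention output, and CARD direction are assumed to be exposed by a hook on layer $\ell^*$.

```python
import torch
import torch.nn.functional as F

def adaptive_injection(h_t, sa_out, v_card, is_answer_token,
                       k=10.0, c=1.0, alpha_max=1.0, g_min=0.05, g_max=0.95):
    """Sketch of the token-wise gate and steering injection defined above.

    h_t:    hidden state h_{l*,t} at the injection layer, shape (hidden_dim,)
    sa_out: the layer's self-attention output SA(h_{l*,t})
    v_card: CARD direction computed once during the prefill pass
    Hyperparameter defaults here are placeholders, not the paper's settings.
    """
    s_t = F.cosine_similarity(h_t, v_card, dim=0)             # alignment score
    alpha = F.softplus(k * s_t + c)                           # Beta-Bernoulli gate
    beta = F.softplus(-k * s_t + c)
    g_t = torch.clamp(alpha / (alpha + beta), g_min, g_max)
    steer = (alpha_max * g_t) * v_card if is_answer_token else torch.zeros_like(v_card)
    return h_t + sa_out + steer                               # modified residual update
```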

4. Advanced Variants: Align-RUDDER and Demonstration-Based Profiles

When only a limited set of expert demonstrations is available, fitting a deep sequence model (e.g., an LSTM) for $g$ is impractical. Align-RUDDER addresses this by deriving a profile model from multiple sequence alignments:

  • Event vocabulary: Cluster state-action deltas using successor representations and affinity propagation, typically yielding $\sim 20$ clusters; map each trajectory to an event sequence $e_0 e_1 \ldots e_T$.
  • Scoring matrix: Define $s_{i,j} = 1/p_i$ for $i = j$ (with empirical event frequency $p_i$) and $s_{i,j} = \alpha$ (with $\alpha < 0$) for $i \neq j$.
  • Multiple sequence alignment (MSA): Use Clustal W with zero gap penalties to align demonstration event sequences, producing a consensus $N \times L$ matrix.
  • Profile (PSSM): For each profile column $t$ and event $i$, compute the frequency $q_{i,t}$ and the score

$$s_{i, t} = \frac{1}{\lambda_t} \ln \left(\frac{q_{i,t}}{p_i}\right),$$

with the normalization $\lambda_t$ ensuring $\sum_i p_i\, e^{\lambda_t s_{i,t}} = 1$.

  • Reward extraction: For a new agent history $\tau_t$, align the prefix to the profile, compute the cumulative score $S(\tau_t)$, and define

$$R_{t+1} = C\left(S(\tau_t) - S(\tau_{t-1})\right), \quad C = \overline{\tilde{G}_0} \,\Big/\, \overline{\textstyle\sum_t \left(S(\tau_t) - S(\tau_{t-1})\right)}.$$

A final correction, $R_{T+2} = \tilde{G}_0 - \sum_{t=0}^T R_{t+1}$, imposes exact reward equivalence. The profile can be constructed reliably from as few as 2–10 demonstrations, eliminating the need for costly deep-model training in low-data regimes (Patil et al., 2020).
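The reward-extraction step can be sketched as below; `profile_score` is a hypothetical helper that returns the cumulative alignment score $S(\tau_t)$ of an event-sequence prefix against the demonstration profile, and the terminal correction is left to the caller.

```python
import numpy as np

def align_rudder_rewards(profile_score, event_seq, mean_demo_return, mean_demo_score_gain):
    """Sketch of Align-RUDDER's reward extraction (illustrative only).

    profile_score(prefix) -> S(tau_t): cumulative PSSM alignment score of a prefix.
    C is the ratio of the average demonstration return to the average total
    score gain over demonstrations, as defined above.
    """
    C = mean_demo_return / mean_demo_score_gain
    S = np.array([profile_score(event_seq[: t + 1]) for t in range(len(event_seq))])
    S_prev = np.concatenate(([0.0], S[:-1]))
    rewards = C * (S - S_prev)        # R_{t+1} = C (S(tau_t) - S(tau_{t-1}))
    # A terminal correction R_{T+2} = G_0 - sum_t R_{t+1} can be appended once
    # the true episode return G_0 is observed, enforcing exact equivalence.
    return rewards
```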

5. Empirical Performance and Benchmarks

| Task | Setup | Data Level | Demo Baselines | RL Baselines | Align-RUDDER Result | Baseline Result |
| --- | --- | --- | --- | --- | --- | --- |
| Key–Chest Toy | 32-step, 2-key retrieval | 2/5/10 demos | – | LSTM RUDDER | $\sim$0.96 recall (profile) | $\sim$0.46 (LSTM) |
| FourRooms | 12×12 grid, journey to portal | 10 demos | BC+Q, SQIL | LSTM RUDDER | $\sim$1,372 episodes to target | BC+Q: 7,624; RUDDER: 41,000 |
| EightRooms | 12×24 grid, multiple doors | 10 demos | BC+Q, SQIL | LSTM RUDDER | $\sim$2,728 episodes to target | BC+Q: 14,992; RUDDER: 85,000 |
| Minecraft | ObtainDiamond | 10 human demos | – | – | Achieves diamond (0.1% frequency) | No baseline ever succeeds |

On these benchmarks, RUDDER and Align-RUDDER accelerate RL convergence in sparse-reward domains by factors of roughly 5–30× over demonstration-based and LSTM-RUDDER baselines. In Minecraft's ObtainDiamond, Align-RUDDER identifies $\sim 31$ sub-goal boundaries in successful demonstrations and structurally segments the agent policy, enabling the first demonstration-based end-to-end RL agent to reach the diamond purely via reward redistribution (Patil et al., 2020).

In vision-language settings, RUDDER-based residual steering achieves a 33.2% reduction in the hallucinated-caption rate (CHAIR$_S$) and a 28.6% reduction in the hallucinated-object count (CHAIR$_I$), at only $\sim$0.6 ms/token additional latency, substantially outperforming prior methods in efficiency while remaining on par in effectiveness (Zou et al., 13 Nov 2025).

6. Theoretical Guarantees and Analysis

Viewed as an actor-critic method with a return-decomposition critic and a residual-update actor, RUDDER admits local convergence guarantees under standard assumptions (episodic sampling, step-size schedules, smoothness, boundedness):

  • Two-time-scale stochastic approximation theory yields almost-sure convergence of $(\theta_n, \omega_n)$ to a locally optimal stationary point as $n \to \infty$, provided the critic loss surface is sufficiently regular (Holzleitner et al., 2020); a schematic form of the coupled updates is given after this list.
  • The framework covers both RUDDER and Proximal Policy Optimization (PPO), as both fit under the general actor-critic formalism; RUDDER’s distinguishing element is its reward redistribution scheme.
  • Essential technical lemmas establish local uniqueness and smoothness of the critic solution, and guarantee that the actor converges to a deterministic optimal policy for finite greedification parameter $\beta$.
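For orientation, the coupled updates referenced in the first bullet take the standard two-time-scale form shown below; the step-size conditions are the usual Borkar-style requirements, stated schematically rather than exactly as in the paper:

$$\omega_{n+1} = \omega_n - b_n\, \widehat{\nabla}_\omega L_g(\theta_n, \omega_n), \qquad \theta_{n+1} = \theta_n - a_n\, \widehat{\nabla}_\theta L_h(\theta_n, \omega_n),$$

$$\sum_n a_n = \sum_n b_n = \infty, \qquad \sum_n \left(a_n^2 + b_n^2\right) < \infty, \qquad a_n / b_n \to 0,$$

so the fast critic effectively tracks $\omega^*(\theta_n)$ while the slow actor follows the resulting quasi-stationary gradient.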

A plausible implication is that RUDDER’s update structure generalizes to other domains with late-arriving signals, provided a suitable return-decomposition model can be constructed.

7. Practical Applications, Limitations, and Future Directions

RUDDER has demonstrated empirical impact on complex, temporally extended RL tasks with sparse rewards, and in production-level LVLMs where efficient and effective hallucination mitigation is required. In both domains, RUDDER achieves strong performance with minimal overhead by reframing signal-assignment problems in terms of residual-update analysis.

Key practical considerations include:

  • In RL, Align-RUDDER’s reliance on demonstration-derived profiles enables robust reward redistribution with very limited expert trajectories.
  • In LVLMs, effectiveness depends on tuning hyperparameters: the injection layer, maximum steering strength, gate sensitivity, and clamping range; model-specific calibration is generally required, as in the hypothetical configuration sketched after this list (Zou et al., 13 Nov 2025).
  • RUDDER’s main limitation is its sensitivity to the quality of the underlying residual/return-decomposition model (either LSTM predictor or profile alignment), and the necessity of demonstration data or a high-precision prefill pass.
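For concreteness, a hypothetical bundle of the LVLM steering hyperparameters listed above might look as follows; the field names and default values are illustrative and are not those of the paper or any released implementation.

```python
from dataclasses import dataclass

@dataclass
class RudderSteeringConfig:
    """Hypothetical hyperparameter bundle for RUDDER-style LVLM steering."""
    injection_layer: int = 20    # layer l* whose residual stream is steered
    alpha_max: float = 1.0       # maximum steering strength
    gate_k: float = 10.0         # gate sensitivity k in softplus(k * s_t + c)
    gate_c: float = 1.0          # gate offset c
    g_min: float = 0.05          # lower clamp on the gate g_t
    g_max: float = 0.95          # upper clamp on the gate g_t
```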

Future work may automate hyperparameter optimization for LVLM steering, extend profile-based reward redistribution to more complicated demonstration sets, and explore theoretical links between return decomposition and information-theoretic attributions in sequence models.

