RUDDER: Residual-Update Decoding Regulation
- RUDDER is a framework using residual update redistribution to improve temporal credit assignment in reinforcement learning, enabling efficient learning in sparse or delayed reward scenarios.
- It applies directed decoding regulation in large vision-language models to mitigate hallucination by adaptively steering token outputs with minimal computational overhead.
- The approach provides theoretical convergence guarantees and, via Align-RUDDER, leverages demonstration-based profiles for robust reward redistribution when only a few expert trajectories are available.
Residual-Update Directed DEcoding Regulation (RUDDER) is a family of algorithmic frameworks that use residual-update analysis and directed decoding mechanisms to steer learning or generation in sequential decision-making and generative models. RUDDER was first established in reinforcement learning for reward redistribution and deep credit assignment, enabling sample-efficient learning in environments with sparse or delayed rewards. It has since been adapted for token-level intervention in autoregressive large vision-language models (LVLMs) to mitigate object hallucination with minimal computational overhead.
1. Motivation and Problem Definition
RUDDER addresses the challenge of temporally misaligned signals in sequence-based tasks: in particular, the problem of sparse or delayed rewards in reinforcement learning, and the weak visual grounding of language generation in vision-language models.
- Sparse/Delayed Reward RL: In many RL environments, the reward of the underlying Markov decision process is non-zero only at episode termination. Standard actor-critic or value-based updates propagate reward information slowly, impeding credit assignment across long horizons.
- LVLM Hallucination: In autoregressive generation settings, strong text priors can cause large vision-language models to reference objects that are inconsistent with the visual input. Direct output- or hidden-state steering typically requires costly interventions across multiple forward passes.
RUDDER introduces residual-based reward redistribution and regulation techniques to address these issues in a computationally efficient manner (Patil et al., 2020, Holzleitner et al., 2020, Zou et al., 13 Nov 2025).
2. Core Algorithmic Principles
2.1. RL: Reward Redistribution via Return Decomposition
In the original RL formulation, RUDDER replaces standard value estimation with a return decomposition model. Rather than estimating value functions $V^\pi(s)$ or action-value functions $Q^\pi(s,a)$ directly, it trains a sequence model $g$ to predict the expected return $G_0$ of an episode, given full or partial trajectories. The redistributed reward is
$$R_{t+1} = g(s_{0:t}, a_{0:t}) - g(s_{0:t-1}, a_{0:t-1}).$$
This residual captures the increments in expected return, which correspond to critical subtask accomplishments. By construction, $\sum_{t=0}^{T} R_{t+1} = g(s_{0:T}, a_{0:T})$; the redistributed reward neither alters the optimal policy nor the total return, but drastically improves temporal credit assignment for downstream RL updates (Patil et al., 2020, Holzleitner et al., 2020).
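As a concrete illustration, the following minimal sketch (hypothetical names; PyTorch assumed, and `return_model` stands for any trained sequence model that maps a trajectory prefix to a scalar return prediction) computes redistributed rewards as first-order differences of the return predictions:

```python
import torch

def redistribute_rewards(return_model, states, actions):
    """Turn a trained return-decomposition model into per-step rewards.

    `return_model.predict(states, actions)` is assumed (hypothetical API)
    to return a 0-dim tensor: the predicted episode return for the prefix.
    """
    T = len(states)
    # g[t] = predicted return given the prefix up to and including step t
    g = torch.stack([return_model.predict(states[: t + 1], actions[: t + 1])
                     for t in range(T)])
    # Redistributed reward R_{t+1} = g[t] - g[t-1], with g[-1] taken as 0,
    # so the rewards telescope to the full-sequence prediction g[T-1].
    g_prev = torch.cat([torch.zeros(1), g[:-1]])
    return g - g_prev
```

Because the rewards telescope, their sum equals the model's full-sequence prediction, which is what preserves return equivalence while moving credit to the steps where the prediction jumps.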
2.2. LVLMs: Contextual Residual Steering
In vision-language models, RUDDER (as introduced in (Zou et al., 13 Nov 2025)) applies directed regulation at each decoding step based on the residual updates in transformer layers. A Contextual Activation Residual Direction (CARD) vector is computed during a single prefill pass:
- For a chosen layer $\ell$, aggregate the residual-stream updates over the image and prompt tokens and normalize via pooling, yielding a unit direction $\mathbf{d}_\ell$.
Token-wise, a steering signal along $\mathbf{d}_\ell$ is then injected adaptively, with its strength modulated by the alignment between the current hidden state and $\mathbf{d}_\ell$, as sketched below.
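A minimal sketch of the prefill-side computation, assuming mean pooling over the image and prompt tokens followed by L2 normalization (the exact pooling used by Zou et al. may differ; all names are illustrative):

```python
import torch

def compute_card_direction(residual_updates, layer, prefill_mask):
    """Sketch of a CARD-style direction from one prefill pass.

    residual_updates: dict mapping layer index -> tensor of shape
        (num_prefill_tokens, hidden_dim), the per-token residual update
        (block output minus block input) captured during prefill.
    prefill_mask: boolean tensor selecting the image and prompt tokens.
    """
    updates = residual_updates[layer][prefill_mask]     # (n_tokens, hidden_dim)
    pooled = updates.mean(dim=0)                        # pool over selected tokens
    return pooled / pooled.norm().clamp_min(1e-8)       # unit-norm direction
```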
3. Formal Frameworks and Update Equations
3.1. RL: Actor-Critic with Decomposed Return Critic
The RUDDER objective comprises two coupled losses, written here schematically:
- Actor loss (policy gradient on redistributed rewards):
$$\mathcal{L}_{\text{actor}}(\theta) = -\,\mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_{t} \log \pi_\theta(a_t \mid s_t)\, \hat{q}_\omega(s_t, a_t)\Big],$$
where $\hat{q}_\omega(s_t, a_t)$ is the Q-estimate built from the rewards redistributed via $R_{t+1} = g_\omega(s_{0:t}, a_{0:t}) - g_\omega(s_{0:t-1}, a_{0:t-1})$.
- Critic loss (return decomposition):
$$\mathcal{L}_{\text{critic}}(\omega) = \mathbb{E}_{\tau}\Big[\big(g_\omega(s_{0:T}, a_{0:T}) - G_0\big)^2\Big].$$
Parameters are updated on separate time scales (slow for actor, fast for critic), ensuring convergence to a local fixed point under standard technical assumptions (Holzleitner et al., 2020).
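The two-time-scale structure can be realized, for instance, by stepping the critic with a larger learning rate than the actor; a rough sketch (hypothetical `loss` interfaces, not the papers' exact objectives) is:

```python
import torch

def two_timescale_step(actor, critic, batch, actor_opt, critic_opt):
    """One coupled update with the critic on the fast time scale.

    `critic.loss(batch)` is assumed to return the squared error between
    predicted and observed episode returns; `actor.loss(batch, critic)`
    the policy-gradient surrogate built from the redistributed rewards.
    """
    # Fast time scale: fit the return-decomposition critic.
    critic_opt.zero_grad()
    critic.loss(batch).backward()
    critic_opt.step()

    # Slow time scale: policy step on rewards redistributed by the critic.
    actor_opt.zero_grad()
    actor.loss(batch, critic).backward()
    actor_opt.step()

# The time-scale separation lives in the optimizers, e.g. (hypothetical values)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
# actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
```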
3.2. LVLMs: Adaptive Residual Update Injection
- Given the CARD direction $\mathbf{d}_\ell$, at generation step $t$ compute an alignment score between $\mathbf{d}_\ell$ and $h_t^{(\ell)}$, the hidden state at layer $\ell$.
- Pass the alignment score through a Beta-Bernoulli gate that decides whether, and how strongly, to intervene at this step.
- Inject a scaled steering vector along $\mathbf{d}_\ell$ into the residual stream (only for answer tokens), with the injected magnitude bounded by a maximum steering strength and a clamping range.
This mechanism preserves fluency when the LLM is already grounded and only exerts a corrective influence during periods of visual-textual misalignment (Zou et al., 13 Nov 2025).
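For illustration, here is a sketch of how such a gated injection could be attached to a decoder layer with a forward hook. The gate below is a simple sigmoid of the cosine alignment, not the Beta-Bernoulli gate of Zou et al.; the layer index and constants are hypothetical:

```python
import torch

def make_steering_hook(card, max_strength=0.1, sensitivity=10.0):
    """Forward hook that adds a gated multiple of the CARD direction.

    A minimal sketch: the injection strength grows as the current hidden
    state drifts away from the visual direction `card`, and is clamped
    to `max_strength`.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, d)
        h_last = hidden[:, -1, :]                                     # current token
        align = torch.nn.functional.cosine_similarity(h_last, card[None, :], dim=-1)
        # Gate: strong correction when alignment is low, little when it is high.
        strength = (max_strength * torch.sigmoid(-sensitivity * align)).clamp(0.0, max_strength)
        hidden[:, -1, :] = h_last + strength[:, None] * card
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (hypothetical layer index): register on the chosen decoder block only
# while answer tokens are being generated, e.g.
#   handle = model.model.layers[20].register_forward_hook(make_steering_hook(card))
#   ...generate...
#   handle.remove()
```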
4. Advanced Variants: Align-RUDDER and Demonstration-Based Profiles
When only a limited set of expert demonstrations is available, fitting a deep sequence model (e.g., an LSTM) for the return decomposition $g$ is impractical. Align-RUDDER addresses this by deriving a profile model from multiple sequence alignments:
- Event vocabulary: Cluster state-action deltas using successor representations and affinity propagation, typically yielding a small number of clusters (events); map each trajectory to an event sequence $e_0 e_1 \cdots e_T$.
- Scoring matrix: Define a substitution score $s_{i,j}$ that rewards matches ($i = j$) in inverse proportion to the empirical event frequency $p_i$, so that rare events score highly, and assigns a fixed negative score to mismatches ($i \neq j$).
- Multiple sequence alignment (MSA): Use Clustal W with zero gap penalties to align demonstration event sequences, producing a consensus matrix.
- Profile (PSSM): For each profile column $j$ and event $i$, compute the column frequency $q_{i,j}$ from the alignment and a log-odds score of the form $s_{i,j} \propto \log(q_{i,j}/p_i)$, with normalization ensuring $\sum_i q_{i,j} = 1$ in each column.
- Reward extraction: For a new agent history, map it to events, align each prefix $e_{0:t}$ to the profile, compute the cumulative alignment score $S(e_{0:t})$, and define the redistributed reward as the score increment, $R_{t+1} \propto S(e_{0:t}) - S(e_{0:t-1})$.
A final correction reward at episode end imposes exact return equivalence. The profile can be constructed reliably with as few as 2–10 demonstrations, eliminating the need for costly deep model training in low-data regimes (Patil et al., 2020).
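A rough sketch of the resulting reward computation, assuming a helper `align_score` (hypothetical) that returns the cumulative PSSM alignment score of an event-sequence prefix against the profile, and a hypothetical scaling constant `kappa`:

```python
def profile_redistribution(align_score, events, original_return, kappa=1.0):
    """Sketch of profile-based reward redistribution (Align-RUDDER style)."""
    # Cumulative alignment score of every prefix of the event sequence.
    scores = [align_score(events[: t + 1]) for t in range(len(events))]
    # Reward = increment in alignment score at each step.
    rewards = [kappa * scores[0]] + [
        kappa * (scores[t] - scores[t - 1]) for t in range(1, len(scores))
    ]
    # Final correction: enforce exact return equivalence.
    rewards[-1] += original_return - sum(rewards)
    return rewards
```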
5. Empirical Performance and Benchmarks
| Task | Setup | Demonstrations | Demo Baselines | RL Baseline | Align-RUDDER Result | Baseline Result |
|---|---|---|---|---|---|---|
| Key–Chest Toy | 32-step, 2-key retrieval | 2/5/10 demos | — | LSTM RUDDER | 0.96 recall (profile) | 0.46 recall (LSTM) |
| FourRooms | 12x12 grid, journey to portal | 10 demos | BC+Q, SQIL | LSTM RUDDER | 1,372 episodes to target | BC+Q: 7,624; RUDDER: 41,000 |
| EightRooms | 12x24 grid, multiple doors | 10 demos | BC+Q, SQIL | LSTM RUDDER | 2,728 episodes to target | BC+Q: 14,992; RUDDER: 85,000 |
| Minecraft ObtainDiamond | — | 10 human demos | — | — | Achieves diamond (0.1% frequency) | None succeed |
RUDDER and Align-RUDDER markedly accelerate RL convergence in sparse-reward domains: relative to standard demonstration-based baselines, the number of episodes to target is reduced by roughly a factor of five, and relative to LSTM-based RUDDER by more than an order of magnitude. In Minecraft’s ObtainDiamond, Align-RUDDER identifies 31 sub-goal boundaries in successful demonstrations and structurally segments the agent policy, enabling the first successful demonstration-based end-to-end RL agent to reach the diamond purely via reward redistribution (Patil et al., 2020).
In vision-language settings, RUDDER-based residual steering achieves a 33.2% reduction in hallucinated caption rate (CHAIR$_S$) and a 28.6% reduction in hallucinated object instances (CHAIR$_I$), at only 0.6 ms/token latency overhead, substantially outperforming prior methods in efficiency while remaining on par in effectiveness (Zou et al., 13 Nov 2025).
6. Theoretical Guarantees and Analysis
RUDDER, viewed as an actor-critic method with a return-decomposition critic and a policy actor driven by redistributed rewards, admits local convergence guarantees under standard assumptions (episodic sampling, step-size schedules, smoothness, boundedness):
- Two-time-scale stochastic approximation theory yields almost-sure convergence of the actor and critic parameters $(\theta_k, \omega_k)$ to a locally optimal stationary point as $k \to \infty$, provided the critic loss surface is sufficiently regular (Holzleitner et al., 2020).
- The framework covers both RUDDER and Proximal Policy Optimization (PPO), as both fit under the general actor-critic formalism; RUDDER’s distinguishing element is its reward redistribution scheme.
- Essential technical lemmas establish local uniqueness and smoothness of the critic solution, and guarantee that the actor converges to a deterministic optimal policy for a finite value of the greedification parameter.
A plausible implication is that RUDDER’s update structure generalizes to other domains with late-arriving signals, provided a suitable return-decomposition model can be constructed.
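For reference, the step-size schedules referred to above are of the standard two-time-scale (Robbins–Monro) type; a sketch of the usual conditions, with $a_k$ the actor and $b_k$ the critic step sizes (not necessarily the paper's exact formulation), is
$$\sum_k a_k = \infty,\quad \sum_k a_k^2 < \infty,\qquad \sum_k b_k = \infty,\quad \sum_k b_k^2 < \infty,\qquad \frac{a_k}{b_k} \to 0,$$
so that the actor moves on the slower time scale, consistent with the slow-actor/fast-critic schedule described in Section 3.1.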
7. Practical Applications, Limitations, and Future Directions
RUDDER has demonstrated empirical impact on complex, temporally extended RL tasks with sparse rewards, and in production-level LVLMs where efficient and effective hallucination mitigation is required. In both domains, RUDDER achieves strong performance with minimal overhead, by reframing signal assignment problems in terms of residual update analysis.
Key practical considerations include:
- In RL, Align-RUDDER’s reliance on demonstration-derived profiles enables robust reward redistribution with very limited expert trajectories.
- In LVLMs, effectiveness depends on tuning hyperparameters: injection layer, maximum steering strength, gate sensitivity, and clamping range; model-specific calibration is generally required (Zou et al., 13 Nov 2025). A configuration sketch follows this list.
- RUDDER’s main limitation is its sensitivity to the quality of the underlying residual/return-decomposition model (either LSTM predictor or profile alignment), and the necessity of demonstration data or a high-precision prefill pass.
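As a rough illustration of the tuning surface, the LVLM-side knobs can be collected in a small configuration object (all names and default values below are hypothetical):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SteeringConfig:
    """Hypothetical hyperparameters for LVLM residual-update steering."""
    injection_layer: int = 20              # decoder block receiving the CARD injection
    max_strength: float = 0.1              # upper bound on the steering coefficient
    gate_sensitivity: float = 10.0         # how sharply the gate reacts to misalignment
    clamp_range: Tuple[float, float] = (0.0, 0.1)  # clamping range for injected magnitude
    answer_tokens_only: bool = True        # leave image/prompt tokens untouched
```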
Future work may automate hyperparameter optimization for LVLM steering, extend profile-based reward redistribution to more complicated demonstration sets, and explore theoretical links between return decomposition and information-theoretic attributions in sequence models.
References:
- (Patil et al., 2020): Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
- (Holzleitner et al., 2020): Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER
- (Zou et al., 13 Nov 2025): Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision LLMs