
Rationale-Enhanced Decoding (RED) in LVLMs

Updated 3 July 2026
  • Rationale-Enhanced Decoding (RED) is an inference-time strategy that grounds predictions by conditioning on both visual inputs and model-generated rationales.
  • It employs a power-of-experts formulation that combines image-conditional and rationale-conditional probabilities via KL-constrained reward maximization.
  • Empirical results across multi-modal benchmarks show RED enhances reasoning accuracy and rationale faithfulness, despite doubling inference compute.

Rationale-Enhanced Decoding (RED) is an inference-time strategy for large vision-language models (LVLMs) performing multi-modal chain-of-thought (CoT) reasoning. Its central objective is to ground both reasoning and answer generation in the visual input and in the intermediate, model-generated rationale. RED achieves this by combining image-conditional and rationale-conditional distributions at each token generation step using a power-of-experts formulation derived from a KL-constrained reward maximization framework. The method is plug-and-play, requiring no architectural modifications or retraining, and is empirically shown to substantially improve both accuracy and rationale faithfulness across a variety of multi-modal benchmarks and LVLM backbones (Yamaguchi et al., 10 Jul 2025).

1. Mathematical Framework

RED formulates multi-modal CoT decoding as a KL-constrained maximization of the rationale-conditional token log-likelihood, balanced by proximity to the image-conditional distribution. Consider an auto-regressive LVLM with:

  • Image-conditional next-token distribution: $p_\theta(y_i \mid y_{<i}, x, q)$
  • Rationale-conditional next-token distribution: $p_\theta(y_i \mid y_{<i}, r, q)$

where $x$ is the image, $q$ the question prompt, $r$ the generated rationale, and $y_{<i} = (y_1, \ldots, y_{i-1})$ the prior output tokens.

RED seeks a decoding policy $\pi(y_i \mid s)$, with $s = (y_{<i}, x, r, q)$ denoting the conditioning context, that maximizes

$$\mathbb{E}_{y_i \sim \pi(\cdot \mid s)}\bigl[\log p_\theta(y_i \mid y_{<i}, r, q)\bigr] - \beta\, D_{KL}\bigl[\pi(\cdot \mid s) \,\|\, p_\theta(\cdot \mid y_{<i}, x, q)\bigr]$$

with $\beta > 0$ as the tradeoff hyperparameter. The optimal solution (see [Rafailov et al., NeurIPS '23]) yields

$$\hat{p}_\theta(y_i) = \frac{1}{Z}\, p_\theta(y_i \mid y_{<i}, x, q)\, \bigl[p_\theta(y_i \mid y_{<i}, r, q)\bigr]^{\lambda}$$

where $\lambda = 1/\beta$ and $Z$ normalizes the distribution. Practically, with logits for each expert, RED combines the log-softmax scores:

$$\hat{p}_\theta(y_i) = \mathrm{softmax}\Bigl[\log \mathrm{softmax}\bigl(\mathrm{logit}_\theta(y_i \mid y_{<i}, x, q)\bigr) + \lambda \log \mathrm{softmax}\bigl(\mathrm{logit}_\theta(y_i \mid y_{<i}, r, q)\bigr)\Bigr]$$

Token selection proceeds from the softmax of these combined logits. This policy is provably optimal for the KL-constrained objective and enforces consistent conditioning on both modalities (Yamaguchi et al., 10 Jul 2025).
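The optimality claim can be checked with a short derivation. The following is a sketch of the standard argument for KL-regularized objectives, not a reproduction of the paper's proof; the shorthand $p_x$, $p_r$ for the two experts and $J(\pi)$ for the objective is introduced here for brevity.

```latex
% Shorthand (introduced for this sketch):
%   p_x(y) = p_\theta(y \mid y_{<i}, x, q),   p_r(y) = p_\theta(y \mid y_{<i}, r, q)
\begin{align*}
J(\pi) &= \sum_{y} \pi(y) \log p_r(y) \;-\; \beta \sum_{y} \pi(y) \log \frac{\pi(y)}{p_x(y)} \\
       &= -\beta \sum_{y} \pi(y) \log \frac{\pi(y)}{p_x(y)\, p_r(y)^{1/\beta}} \\
       &= -\beta\, D_{KL}\bigl[\pi \,\|\, \hat{p}\bigr] + \beta \log Z,
\qquad
\hat{p}(y) = \frac{1}{Z}\, p_x(y)\, p_r(y)^{1/\beta},
\quad
Z = \sum_{y} p_x(y)\, p_r(y)^{1/\beta}.
\end{align*}
% Z does not depend on \pi, and D_{KL} \ge 0 with equality iff \pi = \hat{p},
% so J is maximized at \pi = \hat{p}. Setting \lambda = 1/\beta gives the
% power-of-experts form.
```

Read this way, a small $\beta$ (weak KL penalty) corresponds to a large $\lambda$, that is, stronger reliance on the rationale-conditional expert.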

2. Algorithmic Description

Algorithmically, RED augments the standard LVLM inference pipeline by introducing an extra forward pass per decoding step. The method proceeds as follows:

  1. Generate the intermediate rationale $r$ conditioned on $(x, q)$.
  2. For each output token $y_i$:
    • Compute image-conditional logits $\mathrm{logit}_\theta(y_i \mid y_{<i}, x, q)$
    • Compute rationale-conditional logits $\mathrm{logit}_\theta(y_i \mid y_{<i}, r, q)$
    • Combine via $\log \mathrm{softmax}\bigl(\mathrm{logit}_\theta(y_i \mid y_{<i}, x, q)\bigr) + \lambda \log \mathrm{softmax}\bigl(\mathrm{logit}_\theta(y_i \mid y_{<i}, r, q)\bigr)$
    • Sample or select $y_i$ from $\hat{p}_\theta(y_i)$, the softmax of the combined scores
  3. Repeat until sequence completion.
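The per-token loop above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `image_logits_fn` and `rationale_logits_fn` are hypothetical callables standing in for the two LVLM forward passes, the toy functions return made-up logits over a four-token vocabulary, and greedy selection is used in place of sampling.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def red_next_token_probs(image_logits, rationale_logits, lam):
    """Power-of-experts combination: p_hat proportional to p_img * p_rat**lam."""
    combined = log_softmax(image_logits) + lam * log_softmax(rationale_logits)
    return np.exp(log_softmax(combined))  # renormalize (divides by Z)

def red_greedy_decode(image_logits_fn, rationale_logits_fn, lam, eos_id, max_len=32):
    """Greedy RED decoding; each callable maps a token prefix to vocabulary logits."""
    prefix = []
    for _ in range(max_len):
        probs = red_next_token_probs(
            image_logits_fn(prefix), rationale_logits_fn(prefix), lam
        )
        token = int(np.argmax(probs))
        prefix.append(token)
        if token == eos_id:
            break
    return prefix

# Hypothetical stand-ins for the two forward passes (toy logits, vocab size 4).
def toy_image_logits(prefix):
    if not prefix:
        return np.array([2.0, 1.0, 0.0, -1.0])   # image expert prefers token 0
    return np.array([-1.0, 0.0, 0.0, 3.0])       # then prefers EOS (token 3)

def toy_rationale_logits(prefix):
    if not prefix:
        return np.array([0.0, 3.0, 0.0, -1.0])   # rationale expert prefers token 1
    return np.array([-1.0, 0.0, 0.0, 3.0])

tokens = red_greedy_decode(toy_image_logits, toy_rationale_logits, lam=1.0, eos_id=3)
print(tokens)
```

In this toy setup the image expert alone would pick token 0 at the first step, while the combined distribution picks token 1 because the rationale expert strongly favors it. In a real system the two callables would share the prefix tokens but differ in whether the image or the rationale is in the context.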

The hyperparameter $\lambda$ tunes reliance on the rationale: higher $\lambda$ prioritizes rationale-conditionality, but excessively large values diminish image dependence. Empirically, a suitably tuned $\lambda$ provides robust improvements without sacrificing visual grounding (Yamaguchi et al., 10 Jul 2025).
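The effect of the rationale weight (written `lam` below, corresponding to the exponent on the rationale-conditional expert) can be seen on a toy example. The two distributions are invented for illustration and the swept values are arbitrary, not settings reported in the paper.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

def red_probs(p_img, p_rat, lam):
    """Normalized p_img * p_rat**lam, computed in log space."""
    return softmax(np.log(p_img) + lam * np.log(p_rat))

p_img = np.array([0.6, 0.3, 0.1])  # image-conditional expert (toy values)
p_rat = np.array([0.2, 0.7, 0.1])  # rationale-conditional expert (toy values)

for lam in (0.0, 0.5, 1.0, 4.0):
    print(f"lam={lam}: {np.round(red_probs(p_img, p_rat, lam), 3)}")
```

With `lam = 0` the rationale expert has no effect and the image-conditional distribution is returned unchanged. As `lam` grows, mass shifts toward the token the rationale expert prefers, and for large `lam` the result is dominated by that expert, which is the loss of image dependence described above.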

3. Theoretical Analysis and Comparison to CoT

Standard CoT in LVLMs feeds $(x, r, q)$ as a concatenated prompt and samples from $p_\theta(y_i \mid y_{<i}, x, r, q)$. Empirical analysis reveals that LVLMs often ignore the rationale, instead defaulting to image and prompt cues due to attention sinks and position biases. RED addresses this by explicitly factoring the conditional into two separate experts, one conditioned on the image and one on the rationale, and combining their token distributions multiplicatively.

This "AND-like" power-of-experts operation produces next-token probabilities that are simultaneously high under both image-conditional and rationale-conditional experts, thus enforcing genuine rationale grounding. The KL-constraint derivation confers optimality: sampling from the RED distribution is equivalent to maximizing rationale log-likelihood penalized by KL to the image-conditional policy (Yamaguchi et al., 10 Jul 2025).

Alternative approaches, such as rationale-only decoding, mixture-of-experts, or reversed power, were found to underperform the RED formulation, indicating the importance of the explicit power-of-experts structure.
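The difference between a multiplicative ("AND-like") combination and an additive mixture can be illustrated numerically. The sketch below uses invented distributions, and the equal-weight mixture is an assumed form of the mixture baseline, not necessarily the one evaluated in the paper.

```python
import numpy as np

def power_of_experts(p_img, p_rat, lam=1.0):
    """Multiplicative combination: a token needs support from both experts."""
    weights = p_img * p_rat ** lam
    return weights / weights.sum()

def mixture_of_experts(p_img, p_rat, alpha=0.5):
    """Additive combination (assumed form): support from either expert suffices."""
    return alpha * p_img + (1.0 - alpha) * p_rat

p_img = np.array([0.70, 0.25, 0.05])  # image expert favors token 0
p_rat = np.array([0.02, 0.58, 0.40])  # rationale expert nearly rules token 0 out

poe = power_of_experts(p_img, p_rat)
moe = mixture_of_experts(p_img, p_rat)
print("power-of-experts:", np.round(poe, 3))
print("mixture:         ", np.round(moe, 3))
```

Token 0 keeps a large share of the mass under the mixture because one expert alone supports it, whereas the product suppresses it and concentrates on token 1, the only token that both experts rate reasonably highly.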

4. Empirical Findings

RED was evaluated on six prominent multi-modal reasoning benchmarks:

| Benchmark | Task Domain | Notable Effect of RED |
|---|---|---|
| GQA | Visual question answering | Across LVLMs, RED + CoT/CCoT yields the best accuracy |
| TextVQA | Text-based VQA | Significant gains over baseline CoT |
| MME | Perception & Cognition | Consistent improvements in reasoning |
| SEED-I | Diverse fine-grained | Enhances text-understanding, spatial-relations |
| LLaVA-Bench | Multi-modal benchmarks | Superior to naive CoT and CCoT |
| MM-Vet | VQA and capabilities | Robust gains with model scaling |

On GQA with Gemma-3-12B, for instance, baseline accuracy was 45.34, with RED-augmented CoT reaching 46.07 (+0.73) and CCoT+RED achieving 47.50 (+2.16). RED's benefits persisted across model backbones (Gemma-3, Qwen-2.5-VL, Llama3-LLaVA-Next) and remained robust as model size increased up to 72B parameters: standard CoT/CCoT tended to plateau or degrade at scale, whereas RED maintained upward trends (Yamaguchi et al., 10 Jul 2025, Figure 1).

Further, intervention experiments showed that RED is highly sensitive to rationale quality: supplying a high-quality GPT-4 rationale improved performance, whereas random rationales degraded it substantially. This sensitivity was not observed in naive CoT/CCoT, which largely ignore rationale content.

Additionally, although not explicitly designed for hallucination reduction, RED matched or exceeded specialized baselines (VCD, ICD) on MMHal and POPE hallucination metrics while simultaneously boosting reasoning performance.

5. Practical Considerations and Limitations

RED can be integrated into any LVLM inference system capable of generating a rationale and performing two distinct forward passes per decoding step. No retraining or architectural modification is required; all changes are confined to the inference-time logic. The only new hyperparameter is $\lambda$, which directly regulates the rationale's influence.

Implementation requires approximately double the inference compute per generated token due to the extra forward pass. Potential avenues for future work include distillation or caching strategies to mitigate computational overhead.

The effectiveness of RED is contingent upon the quality and unbiasedness of the generated rationale $r$. If the rationale is incorrect or biased, RED will amplify those errors in the final prediction. Approaches such as improved rationale-generation methods or combined human–machine rationale vetting could further enhance robustness and reliability.

6. Significance for Multi-modal Reasoning Systems

RED represents a rigorously justified, lightweight inference method for ensuring LVLMs perform reasoned, rationale-grounded multi-modal predictions. By provably enforcing simultaneous conditioning on visual input and chain-of-thought rationales, RED consistently yields gains in accuracy, interpretability, and faithfulness across challenging VQA and reasoning benchmarks. This suggests that RED can serve as a robust, general-purpose enhancement for any system employing chain-of-thought prompting in multi-modal LLMs, and may inform further research on integrating explicit intermediate reasoning in auto-regressive generation (Yamaguchi et al., 10 Jul 2025).
