Rationale-Enhanced Decoding (RED) in LVLMs

Updated 3 July 2026

Rationale-Enhanced Decoding (RED) is an inference-time strategy that grounds predictions by conditioning on both visual inputs and model-generated rationales.
It employs a power-of-experts formulation that combines image-conditional and rationale-conditional probabilities via KL-constrained reward maximization.
Empirical results across multi-modal benchmarks show RED enhances reasoning accuracy and rationale faithfulness, despite doubling inference compute.

Rationale-Enhanced Decoding (RED) is an inference-time strategy for large vision-language models (LVLMs) performing multi-modal chain-of-thought (CoT) reasoning. Its central objective is to ensure that reasoning and answer generation are grounded in both the visual input and the intermediate, model-generated rationale. RED achieves this by combining image-conditional and rationale-conditional distributions at each token generation step, using a power-of-experts formulation derived from a KL-constrained reward maximization framework. The method is plug-and-play, requiring no architectural modifications or retraining, and is empirically shown to improve both accuracy and rationale faithfulness across a variety of multi-modal benchmarks and LVLM backbones (Yamaguchi et al., 10 Jul 2025).

1. Mathematical Framework

RED formulates multi-modal CoT decoding as a KL-constrained maximization of the rationale-conditional token log-likelihood, balanced by proximity to the image-conditional distribution. Consider an auto-regressive LVLM with:

Image-conditional next-token distribution: $p_\theta(y_i | y_{<i}, x, q)$
Rationale-conditional next-token distribution: $p_\theta(y_i | y_{<i}, r, q)$

where $x$ is the image, $q$ the question prompt, $r$ the generated rationale, and $y_{<i} = (y_1, \ldots, y_{i-1})$ the prior output tokens.

RED seeks a decoding policy $\pi(y_i | s)$, where $s = (y_{<i}, x, r, q)$ denotes the full decoding context, that maximizes

$\mathbb{E}_{y_i \sim \pi(\cdot|s)}\bigl[\log p_\theta(y_i|y_{<i}, r, q)\bigr] - \beta D_{KL}\big[\pi(\cdot|s) \,\|\, p_\theta(\cdot|y_{<i}, x, q)\big]$

with $\beta > 0$ as the tradeoff hyperparameter. The optimal solution (see [Rafailov et al., NeurIPS '23]) yields

$\hat{p}_\theta(y_i) = \frac{1}{Z} p_\theta(y_i|y_{<i},x,q)\, [p_\theta(y_i|y_{<i}, r, q)]^\lambda$

where $\lambda = 1/\beta$ and $Z$ normalizes the distribution. Practically, with logits for each expert, RED combines the log-softmax scores:

$\log p_\theta(y_i | y_{<i}, x, q) + \lambda \log p_\theta(y_i | y_{<i}, r, q)$

Token selection proceeds from the softmax of these combined logits. This policy is provably optimal for the KL-constrained objective and enforces consistent conditioning on both modalities (Yamaguchi et al., 10 Jul 2025).
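The combination step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names `log_softmax` and `red_combine`, the parameter name `lam` (standing for $\lambda$), and the toy three-token logits are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def red_combine(image_logits, rationale_logits, lam=1.0):
    """Return normalized log-probabilities of the combined distribution.

    Computes log p_img + lam * log p_rat per token, then renormalizes
    (the final log_softmax subtracts log Z).
    """
    lp_img = log_softmax(image_logits)
    lp_rat = log_softmax(rationale_logits)
    combined = [a + lam * b for a, b in zip(lp_img, lp_rat)]
    return log_softmax(combined)

# Toy 3-token vocabulary; the numbers are illustrative only.
image_logits = [2.0, 1.0, 0.0]      # image expert prefers token 0
rationale_logits = [0.0, 3.0, 0.0]  # rationale expert prefers token 1

red_probs = [math.exp(v) for v in red_combine(image_logits, rationale_logits, lam=1.0)]
best_token = max(range(len(red_probs)), key=red_probs.__getitem__)
```

With `lam=0.0` the rationale term vanishes and the result is just the image-conditional softmax, which is a convenient sanity check for any implementation.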

2. Algorithmic Description

Algorithmically, RED augments the standard LVLM inference pipeline by introducing an extra forward pass per decoding step. The method proceeds as follows:

Generate the intermediate rationale $r$ conditioned on $(x, q)$.
For each output token $y_i$:
- Compute image-conditional logits, giving $p_\theta(y_i | y_{<i}, x, q)$
- Compute rationale-conditional logits, giving $p_\theta(y_i | y_{<i}, r, q)$
- Combine via $\log p_\theta(y_i | y_{<i}, x, q) + \lambda \log p_\theta(y_i | y_{<i}, r, q)$
- Sample or select $y_i$ from $\hat{p}_\theta(y_i)$
Repeat until sequence completion.
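The loop above might look roughly like the following greedy-decoding sketch. It assumes the rationale has already been generated and that each expert is exposed as a callable mapping the current output prefix to a list of vocabulary logits; `red_greedy_decode`, `toy_image_expert`, `toy_rationale_expert`, and the three-token vocabulary are hypothetical stand-ins for real LVLM forward passes.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def red_greedy_decode(image_expert, rationale_expert, lam, eos_id, max_len=16):
    """Greedy decoding with two forward passes per step.

    image_expert(prefix)     -> logits conditioned on (prefix, image, question)
    rationale_expert(prefix) -> logits conditioned on (prefix, rationale, question)
    Both callables are assumptions standing in for real model calls.
    """
    prefix = []
    for _ in range(max_len):
        lp_img = log_softmax(image_expert(prefix))
        lp_rat = log_softmax(rationale_expert(prefix))
        scores = [a + lam * b for a, b in zip(lp_img, lp_rat)]
        token = max(range(len(scores)), key=scores.__getitem__)  # greedy selection
        prefix.append(token)
        if token == eos_id:
            break
    return prefix

# Toy experts over a 3-token vocabulary where token 2 is end-of-sequence.
def toy_image_expert(prefix):
    return [1.0, 0.8, -2.0] if not prefix else [-2.0, -2.0, 3.0]

def toy_rationale_expert(prefix):
    return [0.0, 2.0, -2.0] if not prefix else [-2.0, -2.0, 3.0]

with_rationale = red_greedy_decode(toy_image_expert, toy_rationale_expert, lam=1.0, eos_id=2)
image_only = red_greedy_decode(toy_image_expert, toy_rationale_expert, lam=0.0, eos_id=2)
```

In this toy setup the image expert alone slightly prefers token 0 at the first step, while the combined score picks token 1 because the rationale expert strongly favors it. Sampling instead of greedy selection would replace the `max` call with a draw from the softmax of `scores`.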

The hyperparameter $\lambda$ tunes reliance on the rationale: higher $\lambda$ prioritizes rationale-conditionality, but excessively large values diminish image dependence. Empirically, an appropriately chosen $\lambda$ provides robust improvements without sacrificing visual grounding (Yamaguchi et al., 10 Jul 2025).
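The effect of the rationale weight can be seen on a toy example. The sketch below sweeps a hypothetical weight `lam` (standing for $\lambda$) over a few arbitrary values; the two three-token distributions and the swept values are illustrative assumptions, not settings from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def red_probs(p_img, p_rat, lam):
    """Power-of-experts: p_img * p_rat**lam, renormalized (computed in log space)."""
    scores = [math.log(a) + lam * math.log(b) for a, b in zip(p_img, p_rat)]
    return softmax(scores)

p_img = [0.6, 0.3, 0.1]  # image expert prefers token 0
p_rat = [0.2, 0.7, 0.1]  # rationale expert prefers token 1

# Arbitrary illustrative weights; lam = 0 ignores the rationale entirely.
sweep = {lam: red_probs(p_img, p_rat, lam) for lam in (0.0, 0.5, 1.0, 4.0)}
for lam, probs in sweep.items():
    print(lam, [round(p, 3) for p in probs])
```

As `lam` grows, probability mass moves toward the token favored by the rationale expert, and at the largest value the image expert's preferred token is almost entirely suppressed, which mirrors the loss of image dependence described above.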

3. Theoretical Analysis and Comparison to CoT

Standard CoT in LVLMs feeds $(x, r, q)$ as a concatenated prompt and samples from $p_\theta(y_i | y_{<i}, x, r, q)$. Empirical analysis reveals that LVLMs often ignore the rationale, instead defaulting to image and prompt cues due to attention sinks and position biases. RED addresses this by explicitly factoring the conditional into two separate experts, one per modality/context, and combining their token distributions multiplicatively.

This "AND-like" power-of-experts operation produces next-token probabilities that are simultaneously high under both image-conditional and rationale-conditional experts, thus enforcing genuine rationale grounding. The KL-constraint derivation confers optimality: sampling from the RED distribution is equivalent to maximizing rationale log-likelihood penalized by KL to the image-conditional policy (Yamaguchi et al., 10 Jul 2025).
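The optimality claim can be checked with a short derivation. The sketch below assumes that $s$ denotes the full context $(y_{<i}, x, r, q)$, that the sum runs over candidate next tokens $y$, and that $\lambda = 1/\beta$; it follows the standard argument for KL-regularized reward maximization rather than reproducing the paper's own proof.

```latex
\begin{aligned}
J(\pi)
&= \sum_{y} \pi(y|s)\,\log p_\theta(y|y_{<i},r,q)
   \;-\; \beta \sum_{y} \pi(y|s)\,\log\frac{\pi(y|s)}{p_\theta(y|y_{<i},x,q)} \\
&= -\beta \sum_{y} \pi(y|s)\,
   \log\frac{\pi(y|s)}{p_\theta(y|y_{<i},x,q)\,\bigl[p_\theta(y|y_{<i},r,q)\bigr]^{1/\beta}} \\
&= -\beta\, D_{KL}\bigl[\pi(\cdot|s)\,\|\,\hat{p}_\theta\bigr] \;+\; \beta \log Z,
\qquad
\hat{p}_\theta(y) = \frac{1}{Z}\, p_\theta(y|y_{<i},x,q)\,\bigl[p_\theta(y|y_{<i},r,q)\bigr]^{1/\beta},
\quad
Z = \sum_{y} p_\theta(y|y_{<i},x,q)\,\bigl[p_\theta(y|y_{<i},r,q)\bigr]^{1/\beta}.
\end{aligned}
```

Since $\beta \log Z$ does not depend on $\pi$ and the KL term is non-negative, $J(\pi)$ is maximized exactly when $\pi = \hat{p}_\theta$, which is the power-of-experts distribution with exponent $\lambda = 1/\beta$.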

Alternative approaches, such as rationale-only decoding, mixture-of-experts, or reversed power, were found to underperform the RED formulation, signifying the importance of the explicit power-of-experts structure.
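The difference between a multiplicative and an additive combination can be illustrated numerically. In the sketch below, the mixture is interpreted as an equally weighted average of the two experts, which is an assumption about what the mixture baseline looks like; the function names and toy distributions are likewise illustrative.

```python
def product_of_experts(p_img, p_rat, lam=1.0):
    """AND-like combination: p_img * p_rat**lam, renormalized."""
    unnorm = [a * (b ** lam) for a, b in zip(p_img, p_rat)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mixture_of_experts(p_img, p_rat, weight=0.5):
    """OR-like combination: weighted average of the two distributions."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(p_img, p_rat)]

# Token 0 is supported mainly by the image expert, token 2 mainly by the
# rationale expert, and token 1 by both.
p_img = [0.50, 0.40, 0.10]
p_rat = [0.05, 0.60, 0.35]

poe = product_of_experts(p_img, p_rat)
mix = mixture_of_experts(p_img, p_rat)
print("product:", [round(p, 3) for p in poe])
print("mixture:", [round(p, 3) for p in mix])
```

The product concentrates mass on token 1, the only token both experts support, whereas the mixture still assigns substantial probability to tokens that only one expert favors. This is the sense in which the multiplicative form behaves like a logical AND.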

4. Empirical Findings

RED was evaluated on six prominent multi-modal reasoning benchmarks:

Benchmark	Task Domain	Notable Effect of RED
GQA	Visual question answering	Across LVLMs, RED + CoT/CCoT yields the best accuracy
TextVQA	Text-based VQA	Significant gains over baseline CoT
MME	Perception & Cognition	Consistent improvements in reasoning
SEED-I	Diverse fine-grained	Enhances text-understanding, spatial-relations
LLaVA-Bench	Multi-modal benchmarks	Superior to naive CoT and CCoT
MM-Vet	VQA and capabilities	Robust gains with model scaling

On GQA with Gemma-3-12B, for instance, baseline accuracy was 45.34, while RED-augmented CoT reached 46.07 and CCoT+RED reached 47.50. RED's benefits persisted across model backbones (Gemma-3, Qwen-2.5-VL, Llama3-LLaVA-Next) and remained robust as model size increased up to 72B parameters: standard CoT/CCoT tended to plateau or degrade at scale, whereas RED maintained upward trends (see Figure 1 in Yamaguchi et al., 10 Jul 2025).

Further, intervention experiments showed that RED is highly sensitive to rationale quality: supplying a high-quality GPT-4 rationale improved performance, whereas random rationales degraded it substantially. This sensitivity was not observed in naive CoT/CCoT, which largely ignore rationale content.

Additionally, although not explicitly designed for hallucination reduction, RED matched or exceeded specialized baselines (VCD, ICD) on MMHal and POPE hallucination metrics while simultaneously boosting reasoning performance.

5. Practical Considerations and Limitations

RED can be integrated into any LVLM inference system that can generate a rationale and perform two distinct forward passes per decoding step. No retraining or architectural modification is required; all changes are confined to inference-time logic. The only new hyperparameter is $\lambda$, which directly regulates the rationale's influence.

Implementation requires approximately double the inference compute per generated token due to the extra forward pass. Potential avenues for future work include distillation or caching strategies to mitigate computational overhead.

The effectiveness of RED is contingent upon the quality and unbiasedness of the generated rationale $r$. If the rationale is incorrect or biased, RED will amplify those errors in the final prediction. Approaches such as improved rationale-generation methods or combined human–machine rationale vetting could further enhance robustness and reliability.

RED represents a rigorously justified, lightweight inference method for ensuring LVLMs perform reasoned, rationale-grounded multi-modal predictions. By provably enforcing simultaneous conditioning on visual input and chain-of-thought rationales, RED consistently yields gains in accuracy, interpretability, and faithfulness across challenging VQA and reasoning benchmarks. This suggests that RED can serve as a robust, general-purpose enhancement for any system employing chain-of-thought prompting in multi-modal LLMs, and may inform further research on integrating explicit intermediate reasoning in auto-regressive generation (Yamaguchi et al., 10 Jul 2025).


References (1)

Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought (2025)