Fusion-in-Decoder (FiD) for Open-Domain QA
- Fusion-in-Decoder (FiD) is a retrieval-augmented generation architecture that encodes multiple passages independently and fuses them in the decoder.
- It enhances open-domain question answering by enabling multi-passage reasoning and mitigating issues like context fragmentation.
- FiD adapts its decoder cross-attention via temperature rescaling to handle variations in context quality, ensuring robust performance.
Fusion-in-Decoder (FiD) is a retrieval-augmented generation (RAG) architecture introduced for knowledge-intensive NLP tasks, particularly open-domain question answering (ODQA). It is characterized by independently encoding multiple retrieved passages and fusing their representations only at the decoder stage, enabling highly effective multi-passage reasoning and evidence integration. FiD is the backbone of numerous state-of-the-art systems and has inspired a wide array of extensions focusing on accuracy, efficiency, provenance, and evidence selection.
1. Architectural Principles of Fusion-in-Decoder
FiD operates on $N$ retrieved passages $p_1, \ldots, p_N$ and a question $q$. Each passage is prepended with $q$ (and, typically, the passage title) and encoded independently by a shared encoder (e.g., a T5 encoder). The representations from all passages are concatenated and provided as cross-attention keys/values to the decoder, which generates the answer auto-regressively. No positional encoding distinguishes passage order; the decoder's cross-attention must therefore reason jointly over all passages to extract and compose relevant evidence.
$$y = \mathrm{Dec}\big([\mathrm{Enc}(q; p_1); \ldots; \mathrm{Enc}(q; p_N)]\big),$$

where $\mathrm{Dec}$ is the decoder, $\mathrm{Enc}$ is the shared encoder, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the sequence dimension. This fusion-in-decoder design avoids the scaling and context-fragmentation problems of early retrieval-augmented models and enables straightforward processing of hundreds of passages at test time.
Key characteristics:
- Each passage is encoded in isolation, so encoder FLOPs scale linearly in the number of passages.
- The decoding stage is responsible for fusing all evidence and generating the final output.
- There is no architectural enforcement of passage relationships or relevance; the decoder aggregates and selects evidence via attention.
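The data flow above can be sketched at the shape level with toy numpy tensors. This is a minimal illustration, not the actual T5 implementation: `encode` is a hypothetical stand-in for the shared encoder, and the hidden size, passage length, and passage count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, n_passages = 16, 12, 4  # toy hidden size, per-passage length, passage count

def encode(question_plus_passage):
    """Stand-in for a shared T5 encoder: maps L input tokens to L hidden states."""
    return rng.standard_normal((L, d))

# 1) Encode each (question; passage) pair independently -> encoder cost is linear in n_passages.
encoder_outputs = [encode(f"question: q context: p{i}") for i in range(n_passages)]

# 2) Concatenate along the sequence axis; this is the only point of fusion.
fused = np.concatenate(encoder_outputs, axis=0)   # shape (n_passages * L, d)

# 3) The decoder cross-attends over the fused sequence at every generation step.
query = rng.standard_normal(d)                    # one decoder-token query
scores = fused @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                          # softmax over ALL passage tokens
context = weights @ fused                         # evidence-fused context vector

print(fused.shape, context.shape)
```

Note that the softmax in step 3 runs over every token of every passage jointly, which is what lets the decoder compose evidence across passages.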
2. Impact of Context Quantity and Quality
Experimental work demonstrates that context quality—the proportion of relevant passages among the input set—fundamentally determines FiD performance during training and evaluation. Formally,

$$\text{quality} = \frac{n_{\text{rel}}}{N},$$

where $n_{\text{rel}}$ is the count of relevant passages and $N$ is the total number of input passages. Results show that FiD models overfit to the context quality profile seen in training: evaluation performance peaks only when context quality at test time matches that of training. Significant divergence between training and evaluation context quality causes steep, monotonic declines in exact match (EM) performance. In contrast, the raw number of passages $N$ matters far less; models are robust to quantity per se but highly sensitive to quality.
When context quality varies (mixtures of relevant and spurious passages), the learned decoder attention can either become uniform (when trained on high-quality contexts) or selective (when trained on low-quality contexts). Overfitting manifests as a model that fails to adapt attention behavior when the support distribution at test time mismatches that of training.
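The quality ratio is trivial to compute, but making the train/test mismatch explicit helps. A minimal sketch, where the boolean relevance labels and the two retrieval regimes are hypothetical examples:

```python
def context_quality(passages):
    """Fraction of relevant passages among the retrieved set (n_rel / N)."""
    if not passages:
        return 0.0
    return sum(1 for is_relevant in passages if is_relevant) / len(passages)

# Hypothetical retrieval results: True = relevant, False = spurious.
train_contexts = [True, True, False, True]    # high-quality training regime
test_contexts  = [True, False, False, False]  # degraded retrieval at test time

print(context_quality(train_contexts))  # 0.75
print(context_quality(test_contexts))   # 0.25
```

A gap like 0.75 vs. 0.25 between regimes is exactly the shift under which the overfitting described above degrades EM.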
3. Cross-Attention Distribution and Overfitting Dynamics
Cross-attention in FiD's decoder displays characteristic patterns that reflect training context quality. An aggregate attention score for each passage $p$ in decoder layer $l$ can be written

$$A_{l,p} = \sum_{t} \sum_{j} \alpha^{(p)}_{l,t,j},$$

where $\alpha^{(p)}_{l,t,j}$ is the attention probability from the $t$-th decoder token to the $j$-th token in passage $p$.
- Low context quality training yields sharply selective cross-attention: The decoder prioritizes very few tokens/passages, often to the exclusion of others, resulting in brittle failures when more evidence is available at test time.
- High context quality training induces more uniform attention: The decoder distributes attention widely, sometimes failing to focus sharply when confronted with many spurious, low-quality passages.
These differences are magnified in higher decoder layers, which have been found to encode more task- and evidence-specific behaviors.
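Per-passage attention mass and its selectivity can be measured directly from an attention matrix. A small sketch assuming a single decoder layer; the toy attention matrix, passage layout, and the use of entropy as the selectivity measure are illustrative choices, not taken from the original work:

```python
import numpy as np

def passage_attention_scores(alpha, passage_ids):
    """Sum cross-attention mass per passage for one decoder layer.

    alpha:       (T_dec, T_enc) attention probabilities; each row sums to 1.
    passage_ids: (T_enc,) id of the passage each encoder token belongs to.
    """
    return {int(p): float(alpha[:, passage_ids == p].sum())
            for p in np.unique(passage_ids)}

def entropy(dist):
    """Shannon entropy of a distribution: low = selective, high = uniform."""
    dist = np.asarray(dist, dtype=float)
    dist = dist / dist.sum()
    return float(-(dist * np.log(dist + 1e-12)).sum())

# Toy example: 2 decoder tokens, 3 passages of 2 encoder tokens each.
passage_ids = np.array([0, 0, 1, 1, 2, 2])
selective = np.array([[0.90, 0.06, 0.01, 0.01, 0.01, 0.01],
                      [0.80, 0.10, 0.04, 0.02, 0.02, 0.02]])

scores = passage_attention_scores(selective, passage_ids)
print(scores)                              # nearly all mass on passage 0
print(entropy(list(scores.values())))      # low entropy -> sharp selectivity
```

Comparing this entropy across checkpoints trained on different quality regimes makes the selective-vs-uniform distinction quantitative.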
4. Mitigation: Cross-Attention Temperature Rescaling
To address the overfitting of attention to context quality, FiD can be adapted post hoc by introducing a temperature parameter $\tau$ into the cross-attention softmax:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\tau\sqrt{d}}\right)V.$$

Increasing $\tau$ produces a more uniform attention distribution, while lowering $\tau$ sharpens selectivity. This intervention adjusts decoder attention sharpness at inference without retraining, mitigating the degradation associated with context-quality shift. Empirical results confirm that with a cross-validated $\tau$, FiD models maintain much stronger EM scores under varying context regimes, substantially narrowing the quality-transfer gap.
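The effect of the temperature is easy to verify numerically. A minimal sketch on a vector of hypothetical cross-attention logits (the values and $\tau$ settings are illustrative):

```python
import numpy as np

def tempered_softmax(logits, tau=1.0):
    """Softmax with temperature tau: tau > 1 flattens, tau < 1 sharpens."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.5, 0.2])   # hypothetical per-token attention logits

sharp   = tempered_softmax(logits, tau=0.5)
default = tempered_softmax(logits, tau=1.0)
flat    = tempered_softmax(logits, tau=4.0)

# Higher tau spreads attention mass more uniformly across tokens/passages.
print(sharp.max() > default.max() > flat.max())  # True
```

Because $\tau$ only rescales the logits, it can be tuned by cross-validation on a small held-out set without touching the model weights.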
5. Implications for Training and Real-World Deployment
- Training on a fixed-quality context leads to overspecialized models and poses risks when retrieval system or document database characteristics shift post-deployment.
- Direct adaptation via cross-attention temperature scaling is a practical tool for hardening FiD against environmental drift, avoiding costly retraining.
- Broad relevance: The context quality overfitting phenomenon likely holds for other retrieval-augmented generators and knowledge-intensive tasks (dialogue, summarization, code, etc.), suggesting that attention rescaling and similar post-hoc adaptation mechanisms are of general utility.
6. Summary Table: Core Metrics and Adaptation Mechanisms
| Aspect | Formula / Method | Effect |
|---|---|---|
| Context quality | $\text{quality} = n_{\text{rel}} / N$ | Drives overfitting to the training regime |
| EM score | 1 if the predicted answer matches a gold answer, else 0 | Task evaluation |
| Cross-attention score | $A_{l,p} = \sum_t \sum_j \alpha^{(p)}_{l,t,j}$ | Reveals attention selectivity |
| Attention rescaling | Softmax temperature $\tau$ (see equation above) | Mitigates overfitting at inference |
7. Connections to Related Research Directions
- FiD's architectural paradigm has been widely extended, e.g., by introducing rationale classifiers and decoder guidance to combat reliance on spurious features (Wang et al., 2023), multi-granularity evidence discernment and dynamic pruning (Choi et al., 2024), or hardware/memory optimizations to enhance deployment efficiency (Jong et al., 2022; Hofstätter et al., 2022; Berchansky et al., 2023).
- The sensitivity to context quality highlights a broader challenge in RAG systems: robust aggregation and reasoning in the presence of variable evidence composition and reliability, and the need for flexible, schema-agnostic adaptation mechanisms.
FiD thus exemplifies both the potential and pitfalls of fusion-based retrieval-augmented generation and underscores the necessity of understanding and modulating model attention behavior as retrieval quality fluctuates throughout the model lifecycle.