Fusion-in-Decoder (FiD) Architecture
- Fusion-in-Decoder (FiD) is a retrieval-augmented sequence-to-sequence architecture that processes multiple passages by encoding each independently and fusing the representations in the decoder’s cross-attention layers.
- It enables scalable joint reasoning over long or multi-document contexts, supporting applications like open-domain question answering and fact verification with dense retrieval.
- Optimized variants such as FiD-Light and FiDO employ compression, layer-sparse cross-attention, and token pruning to significantly reduce latency while maintaining high accuracy.
Fusion-in-Decoder (FiD) is a retrieval-augmented sequence-to-sequence architecture that enables LLMs to process and generate responses conditioned on multiple retrieved passages. The core innovation is that passages are not concatenated at the encoder input; instead, each passage is encoded independently, and the encoder outputs are joined into a single attention pool for the decoder. This allows scalable joint reasoning over long or multi-document contexts, directly leveraging dense retrieval. FiD and its variants are widely used in open-domain question answering, knowledge-intensive text generation, rationale extraction, and fact verification.
1. Core Architecture and Mechanisms
The FiD architecture relies on parallel, independent encoding of each retrieved text passage or chunk, followed by fusion exclusively within the decoder’s cross-attention layers. For an input question $q$ and retrieved passages $p_1, \dots, p_N$, the forward pipeline is as follows:
- Encoding: Each “[query; context]” pair $[q; p_i]$ is independently processed by a shared Transformer encoder to produce a matrix of hidden vectors $E_i \in \mathbb{R}^{L \times d}$.
- Fusion: All encoder outputs are concatenated (not summed or pooled), yielding a single matrix $E = [E_1; E_2; \dots; E_N] \in \mathbb{R}^{NL \times d}$.
- Decoding with Fusion: At each auto-regressive decoding timestep $t$, the Transformer decoder performs cross-attention over the entire $E$, integrating information from all passages. The attention mechanism operates as:
$$\mathrm{CrossAttn}(h_{t-1}, E) = \mathrm{softmax}\!\left(\frac{(h_{t-1} W_Q)(E W_K)^{\top}}{\sqrt{d}}\right) E W_V,$$
where $h_{t-1}$ is the decoder state at the previous time step.
This structure is more scalable than concatenating all content into one encoder context, as it avoids the encoder’s quadratic attention cost over the full concatenated length and achieves finer-grained memory organization (Jong et al., 2022, Lakhotia et al., 2020, Wang et al., 2023). The fusion mechanism can be implemented either as “concatenate-then-attend” or as “attend-each-then-sum” under a shared softmax normalization, the two being functionally equivalent up to ordering and cost.
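The pipeline above can be sketched in a few lines of NumPy. The `encode` lookup table is a hypothetical stand-in for the real shared Transformer encoder, and a single attention head with no learned projections is used to keep the fusion step visible:

```python
import numpy as np

def encode(tokens, d=8, seed=0):
    # Stand-in for the shared Transformer encoder (illustration only):
    # maps token ids to deterministic pseudo-random d-dim hidden vectors.
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((1000, d))
    return table[np.asarray(tokens) % 1000]

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def fid_cross_attention(decoder_state, passages):
    # 1) Encode each [query; passage] chunk independently.
    encoder_outputs = [encode(p) for p in passages]   # N matrices of shape (L_i, d)
    # 2) Fuse by concatenating along the sequence axis (no summing or pooling).
    fused = np.concatenate(encoder_outputs, axis=0)   # (sum L_i, d)
    # 3) Decoder cross-attention over the entire fused pool:
    #    one softmax distribution spanning ALL passages at once.
    scores = fused @ decoder_state / np.sqrt(fused.shape[1])
    weights = softmax(scores)
    return weights @ fused                            # (d,) context vector

passages = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
ctx = fid_cross_attention(np.ones(8), passages)
print(ctx.shape)  # (8,)
```

The key property is step 2: because passages never meet in the encoder, encoding is embarrassingly parallel, and only the decoder pays for the joint context.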
2. Computational Complexity and Bottlenecks
FiD’s main computational expense arises in the decoding phase, especially when handling many retrieved passages:
- Encoder cost: $O(N \cdot L^2 \cdot d)$ for $N$ passages of length $L$, with all passages encoded in parallel.
- Decoder cost: Each output token requires a cross-attention of length $N \cdot L$, leading to a cost of $O(T \cdot N \cdot L \cdot d)$ for $T$ output tokens.
Empirical profiling for T5-Base with 40 passages (∼10,000 encoder tokens) yields an encoder latency of 50 ms and a decoder cost of 600 ms, with decoding responsible for over 90% of inference time (Hofstätter et al., 2022, Jong et al., 2022). This imbalance results largely from memory-bandwidth constraints: cross-attention must load all encoder keys and values in every decoder layer for every generated token, and the problem worsens as the passage count increases (Jong et al., 2022, Berchansky et al., 2023).
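The bandwidth argument can be made concrete with a back-of-envelope model of the key/value traffic per generated token. All sizes below (T5-Base-like `d_model`, layer count, fp16 storage) are illustrative assumptions, not profiled values:

```python
def kv_bytes_per_token(n_passages, tokens_per_passage=250, d_model=768,
                       n_layers=12, bytes_per_value=2):
    # Bytes of encoder keys and values streamed through cross-attention
    # for ONE generated token, assuming fp16 and no caching tricks.
    fused_len = n_passages * tokens_per_passage
    per_layer = 2 * fused_len * d_model * bytes_per_value  # keys + values
    return n_layers * per_layer

for n in (10, 40, 100):
    print(f"{n:>3} passages: {kv_bytes_per_token(n) / 1e9:.2f} GB per output token")
```

Under these assumptions, 40 passages already imply roughly 0.37 GB of key/value reads per output token, which explains why decoding, not encoding, dominates latency.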
3. Major Variants and Efficiency Solutions
Several optimizations and variants have been introduced to address the fundamental efficiency bottlenecks in FiD:
FiD-Light
- Compression: Compress each encoded passage to its first $k$ vectors before concatenation, e.g., by truncation: $\hat{E}_i = E_i[{:}k]$.
- Impact: The cross-attention length is reduced from $N \cdot L$ to $N \cdot k$, dramatically lowering compute and memory-bandwidth demands.
- Empirical Results: With $k = 8$ (T5-Base, 40 passages), total inference time drops from 650 ms to 100–130 ms while retaining baseline accuracy. Larger backbones allow FiD-Light to exceed baseline FiD accuracy while remaining several times faster (Hofstätter et al., 2022).
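The first-$k$ compression step is a one-liner; a sketch with zero matrices standing in for real encoder outputs shows how sharply the fused attention length shrinks:

```python
import numpy as np

def fid_light_fuse(encoder_outputs, k):
    # FiD-Light-style compression sketch: keep only the first k hidden
    # vectors of each encoded passage, then concatenate as usual.
    return np.concatenate([E[:k] for E in encoder_outputs], axis=0)

N, L, d, k = 40, 250, 16, 8
outputs = [np.zeros((L, d)) for _ in range(N)]     # stand-ins for encoder outputs
full = np.concatenate(outputs, axis=0)
light = fid_light_fuse(outputs, k)
print(full.shape, light.shape)  # cross-attention length: 10000 -> 320
```

Since decoder cost scales linearly with the fused length, the $10000 \to 320$ reduction translates almost directly into the reported latency drop.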
FiDO: Fusion-in-Decoder Optimized
- Layer-Sparse Cross-Attention (LSA): Only a fraction ($1/K$) of decoder layers perform cross-attention; others skip cross-attention altogether.
- Multi-Query Attention (MQA): Share key/value projections across decoder heads, reducing key/value data movement by a factor of $h$ (the number of attention heads).
- Asymmetric Architectures: Shift FLOPs from the encoder to a larger decoder, since the decoder is the dominant bottleneck.
- Speedup: Jointly, LSA and MQA yield a measured speedup of roughly an order of magnitude; combined with decoder scale-up, FiDO-Base/XL achieves the accuracy of a much larger FiD model at significantly reduced inference latency (Jong et al., 2022).
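Both FiDO ingredients reduce to simple bookkeeping. The sketch below is a simplification under stated assumptions: which decoder layers retain cross-attention under LSA is a design choice (here, every $K$-th layer), and the MQA factor only models key/value read volume, not compute:

```python
def lsa_layers(n_layers, K):
    # Layer-sparse cross-attention sketch: keep cross-attention only in
    # every K-th decoder layer (placement is an assumption of this sketch).
    return [i for i in range(n_layers) if i % K == K - 1]

def kv_read_factor(n_heads, multi_query):
    # Multi-query attention shares a single key/value head across all
    # query heads, cutting K/V traffic by the head count.
    return 1 / n_heads if multi_query else 1.0

print(lsa_layers(12, 6))         # [5, 11]: 2 of 12 layers still cross-attend
print(kv_read_factor(12, True))  # 1/12 of the baseline K/V reads
```

Multiplying the two factors (2/12 of the layers, 1/12 of the K/V reads per layer) shows why the combination, not either trick alone, delivers the order-of-magnitude speedup.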
Token Elimination and Pruning
- Token Elimination: Dynamically drops encoder tokens with low cross-attention scores during decoding, substantially shortening the attention pool with minimal accuracy loss.
- Combined with dynamic early exiting (e.g., CALM), this reduces decoding latency considerably with little to no drop in ROUGE-L, and sometimes even improves performance (Berchansky et al., 2023).
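A minimal sketch of the elimination step, assuming accumulated cross-attention mass is available as a relevance signal and that a fixed `keep_fraction` (a hypothetical parameter of this sketch) controls the budget:

```python
import numpy as np

def eliminate_tokens(fused, attention_mass, keep_fraction):
    # Token-elimination sketch: retain the encoder tokens that have
    # received the most cross-attention mass so far, preserving order.
    keep = max(1, int(len(attention_mass) * keep_fraction))
    idx = np.sort(np.argsort(attention_mass)[-keep:])
    return fused[idx], idx

rng = np.random.default_rng(0)
fused = rng.standard_normal((1000, 16))   # stand-in fused encoder states
mass = rng.random(1000)                   # stand-in attention statistics
pruned, kept = eliminate_tokens(fused, mass, 0.4)
print(pruned.shape)  # (400, 16)
```

After pruning, subsequent decoding steps cross-attend over `pruned` instead of `fused`, so the saving compounds over the whole output sequence.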
Multi-Granularity Evidence Guidance (MGFiD, RFiD, FiD-Ex)
- MGFiD: Applies passage-level re-ranking and sentence-level classification. An anchor vector distilled from key evidence sentences is injected at decoding time, and low-ranked passages are pruned for efficiency. Passage pruning reduces decoder cost by roughly 70% with minimal EM loss; the total improvement over FiD-KD is +3.5 EM (NQ) and +1.0 EM (TQA) (Choi et al., 2024).
- RFiD: Adds causal/spurious labels to each passage, appends rationale-guided embeddings, and jointly trains evidence prediction. Outperforms FiD by +1.5 EM (NQ) and +0.7 EM (TQA), increasing decoder focus on causal passages (Wang et al., 2023).
- FiD-Ex: Injects sentence markers and restricts decoder generation to extractive rationales, substantially improving factuality and rationale faithfulness (Lakhotia et al., 2020).
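The FiD-Ex marker mechanism is a preprocessing step. A sketch, assuming a simple `sentN:` prefix format (the exact marker syntax is an assumption of this illustration):

```python
def mark_sentences(sentences, start=0):
    # FiD-Ex-style sketch: prefix each evidence sentence with a unique
    # marker so the decoder can cite sentence ids instead of paraphrasing.
    return " ".join(f"sent{start + i}: {s}" for i, s in enumerate(sentences))

marked = mark_sentences(["The sky is blue.", "Water is wet."])
print(marked)
# sent0: The sky is blue. sent1: Water is wet.
```

At generation time, constraining the decoder to emit marker tokens (rather than free-form text) is what makes the rationale extractive by construction.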
4. Practical Applications and Evidence-Focused Extensions
FiD and its descendants are a foundation for state-of-the-art retrieval-augmented generation (RAG) across tasks:
- Open-Domain Question Answering (ODQA): Multi-document fusion enables high-fidelity answer generation with explicit provenance tracking across retrieved evidence.
- Fact Verification: Passage-level and sentence-level evidence annotation support granular attribution and increased robustness to spurious context (Choi et al., 2024, Wang et al., 2023).
- Explainable NLP (FiD-Ex): Enforces extractive rationales, preventing fabricated explanations by restricting outputs to source sentences marked with unique identifiers (Lakhotia et al., 2020).
- Evaluation Benchmarks: FiD-Light and FiDO report state-of-the-art scores on KILT tasks such as HotpotQA, FEVER, T-REx, TriviaQA, zsRE, and Wizard of Wikipedia, for both answer-quality and provenance metrics (Hofstätter et al., 2022, Jong et al., 2022).
5. Performance Trade-Offs and Benchmarking
The following summarizes latency–accuracy trade-offs and memory constraints:
| Model | Inference Time (ms/sample) | NQ EM | Speedup vs. FiD-Base |
|---|---|---|---|
| FiD-Base | 102 | 46.5 | 1.0× |
| FiD-Base+LSA | 29 | 45.8 | 3.5× |
| FiD-Base+LSA+MQA | 7 | 48.2 | 14.6× |
| FiDO-Base/XL | 15 | 48.2 | 6.8× |
| FiD-Light (k=8) | 100–130 | ≥90% of FiD | ~5–6.5× (vs. its own 650 ms FiD baseline) |

Note that the FiD-Light row comes from a different profiling setup (Hofstätter et al., 2022) than the FiDO rows (Jong et al., 2022), so absolute timings across rows are not directly comparable.
Increasing model scale in FiD-Light and FiDO variants not only compensates for the minor accuracy drop due to compression or pruning but also yields improved results at a fraction of the original runtime or resource cost (Jong et al., 2022, Hofstätter et al., 2022).
6. Robustness, Provenance, and Reranking
- Source-Pointer Re-Ranking (FiD-Light): Generates passage indices as part of the textual answer and reorders candidate passages based on those indices, improving passage-level R-Precision by 2–6 absolute points (e.g., TriviaQA: 34.1%→37.6%) without additional parameters or retraining (Hofstätter et al., 2022).
- Granular Supervision (MGFiD, RFiD): By training auxiliary heads for passage and sentence-level evidence, models become less prone to spurious context and more robust to variations in retrieval, distribution shifts, and noisy negatives. The anchor vector and rationale-guided embeddings yield additional marginal gains (Choi et al., 2024, Wang et al., 2023).
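The source-pointer mechanism above can be sketched as a post-processing step. The `[N]` marker format is an assumption of this illustration; the idea is only that cited passage indices parsed from the generated text reorder the candidate list:

```python
import re

def rerank_by_source_pointers(passages, generated):
    # FiD-Light-style sketch: parse passage indices like "[2] [0]" emitted
    # before the answer and move those passages to the front, keeping the
    # remaining candidates in their original order.
    cited = [int(m) for m in re.findall(r"\[(\d+)\]", generated)]
    cited = [i for i in dict.fromkeys(cited) if 0 <= i < len(passages)]
    rest = [i for i in range(len(passages)) if i not in cited]
    return [passages[i] for i in cited + rest]

ranked = rerank_by_source_pointers(["p0", "p1", "p2", "p3"],
                                   "[2] [0] answer: Paris")
print(ranked)  # ['p2', 'p0', 'p1', 'p3']
```

Because the pointers come from the generator itself, this re-ranking needs no extra parameters or retraining, matching the description above.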
7. Limitations, Extensions, and Future Research
- Pretraining Dependency: FiDO requires pretraining from scratch, precluding simple conversion of existing T5 checkpoints (Jong et al., 2022).
- Context Length Scaling: Despite architectural optimizations, very large retrieval batches or extreme output lengths may require dynamic adaptation of decoder cross-attention sparsity or pruning schedule.
- Batch Size Sensitivity: At very low batch sizes, memory bandwidth remains a constraint even with the most efficient architectures.
- Extensibility: Research directions include integration with improved retrievers (DPR, reranking), memory augmentation, knowledge distillation, and dynamic attention sparsification (Jong et al., 2022, Berchansky et al., 2023, Hofstätter et al., 2022).
FiD and its optimized variants define the state-of-the-art paradigm for high-throughput, evidence-grounded text generation over retrieved document sets, with a well-quantified efficiency-effectiveness Pareto frontier and robust mechanisms for evidence control and explainability (Jong et al., 2022, Hofstätter et al., 2022, Choi et al., 2024).