ConFiQA Benchmark
- ConFiQA Benchmark is a rigorous evaluation protocol that measures LLMs’ context-faithfulness by pitting counterfactual retrieved contexts against the models’ memorized knowledge.
- It employs a structured dataset of 18,000 examples spanning single-hop (QA), multi-hop reasoning (MR), and multi-conflict (MC) questions built from Wikidata triples and highly viewed Wikipedia pages.
- Alignment techniques such as Context-DPO and ContextFocus significantly improve fidelity by shifting the model’s focus from internal memory to retrieved context.
The ConFiQA (Context-Faithfulness Question Answering) benchmark is a rigorous evaluation protocol designed to quantify the ability of LLMs to faithfully follow retrieved context, especially when this context is engineered to conflict with the model’s own internal (parametric) knowledge. ConFiQA directly targets the “knowledge-conflict” regime in Retrieval-Augmented Generation (RAG): models are challenged with contexts containing counterfactual facts and assessed on whether they generate answers faithful to these contexts rather than relying on their pre-trained factual memory. The benchmark serves both as a quantitative yardstick and as a diagnostic tool for recent context-alignment methods, such as Context-DPO and ContextFocus, which seek to mitigate the stubborn reliance of LLMs on memorized facts.
1. Dataset Construction and Structure
ConFiQA derives its factual seed triples from Wikidata, targeting prominent entities with high public visibility. Specifically, the corpus includes 5,042 head entities (sourced from the top 1,000 most-viewed Wikipedia pages) and 30,295 one-to-one triples spanning 41 relation types (e.g., P17 “country,” P36 “capital,” P35 “head of state”). For each (head, relation, tail) triple, GPT-4 is used to generate approximately 100 words of factual context that states the triple’s fact in natural language.
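For orientation, a seed record and context-generation prompt might look like the following Python sketch; the field names and prompt wording are this article’s illustration, not the benchmark’s released pipeline.

```python
# Hypothetical seed-triple record and context-generation prompt (illustrative only).
seed_triple = {
    "head": "United States",       # entity drawn from a top-viewed Wikipedia page
    "relation": "capital (P36)",   # one of the 41 one-to-one Wikidata relation types
    "tail": "Washington, D.C.",
}

PROMPT_TEMPLATE = (
    "Write roughly 100 words of encyclopedic text that clearly states that "
    "the {relation} of {head} is {tail}."
)

def build_context_prompt(triple: dict) -> str:
    """Render a context-generation prompt for one seed triple."""
    relation_label = triple["relation"].split(" (")[0]   # "capital (P36)" -> "capital"
    return PROMPT_TEMPLATE.format(
        relation=relation_label, head=triple["head"], tail=triple["tail"]
    )

print(build_context_prompt(seed_triple))
```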
Questions in ConFiQA fall into three categories:
- QA (“single-hop”): Direct queries (e.g., “What is the capital of the United States?”) derived from individual triples.
- MR (“multi-hop reasoning”): Questions constructed across chains of two, three, or four triples, requiring compositional reasoning. Intermediate (bridge) entities are omitted from the question.
- MC (“multi-conflict”): Multi-hop chains wherein every hop is replaced with a counterfactual, forcing reasoning over fully confounded contexts.
Knowledge conflicts are systematically introduced:
- For QA: the tail entity is replaced with a counterfactual entity of the same type, and all appearances of the original tail, including aliases and variations, are swapped in the context (see the sketch after this list).
- For MR: only one hop in the factual chain is corrupted.
- For MC: every hop is confounded with counterfactual edits.
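A minimal sketch of this substitution step, assuming simple string replacement over alias lists (the real pipeline may draw aliases from Wikidata and handle surface forms more carefully):

```python
import re

def substitute_counterfactual(context: str, original_aliases: list[str],
                              counterfactual: str) -> str:
    """Swap every surface form of the original tail entity (including aliases)
    for the counterfactual entity inside the generated context."""
    # Longest aliases first so "Washington, D.C." is replaced before "Washington".
    for alias in sorted(original_aliases, key=len, reverse=True):
        context = re.sub(re.escape(alias), counterfactual, context)
    return context

factual = ("Washington, D.C. has served as the capital of the United States "
           "since 1800; the city of Washington hosts the federal government.")
counterfactual_context = substitute_counterfactual(
    factual,
    original_aliases=["Washington, D.C.", "Washington"],
    counterfactual="Chicago",   # counterfactual tail of the same entity type
)
print(counterfactual_context)
```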
ConFiQA comprises exactly 6,000 examples per split (18,000 total), with the MR and MC splits evenly partitioned across 2-, 3-, and 4-hop chains.
2. Annotation Protocol and Labeling
Manual annotation is unnecessary for ConFiQA; each example is automatically paired with two reference answers:
- Faithful (context-derived) answer: follows the counterfactual path(s).
- Stubborn (model-knowledge) answer: follows the factual path(s).
The input (the counterfactual context concatenated with the question) together with these two answers defines a “preference pair,” distinguishing strictly between answers generated by contextual reasoning and those arising from the model’s pretrained memory. Negations or hedged corrections are excluded from faithful matches to avoid accidental alignment.
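A minimal sketch of how such a preference pair can be represented and assembled in Python (field names are illustrative, not the benchmark’s released schema):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One ConFiQA-style preference pair: a single input paired with a
    context-faithful (chosen) answer and a stubborn/memorized (rejected) answer."""
    prompt: str     # counterfactual context concatenated with the question
    chosen: str     # faithful answer following the counterfactual path
    rejected: str   # stubborn answer following the model's parametric memory

def make_pair(context: str, question: str,
              faithful_answer: str, stubborn_answer: str) -> PreferencePair:
    return PreferencePair(
        prompt=f"{context}\n\nQuestion: {question}",
        chosen=faithful_answer,
        rejected=stubborn_answer,
    )

pair = make_pair(
    context="Chicago has served as the capital of the United States since 1800. ...",
    question="What is the capital of the United States?",
    faithful_answer="Chicago",
    stubborn_answer="Washington, D.C.",
)
```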
3. Task Definition
ConFiQA’s fundamental task is retrieval-following open-domain QA under explicit knowledge conflict. For each input (context, question), the model must generate a free-form answer. The outcome is classified based on whether the model outputs:
- The counterfactual (context-derived) answer,
- The original (memorized) answer,
- Or any other residual answer.
This formulation supports both binary faithfulness classification (faithful vs. stubborn) and exact-match assessment.
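A minimal sketch of this three-way outcome classification, assuming alias lists for both the counterfactual and original answers (a production scorer would also filter negations and hedged corrections, as noted above):

```python
def classify_output(output: str, faithful_aliases: list[str],
                    original_aliases: list[str]) -> str:
    """Label a free-form model answer as 'faithful' (context-derived counterfactual),
    'stubborn' (memorized original), or 'other'."""
    text = output.lower()
    if any(alias.lower() in text for alias in faithful_aliases):
        return "faithful"
    if any(alias.lower() in text for alias in original_aliases):
        return "stubborn"
    return "other"

print(classify_output("The capital is Chicago.",
                      faithful_aliases=["Chicago"],
                      original_aliases=["Washington, D.C.", "Washington"]))  # -> faithful
```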
4. Evaluation Metrics
ConFiQA employs four strict metrics to measure context-faithfulness:
- Context-faithful precision p_c: proportion of outputs matching the counterfactual (context-derived) answer or its allowed aliases.
- Original-answer precision p_o: proportion of outputs matching the model’s memorized answer or its aliases.
- Reluctance measure R: fraction of definitive answers in which the model refuses the context, R = p_o / (p_o + p_c).
- Exact match (EM): strict rate at which the output exactly equals the context-derived answer string.
Standard metrics (accuracy, span-F1) are not used; these custom measures focus solely on answer source fidelity.
An alternative notation, adopted in “ContextFocus” (Anand et al., 7 Jan 2026), uses p_o (original-answer rate), p_s (substituted/contextual-answer rate), and R (reluctance ratio); these correspond exactly to the p_o, p_c, and R defined above.
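Under these definitions, the metrics can be aggregated from per-example labels as in the sketch below (the p_c/p_o/R naming mirrors this article’s notation, not any official scoring script):

```python
def confiqa_metrics(records: list[dict]) -> dict:
    """Aggregate context-faithfulness metrics from per-example records of the form
    {"label": "faithful" | "stubborn" | "other", "exact_match": bool}."""
    n = len(records)
    n_faithful = sum(r["label"] == "faithful" for r in records)
    n_stubborn = sum(r["label"] == "stubborn" for r in records)
    p_c = n_faithful / n                      # context-faithful precision
    p_o = n_stubborn / n                      # original-answer precision
    definitive = n_faithful + n_stubborn
    R = n_stubborn / definitive if definitive else 0.0   # reluctance = p_o / (p_o + p_c)
    em = sum(r["exact_match"] for r in records) / n      # strict exact match
    return {"p_c": p_c, "p_o": p_o, "R": R, "EM": em}
```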
5. Baseline Performance and Contextual Alignment Techniques
Initial evaluation of open-source LLMs (Llama2-7B-chat, Llama3-8B-instruct, Mistral-7B-instruct, Qwen2-7B-instruct) on ConFiQA revealed a marked tendency to revert to internal knowledge:
- Base-model p_c: typically in the 20–60% range across splits.
- Original-answer precision p_o: frequently 30–40% or higher, with reluctance R around 50% or above.
For example, Mistral-7B-instruct (Base) scored p_c = 39.3%, p_o = 40.5%, R = 50.8%, and EM = 0.3% on the QA split.
Application of Context-DPO (Bi et al., 2024), a direct preference optimization method that aligns models to prefer context-faithful outputs, produced substantial gains (a minimal loss sketch follows these results):
- Llama2-7B-chat rose from p_c = 61.5% to 92.3% on QA.
- Qwen2-7B-instruct increased from p_c = 24.0% to 74.3% (QA).
- Other models demonstrated relative improvements from 35% up to 280%, with p_o and R dropping commensurately.
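Concretely, the alignment step reduces to the standard DPO objective applied to ConFiQA’s preference pairs, with the faithful answer as the chosen response and the stubborn answer as the rejected one. The sketch below is a generic DPO loss, not the authors’ training code; the β value and function signature are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on (faithful = chosen, stubborn = rejected) pairs.
    Inputs are the summed log-probabilities of each answer under the policy
    being trained and under the frozen reference model."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```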
ContextFocus (Anand et al., 7 Jan 2026), which achieves faithfulness via activation steering rather than fine-tuning, also delivers best-in-class results (e.g., p_c = 70.9% and R = 11.6% on QA, surpassing Context-DPO and COIECD baselines).
| Method | Subset | p_c (%) | p_o (%) | R (%) |
|---|---|---|---|---|
| Base | QA | 35.3 | 32.3 | 47.8 |
| ContextDPO | QA | 58.4 | 18.6 | 24.2 |
| ContextFocus | QA | 70.9 | 9.3 | 11.6 |
| ContextFocus | MC | 53.3 | 10.2 | 16.1 |
6. Interpretability, Efficiency, and Broader Analyses
Deeper analyses of model behavior post-alignment provide insights into the mechanism of context-faithfulness improvement:
- Knowledge-token capturing: algorithmically identifying the tokens at which the context-derived and parametric answers diverge shows that alignment shifts the model logits by 16.8–21.0 points in favor of context-faithful tokens, promoting them into the top softmax ranks.
- Activation-steering observations (“ContextFocus”): layer selection is critical; the largest faithfulness gains come from steering intermediate layers (Llama-8B: layer 13, Mistral-7B: layer 11). Steering vectors estimated from as few as 1,500 examples already exhibit high cosine similarity, evidencing rapid convergence and data efficiency.
- Fluency and generative capability: both Context-DPO and ContextFocus preserve baseline fluency and factual ability, with metrics such as Natural Questions accuracy (93%) and TruthfulQA scores changing by less than 1%.
Context ablations show that activation steering works best when the steering vector contrasts both the context and the system-instruction signals, rather than isolating either alone.
Efficiency studies confirm that activation steering in ContextFocus incurs negligible inference latency, matching standard forward-pass speed and running faster than contrastive decoding baselines.
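To make the activation-steering idea concrete, the sketch below adds a fixed steering vector to the residual stream at one intermediate decoder layer of a Llama-style Hugging Face model during inference. This is a generic illustration of the technique under that architectural assumption, not the ContextFocus implementation; the layer index and scaling factor are placeholders.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor,
                      alpha: float = 1.0):
    """Register a forward hook that adds `alpha * steering_vector` to the hidden
    states emitted by one decoder layer (Llama-style `model.model.layers`)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vector.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    layer = model.model.layers[layer_idx]        # e.g. an intermediate layer such as 13
    return layer.register_forward_hook(hook)     # keep the handle; call .remove() to undo
```

The steering vector itself would be estimated offline, for example as the mean difference between hidden activations with and without the counterfactual context and faithfulness instruction, mirroring the contrastive construction described above.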
7. Significance and Applications
ConFiQA provides a controlled, large-scale environment to systematically interrogate LLMs’ behavior under retrieval-induced knowledge conflict. It enables granular comparison of context-following mechanisms, such as prompt engineering, supervised fine-tuning, direct preference optimization (Context-DPO), and activation steering (ContextFocus). A plausible implication is the benchmark’s utility for real-world RAG systems where external evidence may supersede pre-trained knowledge—for instance, in time-sensitive open-domain QA or domain-adaptive dialogue.
By rigorously quantifying reluctance to context and improvements in faithfulness, ConFiQA enforces stricter standards for LLM deployment in retrieval-augmented and update-demanding settings, driving advances in model alignment techniques capable of bridging the gap between static parametric memory and dynamic external information (Bi et al., 2024, Anand et al., 7 Jan 2026).