FACTUM: Framework for Citation Trustworthiness
- FACTUM is a mechanistic framework that quantitatively assesses citation trustworthiness by dissecting internal Transformer components like attention and FFN pathways.
- It employs four interpretable scores—Parametric Force, Context Alignment, Beginning-of-Sentence Attention, and Pathway Alignment—to robustly identify and mitigate citation hallucinations.
- The framework integrates these signals into lightweight detectors, achieving significant improvements in token-level accuracy, precision, recall, and AUC over previous methods.
FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms) is a methodology for mechanistic detection of citation hallucination in long-form Retrieval-Augmented Generation (RAG) systems. It quantitatively characterizes the trustworthiness of generated citations by probing and aggregating internal model signals, specifically decomposing the contributions of attention and feed-forward network (FFN) pathways, as well as their geometric alignment. FACTUM directly addresses the challenge of “citation hallucinations,” where an LLM confidently cites sources that do not actually support the claims being made. By leveraging four interpretable mechanistic scores—Parametric Force, Context Alignment, Beginning-of-Sentence Attention, and Pathway Alignment—FACTUM offers scale-robust, quantifiable, and explainable guidance for both citation attribution and quality control in RAG pipelines (Dassen et al., 9 Jan 2026).
1. Formal Problem Definition: Citation Hallucination in RAG
FACTUM is designed for RAG systems in which a user query and a set of retrieved documents $D = \{d_1, \dots, d_N\}$ yield a generated long-form response containing inline citations of the form “[Source: $i$]”, where $i$ indexes into $D$. Each citation-index token in the output is associated with a binary ground-truth label $y \in \{0, 1\}$:
- $y = 1$ if document $d_i$ supports the directly preceding factual claim.
- $y = 0$ otherwise (i.e., a hallucinated or mis-attributed citation).
The detection task is to estimate $y$ at each citation token, using the available inputs and internal model states. The goal is accurate, token-level identification of correct versus hallucinated citations, in contrast to black-box or post-hoc truthfulness scoring (Dassen et al., 9 Jan 2026).
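To make the token-level setup concrete, the sketch below locates citation-index spans in a generated response. The surface pattern and the helper function are illustrative assumptions, not artifacts from the paper; aligning character spans to tokenizer positions is left out.

```python
import re

# Hypothetical sketch: locate "[Source: i]" citation spans in a response.
CITATION_RE = re.compile(r"\[Source:\s*(\d+)\]")

def citation_spans(response: str):
    """Yield (char_start, char_end, cited_document_index) per citation."""
    for m in CITATION_RE.finditer(response):
        yield m.start(), m.end(), int(m.group(1))
```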
2. Mechanistic Scores: Definitions and Intuitions
FACTUM decomposes the Transformer's residual updates at each generation step and each layer into multi-head attention (MHA) and FFN pathway components. Four mechanistic scores are defined:
a) Parametric Force Score (PFS):
Measures the magnitude of the FFN update vector, quantifying how strongly parametric (model-memory) information is introduced.
- Definition: $\mathrm{PFS}^{(l)} = \lVert v_{\mathrm{ffn}}^{(l)} \rVert_2$, where $v_{\mathrm{ffn}}^{(l)} = x_{\mathrm{post\text{-}FFN}}^{(l)} - x_{\mathrm{pre\text{-}FFN}}^{(l)}$ is the FFN update to the residual stream at the citation token in layer $l$.
High PFS suggests model output is being “pushed” toward stored (parametric) facts.
b) Context Alignment Score (CAS):
Assesses how well the generated citation token semantically aligns to retrieved evidence, via attention pathways.
- Definition: $\mathrm{CAS}^{(l,h)} = \cos\!\big(c^{(l,h)},\, h_i\big)$, where $c^{(l,h)} = \sum_{j \in T_C} A^{(l,h)}_{i,j}\, h_j$, the $h_j$ are final-layer hidden states, and $T_C$ is the set of context (retrieved-document) tokens.
High CAS indicates strong coupling between the citation and supporting document context.
c) Beginning-of-Sentence Attention Score (BAS):
Quantifies attention mass allocated to the sentence-initial token (information synthesis “sink”).
- Definition: $\mathrm{BAS}^{(l,h)} = A^{(l,h)}_{i,0}$, the attention weight from citation token $i$ to the sentence-initial token.
High BAS is associated with greater internal synthesis prior to citation generation.
d) Pathway Alignment Score (PAS):
Computes the cosine similarity between MHA and FFN update vectors.
- Definition: $\mathrm{PAS}^{(l)} = \cos\!\big(v_{\mathrm{attn}}^{(l)},\, v_{\mathrm{ffn}}^{(l)}\big)$, where $v_{\mathrm{attn}}^{(l)}$ is the MHA update to the residual stream at the citation token.
PAS ≈ +1: cooperative pathway usage; PAS ≈ 0: orthogonal updates; PAS ≈ −1: antagonistic pathway updates.
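A toy numeric example illustrates how to read PAS; the vectors below are illustrative values, not real residual updates.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_attn = np.array([1.0, 0.2])        # toy MHA residual update
v_ffn_coop = np.array([0.9, 0.3])    # nearly parallel FFN update
v_ffn_orth = np.array([-0.2, 1.0])   # roughly orthogonal FFN update

print(cosine(v_attn, v_ffn_coop))    # ~0.99 -> cooperative (PAS near +1)
print(cosine(v_attn, v_ffn_orth))    # 0.0   -> orthogonal (PAS near 0)
```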
3. Practical Pipeline for Score Measurement
Extracting mechanistic scores in FACTUM entails instrumenting the Transformer at inference time:
- Register hooks on all MHA and FFN sublayers; capture attention weights $A^{(l,h)}$ and pre-/post-FFN hidden states at the citation-index token position (a minimal hook sketch follows this list).
- For each response, identify citation tokens and compute PFS, CAS, BAS, and PAS for every layer and attention head as applicable.
- Aggregate scores through component pruning (ranking heads/layers by correlation with ground truth), head aggregation (mean/std over heads), and layer aggregation (summary statistics: mean, slope, frequency bandpower); a layer-aggregation sketch follows the reference pseudocode below.
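A minimal hook-registration sketch, assuming a Hugging Face LlamaForCausalLM-style module layout (`model.model.layers[l].self_attn` / `.mlp`); the checkpoint name and storage scheme are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; any Llama-style model with .model.layers works similarly.
name = "meta-llama/Llama-3.2-3B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

captured = {"attn_out": {}, "ffn_out": {}}   # layer index -> hidden states

def make_hook(store, layer_idx):
    def hook(module, inputs, output):
        # Attention sublayers return a tuple (hidden_states, weights, ...).
        hidden = output[0] if isinstance(output, tuple) else output
        store[layer_idx] = hidden.detach()
    return hook

for l, block in enumerate(model.model.layers):
    block.self_attn.register_forward_hook(make_hook(captured["attn_out"], l))
    block.mlp.register_forward_hook(make_hook(captured["ffn_out"], l))

inputs = tok("Example response [Source: 3]", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True, output_hidden_states=True)
attention_weights = out.attentions   # per layer: (batch, heads, seq, seq)
hidden_states = out.hidden_states    # residual stream entering each layer
```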
A reference pseudocode implementing these steps is provided in (Dassen et al., 9 Jan 2026) and is reproduced here for clarity:
```
for each generated response R:
    identify all citation-index tokens t_i
    for each t_i:
        for each layer l:
            for each head h:
                A = attention_weights[l][h][i]   # attention row of token i
                BAS[l, h] = A[0]                 # mass on sentence-initial token
                # T_C: context-token indices, i.e. j where is_context_token[j]
                c = sum_j(A[j] * h_final[j]) over j in T_C
                CAS[l, h] = cosine(c, h_final[i])
            # Residual-stream updates at token i, layer l
            x_input    = hidden_state_before_layer[l][i]
            v_attn     = hidden_state_after_attn[l][i] - x_input
            x_pre_ffn  = hidden_state_after_attn[l][i]
            x_post_ffn = hidden_state_after_ffn[l][i]
            v_ffn      = x_post_ffn - x_pre_ffn
            PFS[l] = norm2(v_ffn)                # Parametric Force Score
            PAS[l] = cosine(v_attn, v_ffn)       # Pathway Alignment Score
```
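The per-layer score trajectories are then condensed into fixed-length features. Below is a minimal sketch of the layer-aggregation statistics named above (mean, slope, frequency bandpower); the exact frequency band and normalization are assumptions.

```python
import numpy as np

def layer_features(scores: np.ndarray) -> np.ndarray:
    """Summarize one score's per-layer trajectory (e.g., PFS over layers)."""
    layers = np.arange(len(scores))
    mean = scores.mean()
    slope = np.polyfit(layers, scores, deg=1)[0]      # linear trend over depth
    power = np.abs(np.fft.rfft(scores - mean)) ** 2   # spectrum without DC term
    bandpower = power[1:4].sum() / max(power.sum(), 1e-12)  # low-freq share
    return np.array([mean, slope, bandpower])
```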
Feature vectors composed from these mechanistic readouts form the basis for downstream classification (e.g., via Logistic Regression, Explainable Boosting Machines, LightGBM).
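A lightweight detector can be fit directly on these feature vectors; the following scikit-learn sketch uses the Logistic Regression option named above, with illustrative hyperparameters and assumed pre-built arrays `X_train`, `y_train`, `X_test`, `y_test`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# X: (n_citation_tokens, n_features) aggregated FACTUM features
# y: 1 = supported citation, 0 = hallucinated (labels from Section 1)
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```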
4. Model Scale Dependence and Mechanistic Signature Evolution
Citation trustworthiness signals in FACTUM are not static; their structure changes with model scale:
- For Llama-3.2-3B, correct citations exhibit statistically significant increases in all four FACTUM scores (CAS, BAS, PFS, PAS), supporting a “full-agreement” signature.
- For Llama-3.1-8B, optimal performance is obtained by pruning to the top 25% most predictive heads/layers; in this regime, correct citations show strong PFS and BAS but either variable or decreased PAS, implying more specialized/orthogonal information transfer between MHA and FFN.
- CAS for larger models is also less reliable: signal directionality reverses or weakens for correct versus hallucinated citations.
These observations confirm that citation attribution is an emergent, scale-sensitive phenomenon. Regular re-calibration of the component selection and aggregation strategy is advised after model updates or fine-tuning (Dassen et al., 9 Jan 2026).
5. Key Experimental Findings and Quantitative Performance
Evaluated on token-level classification of citation indices for TREC NeuCLIR 2024 (a 15-document long-context RAG task), FACTUM demonstrates robust, statistically significant improvements over prior baselines (ReDeEP ECS+PKS, uncertainty-based scores):
| Model | Baseline | Baseline AUC | FACTUM AUC | Relative Improvement |
|---|---|---|---|---|
| Llama-3.2-3B | ECS+PKS (ReDeEP) | ∼ 0.61 | 0.715–0.737 | +37.5% |
| Llama-3.2-3B | CAS+PFS | ∼ 0.67 | 0.715–0.737 | +17–21% |
| Llama-3.1-8B | ECS+PKS | 0.54–0.61 | 0.69–0.74 | +13–28% |
| Llama-3.1-8B | CAS+PFS | ∼ 0.68 | 0.69–0.74 | +1.5–7% |
All improvements hold under FDR-adjusted significance tests. Precision, recall, and F1 also increase, confirming a balanced reduction in both hallucinated and missed citations (Dassen et al., 9 Jan 2026).
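For reference, Benjamini–Hochberg FDR adjustment of per-comparison p-values can be done with statsmodels; the p-values below are placeholders, not results from the paper.

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.004, 0.012, 0.03, 0.18]   # placeholder per-comparison p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject, p_adj)
```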
6. Integration and Best Practices for RAG Systems
FACTUM provides a unified workflow for scalable, model-internal citation auditing:
- Integrate via forward hooks in production RAG systems to compute PFS, CAS, BAS, PAS at generation.
- Use distilled features in a lightweight detector (Logistic Regression/EBM).
- Flag citations the detector predicts as unsupported ($\hat{y} = 0$) for downstream “guardrail” interventions: warning users, triggering re-retrieval, or blocking low-trust attributions (see the sketch after this list).
- For small models (3B), rely on the full-agreement signature across all four scores. For large models (8B), apply component pruning and weight force/orthogonality signals over simple pathway agreement.
- Re-validate and re-select components post-fine-tuning, as attention/activation structure may shift.
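A guardrail sketch mapping detector confidence to an intervention; the thresholds and the action set are illustrative assumptions, not values from the paper.

```python
from enum import Enum

class Action(Enum):
    PASS = "pass"
    RERETRIEVE = "re-retrieve"
    BLOCK = "block"

def guardrail(p_supported: float, pass_at: float = 0.5,
              block_below: float = 0.2) -> Action:
    """Map the detector's support probability for one citation to an action."""
    if p_supported >= pass_at:
        return Action.PASS
    if p_supported >= block_below:
        return Action.RERETRIEVE   # low trust: warn the user and re-retrieve
    return Action.BLOCK            # very low trust: block the attribution
```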
These steps ensure robust, explainable detection of trust breakdowns in citation, addressing nuanced failure modes beyond mere “parametric over-reliance” and supporting systematic verifiability in production LLM-driven knowledge workflows.
7. Broader Implications for Trustworthy and Attested Citation
FACTUM fundamentally reframes citation hallucination as a complex, evolving dynamic across multiple neural mechanisms, rather than a single-pathway or scalar-confidence issue. Because the mechanistic signals (parametric force, context alignment, beginning-of-sentence attention, and pathway alignment) are exposed directly, citation trustworthiness becomes auditable and model-interpretable.
In the broader context of the FACTUM research program, the approach complements generative frameworks (such as RAEL and Intralign (Shen et al., 21 Apr 2025)) and fact-verification-centric pipelines (e.g., VeriFact-CoT (García et al., 6 Sep 2025)) by providing orthogonal, actionable evidence tied to the model's computational substrate. This mechanistic transparency is essential for deploying RAG systems in regulated/critical domains, enabling not just trust, but attestable provenance and error detection in AI-assisted citation generation.