How Much Do LLMs Hallucinate in Document Q&A? A 172-Billion-Token Study

This presentation examines a large-scale study of hallucination in language models performing document question-answering tasks. Using the RIKER framework, a ground-truth-first evaluation methodology, researchers tested 35 open-weight models across 172 billion tokens, varying context lengths up to 200K tokens, temperatures, and hardware platforms. The study finds that no current model is free from hallucination, that fabrication rates climb as context grows, and that there is a surprising disconnect between a model's ability to retrieve correct facts and its tendency to invent false ones.
Script
Even the best language models fabricate facts they've never seen. At 32 thousand tokens, top models hallucinate over 1 percent of the time. Stretch that context to 200 thousand tokens, and not a single model stays below 10 percent fabrication.
Traditional benchmarks suffer from contamination and annotation errors. The researchers inverted the usual process: first specify the ground truth, then generate documents designed to test it. This allowed them to probe hallucination at unprecedented scale with deterministic, reproducible answers.
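To make the inversion concrete, here is a minimal sketch of a ground-truth-first test case in Python. The fact schema, document template, and abstention convention are illustrative assumptions, not the actual RIKER generator.

```python
import random

def build_case(seed: int):
    """Specify the ground truth first, then synthesize a document
    around it, so every expected answer is derivable from the seed."""
    rng = random.Random(seed)
    truth = {"entity": f"Project-{rng.randint(100, 999)}",
             "budget": rng.randint(1, 500) * 1000}
    document = (f"Quarterly report. {truth['entity']} was approved "
                f"with a budget of {truth['budget']} dollars.")
    # Grounding probe: the answer appears verbatim in the document.
    grounding = (f"What budget was approved for {truth['entity']}?",
                 str(truth["budget"]))
    # Hallucination probe: the answer is absent from the document,
    # so the only correct response is an abstention (marked None).
    hallucination = (f"Who leads {truth['entity']}?", None)
    return document, grounding, hallucination
```

Because the answer key exists before the document does, scoring needs no human annotator, which is what makes the answers deterministic and reproducible.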
The results reveal patterns that challenge common assumptions about model behavior.
Here's the most counterintuitive finding: models that excel at extracting real facts simultaneously invent false ones at alarming rates. A model might score 90 percent on grounding but fabricate answers to 40 percent of hallucination probes. These are independent failure modes.
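A small scoring sketch (building on the probe structure above; the abstention keyword is a hypothetical convention) shows why the two rates are independent: they are computed over disjoint sets of questions.

```python
def score(replies, probes):
    """Compute grounding accuracy and fabrication rate separately.

    `probes` pairs each question with its expected answer; None marks
    a hallucination probe whose only correct reply is an abstention.
    """
    grounded = fabricated = n_ground = n_halluc = 0
    for reply, (_question, expected) in zip(replies, probes):
        if expected is not None:              # grounding probe
            n_ground += 1
            grounded += expected in reply
        else:                                 # hallucination probe
            n_halluc += 1
            fabricated += "not stated" not in reply.lower()
    return grounded / max(n_ground, 1), fabricated / max(n_halluc, 1)
```

Since the denominators never overlap, a model can reach 90 percent grounding while still fabricating on 40 percent of hallucination probes; improving one number does not mechanically move the other.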
Three forces shape hallucination rates. Longer contexts universally degrade performance, with some models collapsing from 7 percent fabrication at 32 thousand tokens to over 70 percent at 200 thousand. Model family trumps size—a well-trained smaller model can outperform a poorly aligned larger one. Surprisingly, hardware choice has almost no impact.
Conventional wisdom says to use temperature zero for factual tasks; the data says otherwise. Higher temperatures sometimes lower fabrication rates and dramatically reduce coherence failures, where models spiral into infinite repetition. One model showed 48 times more output loops at temperature zero than at higher settings.
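For readers wondering what counts as an output loop, here is a tiny detector sketch; the thresholds and the suffix-repetition heuristic are assumptions for illustration, not the study's instrumentation.

```python
def has_loop(tokens, max_period=50, repeats=3):
    """Flag degenerate repetition: the output ends with the same span
    of up to `max_period` tokens repeated `repeats` or more times."""
    for period in range(1, max_period + 1):
        tail = tokens[-period * repeats:]
        if len(tail) < period * repeats:
            break  # output too short for any longer period
        span = tail[:period]
        if all(tail[i:i + period] == span
               for i in range(0, len(tail), period)):
            return True
    return False

# e.g. has_loop(("I cannot answer that. " * 5).split()) -> True
```

At temperature zero, greedy decoding can lock into exactly this kind of repeated span, which is why raising the temperature reduces coherence failures.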
No open-weight model is hallucination-free in document question answering, and longer contexts make the problem worse. Fabrication resistance isn't a side effect of better retrieval—it's an independent challenge requiring deliberate training strategies. Visit EmergentMind.com to explore this research further and create your own video presentations.