Dissecting Atomic Facts: Visual Analytics for Improving Fact Annotations in Language Model Evaluation (2509.01460v1)

Published 1 Sep 2025 in cs.HC

Abstract: Factuality evaluation of LLM outputs requires decomposing text into discrete "atomic" facts. However, existing definitions of atomicity are underspecified, with empirical results showing high disagreement among annotators, both human and model-based, due to unresolved ambiguity in fact decomposition. We present a visual analytics concept to expose and analyze annotation inconsistencies in fact extraction. By visualizing semantic alignment, granularity and referential dependencies, our approach aims to enable systematic inspection of extracted facts and facilitate convergence through guided revision loops, establishing a more stable foundation for factuality evaluation benchmarks and improving LLM evaluation.

Summary

  • The paper introduces a visual analytics framework to systematically identify and correct fact annotation discrepancies between human and LLM outputs.
  • Methodology leverages SBERT embeddings, cosine similarity, and the Hungarian algorithm to address granularity and referential dependency challenges.
  • The approach aims to enhance LLM evaluation benchmarks by refining annotation guidelines and supporting iterative human-in-the-loop processes.

Dissecting Atomic Facts: Visual Analytics for Improving Fact Annotations in LLM Evaluation

Introduction

The paper "Dissecting Atomic Facts: Visual Analytics for Improving Fact Annotations in LLM Evaluation" presents an innovative approach to tackling inconsistencies in the evaluation of factuality in outputs generated by LLMs. The creation of atomic facts—defined as discrete, verifiable units of meaning—is central to assessing the factual accuracy of LLM outputs. However, the paper identifies a critical issue: the lack of a standardized definition for atomic facts, which leads to substantial disagreement among annotators. The authors propose a visual analytics framework to systematically identify and correct these inconsistencies, thereby improving the reliability of factuality benchmarks.

Methodological Insights

The authors describe a meticulous methodology to explore and rectify disagreements in fact annotation between human annotators and LLMs. They manually annotated a set of German administrative documents and compared the human annotations with fact lists produced by LLMs such as GPT-3.5, GPT-4, and Mistral-7B. They used SBERT embeddings to compute pairwise cosine similarities between fact lists, paired with the Hungarian algorithm for optimal fact assignment (a sketch of this step follows below). The paper identifies two primary areas of disagreement: granularity and referential dependencies. Human and model annotators often diverged in how they decomposed conjunctive and conditional structures, and in whether they replicated contextual elements such as conditions and entities.
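
The alignment step can be illustrated with a short sketch. The snippet below is a minimal illustration assuming the `sentence-transformers` and SciPy libraries; the model checkpoint and similarity threshold are our own illustrative choices, not values taken from the paper.

```python
# Sketch of the alignment step: embed two fact lists with SBERT, compute
# pairwise cosine similarities, and match facts one-to-one with the
# Hungarian algorithm. Checkpoint and threshold are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

def align_facts(facts_a: list[str], facts_b: list[str], threshold: float = 0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint
    # Normalized embeddings make the dot product equal to cosine similarity.
    emb_a = model.encode(facts_a, normalize_embeddings=True)
    emb_b = model.encode(facts_b, normalize_embeddings=True)
    sim = emb_a @ emb_b.T  # pairwise cosine similarity matrix

    # Hungarian algorithm: optimal one-to-one assignment maximizing similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    matches = [(i, j, float(sim[i, j])) for i, j in zip(rows, cols)
               if sim[i, j] >= threshold]  # keep only confident pairs
    unmatched_a = set(range(len(facts_a))) - {i for i, _, _ in matches}
    unmatched_b = set(range(len(facts_b))) - {j for _, j, _ in matches}
    return sim, matches, unmatched_a, unmatched_b
```

Facts left unmatched on either side are the interesting cases: they typically point to granularity differences (one annotator split a statement the other kept whole) or to facts that one annotator omitted entirely.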

Visual Analytics Framework

To address annotation divergence, the authors propose a comprehensive visual analytics (VA) system. This system includes:

  • Text-anchored fact highlighting: Facilitates direct comparison of extracted facts by associating them with original text highlights.
  • Semantic similarity heatmap: Provides a color-coded matrix of fact alignments for comparative analysis and disagreement detection (see the sketch after this list).
  • Fact count histogram: Displays the granularity levels across various annotators and models.
  • Knowledge graphs: Constructs entity-relation graphs for facts, aiding the identification of mismatches and uncertainties.
  • Branching logic visualization: Offers tree-structured depictions of conditionals and conjunctions for exploring semantic decomposition variants.
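
As a rough illustration of what the heatmap view might render, the sketch below (assuming matplotlib and the `sim`/`matches` output from the previous snippet) plots the similarity matrix with the optimal matching overlaid; the visual encoding is our assumption, not the paper's design.

```python
# Hypothetical rendering of the semantic similarity heatmap: one cell per
# (human fact, model fact) pair, colored by cosine similarity, with the
# Hungarian matching from the previous sketch marked on top.
import matplotlib.pyplot as plt

def plot_similarity_heatmap(sim, matches, facts_a, facts_b):
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(sim, cmap="viridis", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(facts_b)), labels=facts_b, rotation=90, fontsize=7)
    ax.set_yticks(range(len(facts_a)), labels=facts_a, fontsize=7)
    for i, j, _ in matches:  # mark matched fact pairs
        ax.scatter(j, i, marker="x", color="red")
    fig.colorbar(im, ax=ax, label="cosine similarity")
    fig.tight_layout()
    plt.show()
```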

This framework supports systematic annotation guideline refinement and convergence measurement, aimed at achieving high inter-annotator agreement (IAA). The proposed approach involves multiple iterations of guideline refinement and reannotation, ultimately integrating human-in-the-loop LLM-guided processes for enhanced accuracy and efficiency.
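
The summary does not spell out how convergence is quantified. One simple stand-in, building on the matching from the first sketch and offered here only as an assumption, is a precision/recall/F1 score over matched fact pairs, re-computed after each guideline revision to check that agreement is rising.

```python
# One plausible convergence metric (an assumption, not the paper's definition):
# treat Hungarian matches above the similarity threshold as "agreed" facts and
# report precision/recall/F1 between a reference list (facts_a) and a
# candidate list (facts_b).
def matching_f1(matches, n_facts_a: int, n_facts_b: int) -> dict[str, float]:
    agreed = len(matches)
    precision = agreed / n_facts_b if n_facts_b else 0.0  # over candidate facts
    recall = agreed / n_facts_a if n_facts_a else 0.0     # over reference facts
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```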

Implications and Future Directions

The development of a VA system as proposed in the paper holds significant implications for improving LLM evaluation metrics. By attaining consistent fact annotations, the approach establishes a stable foundation for factuality evaluation benchmarks such as FActScore. The authors aim to further enhance this system with features like contradiction detection and semantic clustering. If successful, the methodology could pave the way for more robust LLM evaluation procedures, aiding the development of models capable of producing outputs with improved factual accuracy in critical domains like healthcare and law.

Conclusion

In conclusion, the paper highlights the inherent challenges in defining and annotating atomic facts for LLM evaluation and proposes a VA-based workflow to address these challenges. By surfacing annotation inconsistencies and supporting iterative refinement, the proposed system aims to solidify the foundation for factuality evaluation. This work underscores the importance of precise annotation standards and the potential of visual analytics in achieving annotation convergence, thereby enhancing the interpretability and reliability of LLM outputs.
