
Facilitating Human-LLM Collaboration through Factuality Scores and Source Attributions

Published 30 May 2024 in cs.HC and cs.AI | arXiv:2405.20434v1

Abstract: While humans increasingly rely on LLMs, these models are susceptible to generating inaccurate or false information, also known as "hallucinations". Technical advancements have been made in algorithms that detect hallucinated content by assessing the factuality of the model's responses and attributing sections of those responses to specific source documents. However, there is limited research on how to effectively communicate this information to users in ways that help them appropriately calibrate their trust in LLMs. To address this issue, we conducted a scenario-based study (N=104) to systematically compare the impact of various design strategies for communicating factuality and source attribution on participants' ratings of trust, preferences, and ease of validating response accuracy. Our findings reveal that participants preferred a design in which phrases within a response were color-coded based on the computed factuality scores. Additionally, participants increased their trust ratings when relevant sections of the source material were highlighted or responses were annotated with reference numbers corresponding to those sources, compared to when they received no annotation in the source material. Our study offers practical design guidelines to facilitate human-LLM collaboration and promotes a new role for humans: to carefully evaluate and take responsibility for their use of LLM outputs.


Summary

  • The paper's main contribution is a systematic comparison of design strategies for communicating factuality scores and source attributions, aimed at helping users calibrate their trust in LLM-generated content.
  • It reports a scenario-based study (N=104) testing multiple design styles and linguistic granularities for their effects on trust, error validation, and user experience.
  • The study underscores the importance of interface flexibility and user training for effective human-LLM collaboration in real-world applications.


Introduction

The paper "Facilitating Human-LLM Collaboration through Factuality Scores and Source Attributions" addresses the challenges faced by users interacting with LLMs which are prone to generating hallucinations—factually incorrect information presented as truth. This research explores technical strategies, such as using factuality scores and source attributions, to improve user trust and collaboration with LLMs by effectively communicating the accuracy and origins of the model's responses.

Background and Motivation

LLMs, whether used for summarization, question answering, or other natural language tasks, often generate outputs that are not entirely faithful to the source data, which erodes user trust. Despite advancements in algorithmic techniques to identify and mitigate false content, effectively conveying this information to users remains a critical unsolved problem. The study by Do et al. investigates end-user-focused design strategies for representing factuality scores and source attributions, with the goal of improving trust calibration and encouraging appropriate use of AI-generated content.

Methodology

The researchers conducted a scenario-based study in which 104 participants evaluated different design strategies for presenting LLM responses. The design strategies tested included:

  • Factuality Score Styles:
  1. Highlight-all: Color coding of all text according to its factuality score.
  2. Highlight-threshold: Highlighting only text whose factuality score falls below a threshold.
  3. Score: Numeric factuality scores displayed with color-coded underlines.

These styles were examined at two linguistic granularities (word and phrase level).
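
The paper itself does not describe how these styles are implemented; as a rough illustration only, the sketch below shows one way the three factuality styles might be rendered as annotated HTML at the phrase level, assuming hypothetical per-phrase factuality scores in [0, 1] and an arbitrary color mapping.

```python
from html import escape

# Hypothetical per-phrase factuality scores in [0, 1]; higher means more factual.
phrases = [
    ("The Eiffel Tower is located in Paris", 0.97),
    ("and was completed in 1901", 0.32),
]

def score_to_color(score: float) -> str:
    """Map a factuality score to a red-to-green background color (illustrative only)."""
    hue = int(score * 120)  # 0 = red (low factuality), 120 = green (high factuality)
    return f"hsl({hue}, 80%, 80%)"

def render(phrases, style: str, threshold: float = 0.5) -> str:
    """Render phrases under one of the three styles from the study:
    'highlight-all', 'highlight-threshold', or 'score'."""
    parts = []
    for text, score in phrases:
        text = escape(text)
        color = score_to_color(score)
        if style == "highlight-all" or (style == "highlight-threshold" and score < threshold):
            parts.append(f'<span style="background:{color}">{text}</span>')
        elif style == "score":
            parts.append(
                f'<u style="text-decoration-color:{color}">{text}</u> <sup>{score:.2f}</sup>'
            )
        else:
            parts.append(text)  # un-highlighted text (above threshold or no markup)
    return " ".join(parts)

# Example: only the low-confidence phrase receives a highlight.
print(render(phrases, "highlight-threshold"))
```

Word-level granularity would apply the same mapping to individual tokens rather than whole phrases.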

  • Source Attribution Styles:
  1. Reference numbers: Inline citation numbers linking statements in the response to the corresponding sources.
  2. Highlight gradients: Gradient highlighting of source passages according to their importance to the response.

Each design was compared with a no-markup baseline on user trust, preference, and ease of validating response accuracy.
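
The source attribution styles can likewise be pictured with a small sketch. The attribution data below (which phrase is supported by which source passage) is hypothetical and stands in for the output of whatever attribution algorithm the interface relies on.

```python
# Hypothetical attribution output: each response phrase is paired with the index
# of the source passage that supports it.
response = [
    ("The tower opened to the public in 1889", 0),
    ("and now receives about seven million visitors a year", 1),
]
sources = [
    "Construction finished in March 1889 and the tower opened during the World's Fair.",
    "Recent annual visitor counts approach seven million.",
]

def reference_numbers(response) -> str:
    """Reference-number style: append an inline citation number to each phrase."""
    return " ".join(f"{text} [{source_idx + 1}]" for text, source_idx in response)

def highlighted_sources(response, sources) -> list[str]:
    """Source-highlight style: mark the passages that support the response
    (here with '>>' standing in for a visual highlight or gradient)."""
    cited = {source_idx for _, source_idx in response}
    return [(">> " + s) if i in cited else s for i, s in enumerate(sources)]

print(reference_numbers(response))
print("\n".join(highlighted_sources(response, sources)))
```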

Results and Analysis

Trust Ratings

All factuality design styles improved trust ratings over the baseline, with word-level granularity slightly outperforming phrase-level in fostering trust. The study also observed a notable calibration effect: participants' initial perception of accuracy strongly influenced how they adjusted their trust once factuality scores became visible.

Participants who had initially underestimated accuracy increased their trust, while those who had overestimated it decreased their trust upon exposure to the factuality information.

Ease of Validation

Ease of validating the accuracy of LLM responses varied across designs. Highlight-threshold and highlight-all at the word level compared favorably with the baseline, facilitating better error detection and validation.

Preference

Participants generally preferred the highlight-all design at phrase-level granularity, which they viewed as giving a comprehensive overview of potential inaccuracies. Among the source attribution methods, the reference-number strategy was marginally preferred, which participants attributed to its clarity and structured referencing (Figure 1).

Figure 1: Reference numbers

Implications and Recommendations

The study presents several implications for designing LLM interfaces:

  1. Highlighting Factuality: The highlight-all style is recommended for presenting factuality information and supporting calibrated user trust. For real-world applications, balancing comprehensive information with cognitive load is crucial.
  2. User Training: Trust calibration remains essential. Users must understand the potential inaccuracies in LLM outputs and rely on supplementary source attribution to verify content.
  3. Tool Flexibility: Interfaces should allow users to toggle highlighting options, adapting to user preferences and minimizing distraction during in-depth analyses (see the sketch below).
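
The third recommendation is straightforward to realize as user-controlled annotation layers. A minimal sketch follows; the setting names are invented for illustration, not taken from the paper.

```python
# Hypothetical annotation layers that a reader can switch on or off.
DEFAULT_LAYERS = {
    "factuality_highlights": True,   # color-coded factuality scores in the response
    "source_highlights": True,       # highlighted supporting passages in the sources
    "reference_numbers": False,      # inline citation numbers in the response
}

def active_layers(user_overrides=None):
    """Merge the user's toggles over the defaults, so highlighting can be
    switched off to reduce visual noise during in-depth reading."""
    layers = dict(DEFAULT_LAYERS)
    layers.update(user_overrides or {})
    return layers

# Example: a reader who wants citations but no factuality coloring.
print(active_layers({"factuality_highlights": False, "reference_numbers": True}))
```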

Conclusion

This paper contributes significant insights into improving human-LLM interaction through strategic UI designs that communicate factuality and source information. Future research could explore these designs in different cultural and contextual settings, or develop systems that adjust the presentation dynamically based on user behavior and feedback. Through these strategies, the study aims to help users interact with LLMs confidently, accurately, and responsibly, enhancing AI's applicability in society.
