Determine LLM Summarization Accuracy for Multi-Document Sensemaking

Determine how accurately large language models (LLMs) can summarize when analyzing multiple given documents in sensemaking tasks, ideally by evaluating their outputs against established ground-truth summaries to quantify performance.

Background

The paper investigates LLM-supported summarization within complex sensemaking scenarios, where analysts must synthesize connections across multiple documents. Prior work on LLM summarization has emphasized hallucination, similarity, and quality in open-ended tasks, but less emphasis has been placed on accuracy when models must integrate information across multiple given documents with a definitive ground truth.

To address this gap, the authors propose using an intermediate visual workspace to steer LLM summarization and employ a sensemaking dataset with ground-truth summaries to evaluate accuracy. The explicitly stated knowledge gap concerns the level of accuracy achievable by LLMs in multi-document sensemaking contexts, motivating systematic evaluation with appropriate benchmarks.
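The accuracy evaluation described above can be sketched as scoring each model summary against its ground-truth counterpart with an overlap metric. The paper does not prescribe a specific metric here, so the unigram-overlap F1 (ROUGE-1 style) and whitespace tokenization below are illustrative assumptions, not the authors' method:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (ROUGE-1 style) between a model summary and a
    ground-truth summary. Whitespace tokenization is a simplifying
    assumption; real evaluations would use a proper tokenizer and
    typically report several metrics."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: score a model summary against the ground truth.
truth = "the suspects met at the warehouse on tuesday"
model = "suspects met at a warehouse on tuesday night"
score = rouge1_f1(model, truth)  # → 0.75
```

In a full evaluation this score would be averaged over the dataset's ground-truth summaries to quantify multi-document summarization accuracy.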

References

"We lack an understanding of how accurately LLMs can summarize when analyzing multiple given documents in sensemaking tasks."

Steering LLM Summarization with Visual Workspaces for Sensemaking (2409.17289 - Tang et al., 2024) in Introduction