Navigating Rifts in Human-LLM Grounding
The paper "Navigating Rifts in Human-LLM Grounding: Study and Benchmark" addresses the limitations of LLMs in achieving effective conversational grounding with humans. Grounding, as defined in this context, refers to the communicative acts that establish mutual understanding between conversation participants. The researchers systematically analyze the grounding challenges in human-LLM interactions by examining datasets including WildChat, MultiWOZ, and Bing Chat, culminating in the development of a taxonomy of grounding acts and the Rifts benchmark, which assesses the performance of LLMs in scenarios necessitating grounding.
Grounding Challenges in LLMs
LLMs, while proficient at following explicit instructions, often lack the ability to engage collaboratively in dynamic dialogues, a skill critical for grounding. The core difficulty is that they rarely perform clarification and follow-up acts, which humans routinely use to resolve ambiguities and reach shared understanding. The paper found that LLMs initiate clarification roughly three times less often than humans and follow-up requests roughly sixteen times less often. This deficit in initiating grounding can lead to interaction breakdowns, ranging from frustrated users in everyday scenarios to more serious consequences in high-stakes settings.
Analysis and Insights
The authors developed a set of dialogue acts to evaluate grounding in human-LLM interactions, leading to the discovery of notable asymmetries. By annotating interaction logs, they showed that humans are more often forced to repair grounding failures than LLMs, which rarely attempt clarification preemptively. Rather than actively seeking or confirming the information needed for grounding, LLMs tend to generate verbose responses that often contain irrelevant material.
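To make the idea of annotating turns with grounding acts concrete, the following minimal Python sketch labels dialogue turns with simplified act categories. The act names and the keyword heuristic are illustrative assumptions only; they do not reproduce the paper's taxonomy or annotation procedure.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical, simplified grounding-act labels inspired by the taxonomy
# discussed above; the paper's actual categories and definitions may differ.
class GroundingAct(Enum):
    CLARIFICATION = auto()    # speaker asks to resolve an ambiguity
    FOLLOW_UP = auto()        # speaker requests further detail after a response
    ACKNOWLEDGEMENT = auto()  # speaker signals understanding
    NONE = auto()             # turn performs no explicit grounding work

@dataclass
class Turn:
    speaker: str  # "user" or "assistant"
    text: str

def label_turn(turn: Turn) -> GroundingAct:
    """Toy keyword heuristic for illustration only; real annotation would
    rely on trained annotators or a stronger classifier."""
    t = turn.text.lower()
    if "do you mean" in t or "could you clarify" in t:
        return GroundingAct.CLARIFICATION
    if "did that help" in t or "anything else" in t:
        return GroundingAct.FOLLOW_UP
    if t.startswith(("got it", "thanks", "ok")):
        return GroundingAct.ACKNOWLEDGEMENT
    return GroundingAct.NONE

dialogue = [
    Turn("user", "Write a summary of the attached report."),
    Turn("assistant", "Could you clarify which sections matter most to you?"),
]
print([label_turn(t).name for t in dialogue])  # ['NONE', 'CLARIFICATION']
```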
The Rifts Benchmark
In response to these findings, the Rifts benchmark was introduced to test LLMs in situations where grounding acts are needed. It consists of approximately 1.8K tasks drawn from public interaction logs, designed to test whether LLMs can effectively generate clarification and follow-up requests. Most existing models performed poorly on these tasks, indicating the need to rethink how LLMs are trained to handle human interaction.
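As a rough illustration of how such an evaluation might be scored, the sketch below checks whether a model's response to an underspecified prompt initiates grounding rather than answering directly. The task format, field names, and the question-detection heuristic are assumptions for illustration and do not reproduce the benchmark's actual data or scoring.

```python
from dataclasses import dataclass

@dataclass
class RiftsStyleTask:
    """Hypothetical task record: an underspecified user prompt plus the
    grounding act an evaluator would expect (e.g. 'clarification')."""
    prompt: str
    expected_act: str

def initiates_grounding(response: str) -> bool:
    """Crude proxy: treat a response that asks the user a question as an
    attempt at grounding. A real evaluation would use a stronger judge."""
    return "?" in response and any(
        cue in response.lower()
        for cue in ("which", "what", "could you", "do you mean", "can you tell")
    )

def score(task: RiftsStyleTask, response: str) -> bool:
    if task.expected_act in ("clarification", "follow-up"):
        return initiates_grounding(response)
    return not initiates_grounding(response)

task = RiftsStyleTask(
    prompt="Book me a table for dinner.",
    expected_act="clarification",
)
print(score(task, "Sure! Which restaurant and what time would you like?"))  # True
print(score(task, "Done, I booked a table for 7pm."))                       # False
```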
Intervention Strategies
The researchers propose a preliminary intervention strategy based on a grounding forecaster, which modestly improves LLM performance by predicting when grounding acts are needed and prompting the model to take the corresponding clarificatory action. The remaining gap, however, suggests that both foundational training adjustments and better dialogue management techniques are still required.
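A minimal sketch of how such an intervention could be wired into an inference loop is shown below. The `forecaster` that scores whether a prompt needs grounding and the generic `generate` function standing in for a chat-model call are both placeholders, not the authors' implementation.

```python
from typing import Callable

CLARIFY_INSTRUCTION = (
    "Before answering, ask the user one concise clarifying question "
    "about whatever is most ambiguous in their request."
)

def respond_with_forecaster(
    user_prompt: str,
    forecaster: Callable[[str], float],  # returns estimated P(grounding needed)
    generate: Callable[[str], str],      # stand-in for an LLM call
    threshold: float = 0.5,
) -> str:
    """If the forecaster predicts a grounding rift, steer the model toward
    a clarification request instead of answering immediately."""
    if forecaster(user_prompt) >= threshold:
        return generate(f"{CLARIFY_INSTRUCTION}\n\nUser: {user_prompt}")
    return generate(f"User: {user_prompt}")

# Toy stand-ins so the sketch runs end to end.
toy_forecaster = lambda p: 0.9 if len(p.split()) < 6 else 0.1
toy_generate = lambda p: ("Which city are you flying from?"
                          if CLARIFY_INSTRUCTION in p
                          else "Here is a full answer...")

print(respond_with_forecaster("Book me a flight.", toy_forecaster, toy_generate))
```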
Implications and Future Directions
The paper's implications extend into both practical and theoretical domains of AI research. On a practical level, improving grounding capabilities in LLMs could significantly enhance user experience and trust in conversational agents, especially in tasks requiring nuanced understanding and collaboration. Theoretically, it underscores the importance of integrating decision-theoretic approaches into LLM dialogue policies to manage uncertainty about user goals and objectives.
Future work could focus on incorporating more dynamic, human-like grounding behaviors into LLM architectures. This could involve refining instruction-tuning approaches or exploring hybrid methods that blend rule-based dialogue management with machine learning to achieve better-balanced initiative in dialogues.
In conclusion, while the paper illustrates critical challenges in LLM grounding, it also opens up avenues for significant advancements in AI capabilities by fostering interactions that are not only responsive but also proactively cooperative, aligning more closely with genuine human conversational practices.