
Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries (2505.20451v1)

Published 26 May 2025 in cs.CL

Abstract: Today, LLMs are widely used as judges to evaluate responses from other LLMs. Hence, it is imperative to benchmark and improve these LLM-judges on real-world LLM usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter's significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all four datasets.

Summary

Analysis of the Amulet Framework for LLM-Judging in Multi-Turn Conversations

The paper "Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries" introduces Amulet, an innovative framework aimed at enhancing the accuracy of LLM judges in evaluating preference data within complex, multi-turn conversational contexts. This contribution addresses a critical need in contemporary NLP research, as real-world dialogues between humans and machine assistants exhibit significant variability in topics, intents, and communicative structures over multiple conversational turns.

Key Contributions

The research identifies two primary linguistic concepts pivotal in improving LLM-based judgment systems: dialog acts (DA) and Grice's maxims. Dialog acts pertain to the communicative structures and intents inherent in dialogue, while Grice’s maxims encompass principles such as informativity, truth, relevance, and clarity, which determine conversational quality.

Amulet therefore refines LLM-judge accuracy by systematically annotating multi-turn conversations with these structures and principles. The observed frequency with which humans change their conversational intents, 60 to 70 percent of the time from one turn to the next, demonstrates the need for such detailed annotation, while the finding that dialog acts and/or maxim satisfaction can differentiate the preference responses in roughly 75 percent of instances highlights the framework's utility.
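
To make this annotate-then-judge flow concrete, the Python sketch below shows one plausible way to implement it. This is not the authors' code: the function names, prompt wording, and the `call_llm` helper are illustrative assumptions, and the paper's actual prompting strategy may differ.

```python
# Illustrative sketch of an Amulet-style judging pass (not the authors' implementation).
# `call_llm` is a hypothetical helper that sends a prompt to any LLM client and
# returns its text response; the prompts below only gesture at the idea.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

def annotate_dialog_acts(conversation: list[str]) -> list[str]:
    """Label each turn with a dialog act (e.g. question, feedback, task request)."""
    return [
        call_llm(f"Identify the dialog act of this conversational turn:\n{turn}")
        for turn in conversation
    ]

def check_maxims(conversation: list[str], response: str) -> str:
    """Assess how well a candidate response satisfies Grice's maxims
    (informativity, truth, relevance, clarity) in this conversational context."""
    return call_llm(
        "Given the conversation below, rate the response on informativity, "
        f"truthfulness, relevance and clarity.\nConversation: {conversation}\n"
        f"Response: {response}"
    )

def judge_preference(conversation: list[str], resp_a: str, resp_b: str) -> str:
    """Pick the preferred response using dialog-act and maxim insights."""
    acts = annotate_dialog_acts(conversation)
    maxims_a = check_maxims(conversation, resp_a)
    maxims_b = check_maxims(conversation, resp_b)
    return call_llm(
        f"Dialog acts per turn: {acts}\n"
        f"Maxim assessment of response A: {maxims_a}\n"
        f"Maxim assessment of response B: {maxims_b}\n"
        "Which response (A or B) better serves the final user turn?"
    )
```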

Experimental Validation

The effectiveness of Amulet is demonstrated across four diverse, challenging datasets, showcasing the robustness of the framework. Analysis of these datasets shows that humans change their dialog-act intents roughly 60 to 70 percent of the time between conversational turns, and that in about 75 percent of instances the preference responses can be differentiated via dialog acts and/or maxims. This suggests that these linguistic concepts are crucial for LLM judges tasked with understanding nuanced conversational exchanges.

The paper further supplements single-judge evaluation by introducing amalgamated judge systems, or “juries”. These integrate Amulet’s dialog-act and maxim judgments with standard LLM judges and state-of-the-art reward models. The resulting Amulet-LM-Jury and Amulet-RM-Jury consistently outperform the relevant baselines on all four datasets, delivering substantial improvements in judgment accuracy.
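
As a rough illustration of the jury idea, the snippet below aggregates the verdicts of several independent judges by majority vote. The voting rule and judge roster here are assumptions made for illustration; the paper's Amulet-LM-Jury and Amulet-RM-Jury combine Amulet judges with plain LLM judges or reward models and may weigh their outputs differently.

```python
from collections import Counter

def jury_vote(verdicts: list[str]) -> str:
    """Combine per-judge preference verdicts ('A' or 'B') by simple majority."""
    return Counter(verdicts).most_common(1)[0][0]

# Hypothetical example: a plain LLM judge, a dialog-act judge, and a
# maxim judge each emit a preference for the same response pair.
verdicts = ["A", "B", "A"]
print(jury_vote(verdicts))  # -> A
```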

Implications and Future Directions

The practical implications of this research are vast, particularly for AI systems involved in dialogue generation and evaluation. By embedding nuanced linguistic structures into LLM evaluations, Amulet enhances our understanding of conversational AI and its alignment with human communication norms. This can significantly improve user experience in applications ranging from virtual assistants to automated customer service, where understanding user intent and context change is paramount.

Theoretically, this work opens avenues for further exploration of dialog acts and conversational maxims. Future investigations could incorporate additional linguistic features such as dependency relations and explore Amulet's application in broader AI training contexts. Moreover, given the computational cost of current LLM configurations, adapting Amulet's principles to smaller, more efficient models could democratize access and foster more widespread adoption.

Conclusion

The paper presents a robust case for the inclusion of dialog structure and communicative principles in evaluating multi-turn conversations, marking a notable advance in utilizing linguistic theories for practical LLM evaluation. Amulet, with its systematic approach to dialog complexity, emerges as a vital tool for the development and assessment of conversational AI systems. Through its applications, this framework holds potential to significantly enhance the alignment of AI models with human-like conversational dynamics, setting the course for future advancements in the responsible and effective deployment of large-scale LLMs.
