Benchmarking and Improving Generator-Validator Consistency of Language Models (2310.01846v1)

Published 3 Oct 2023 in cs.CL and cs.LG

Abstract: As of September 2023, ChatGPT correctly answers "what is 7+8" with 15, but when asked "7+8=15, True or False" it responds with "False". This inconsistency between generating and validating an answer is prevalent in language models (LMs) and erodes trust. In this paper, we propose a framework for measuring the consistency between generation and validation (which we call generator-validator consistency, or GV-consistency), finding that even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time. To improve the consistency of LMs, we propose to finetune on the filtered generator and validator responses that are GV-consistent, and call this approach consistency fine-tuning. We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%, and the improvement extrapolates to unseen tasks and domains (e.g., GV-consistency for positive style transfers extrapolates to unseen styles like humor). In addition to improving consistency, consistency fine-tuning improves both generator quality and validator accuracy without using any labeled data. Evaluated across 6 tasks, including math questions, knowledge-intensive QA, and instruction following, our method improves the generator quality by 16% and the validator accuracy by 6.3% across all tasks.

Citations (21)

Summary

  • The paper introduces the GV-consistency metric to directly measure and improve the alignment between generation and validation outputs in language models.
  • The authors propose a consistency fine-tuning method that raised Alpaca-30B’s GV-consistency from 60% to 93%, enhancing overall model performance.
  • Evaluation across tasks such as arithmetic and style transfer highlights key challenges in reconciling generator and validator outputs for reliable AI applications.

Analyzing Generator-Validator Consistency in LLMs

In recent computational research, attention has been drawn to the phenomenon of generator-validator inconsistencies within language models (LMs). The paper "Benchmarking and Improving Generator-Validator Consistency of Language Models" takes a systematic approach to identifying and reducing cases where an LM's generated answers contradict its own validation judgments. Left unaddressed, this inconsistency erodes trust in LMs and limits their applicability in critical use cases.

Core Contributions and Methodology

The authors introduce a new metric, generator-validator consistency (GV-consistency), to directly measure and improve the reliability of LMs by harmonizing their generator and validator functions. The paper first establishes an evaluation framework that diagnoses GV-consistency across a selection of tasks, ranging from arithmetic to style transfer, exposing failure points in LMs at a finer granularity. Notably, even a state-of-the-art model like GPT-4 was shown to maintain only 76% GV-consistency, which points to a significant reliability challenge that the research aims to mitigate.
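To make the evaluation concrete, the following is a minimal sketch of how such a GV-consistency check could be scored. The `query_lm` helper and the prompt templates are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Minimal sketch of scoring GV-consistency. `query_lm` is a hypothetical
# helper that sends a prompt to the LM and returns its text completion;
# the prompt templates below are illustrative, not the paper's exact ones.

def gv_consistency(query_lm, problems):
    """Fraction of problems where the validator affirms the generator's answer."""
    consistent = 0
    for problem in problems:
        # Generator query: the model produces an answer.
        answer = query_lm(f"Q: {problem}\nA:").strip()
        # Validator query: the same model judges its own answer.
        verdict = query_lm(
            f"Q: {problem}\nProposed answer: {answer}\n"
            f"Is the proposed answer correct? Answer True or False:"
        ).strip()
        # A pair counts as GV-consistent when the validator says True for
        # the generator's own output, regardless of whether that output
        # is actually correct.
        if verdict.lower().startswith("true"):
            consistent += 1
    return consistent / len(problems)
```

On the abstract's motivating example, a generator answer of "15" to "what is 7+8" followed by a validator verdict of "False" would count as inconsistent under this metric.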

To address these discrepancies, the authors propose a technique called "consistency fine-tuning". The model is fine-tuned on the filtered subset of responses where the generator and validator outputs already agree, using that subset to reinforce consistent behavior. This approach yields a substantial improvement in GV-consistency on the Alpaca-30B model, from 60% to 93%, while also improving generator quality and validator accuracy by 16% and 6.3% respectively across the evaluated tasks.
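As a rough illustration of the data-filtering step, the sketch below assumes each record bundles a generator prompt/response with the matching validator prompt and verdict; the field names and record layout are assumptions for exposition, not the paper's actual pipeline.

```python
# Illustrative sketch of building the consistency fine-tuning set.
# Each record is assumed to hold a generator prompt/response plus the
# validator prompt and its verdict on that response; the field names
# are assumptions, not the paper's implementation.

def build_consistency_set(records):
    """Keep only GV-consistent pairs and train on both roles' responses."""
    examples = []
    for r in records:
        if not r["validator_affirms"]:
            continue  # drop pairs where the validator contradicts the generator
        # Fine-tune the model on its own generator response ...
        examples.append({"prompt": r["gen_prompt"], "completion": r["gen_response"]})
        # ... and on the matching validator response, so that generating
        # and validating behaviors are reinforced jointly.
        examples.append({"prompt": r["val_prompt"], "completion": r["val_response"]})
    return examples
```

The key property is that no gold labels are needed: the filter only checks agreement between the model's own generator and validator outputs, which is why the abstract can report quality gains "without using any labeled data."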

Numerical Findings

The paper’s numerical results are noteworthy. Under consistency fine-tuning, Alpaca-30B's consistency rises sharply from 60% to 93%. Among the pretrained models evaluated, GPT-3.5 scores the highest average consistency at 79.1%, outperforming text-davinci-003 (77%) and GPT-4 (75.8%). These results suggest that even capable LMs continue to struggle with GV-consistency, especially on newer and more complex tasks like plan arithmetic and prompt prioritization, where consistency hovered around 60%.

Implications and Future Directions

The implications of improved GV-consistency are substantial, particularly for applications demanding high reliability and trust, such as autonomous vehicles, critical information processing, and customer-facing AI interfaces. With enhanced consistency, LMs could provide more dependable outputs in real-world settings, such as more reliable mathematical reasoning and contextually faithful style transfer.

Theoretical implications underscore the necessity of designing LMs that not only perform well on generation and validation tasks individually, but also keep the outputs of the two roles mutually consistent. This research serves as a foundation for future work that integrates more complex validation protocols and fine-tuning approaches to further refine consistency.

Future research might focus on extending consistency fine-tuning to other LM tasks, incorporating different languages, and increasing the diversity of domain-specific contexts. Moreover, finer-grained measures of consistency that capture nuanced aspects of language interpretation could further tighten the alignment between generator and validator behavior.

In conclusion, the paper presents a significant step toward reliable and trustworthy LLMs by demonstrating a targeted fine-tuning strategy that directly addresses inherent model inconsistencies. This work invites continued exploration of techniques for aligning the generating and validating behavior of LLMs.