Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits (2409.14509v3)

Published 22 Sep 2024 in cs.CL, cs.CY, and cs.HC

Abstract: LLM-based applications are helping people write, and LLM-generated text is making its way into social media, journalism, and our classrooms. However, the differences between LLM-generated and human-written text remain unclear. To explore this, we hired professional writers to edit paragraphs in several creative domains. We first found these writers agree on undesirable idiosyncrasies in LLM-generated text, formalizing it into a seven-category taxonomy (e.g. cliches, unnecessary exposition). Second, we curated the LAMP corpus: 1,057 LLM-generated paragraphs edited by professional writers according to our taxonomy. Analysis of LAMP reveals that none of the LLMs used in our study (GPT4o, Claude-3.5-Sonnet, Llama-3.1-70b) outperform each other in terms of writing quality, revealing common limitations across model families. Third, we explored automatic editing methods to improve LLM-generated text. A large-scale preference annotation confirms that although experts largely prefer text edited by other experts, automatic editing methods show promise in improving alignment between LLM-generated and human-written text.

Authors (3)
  1. Tuhin Chakrabarty (33 papers)
  2. Philippe Laban (40 papers)
  3. Chien-Sheng Wu (77 papers)
Citations (2)

Summary

Mitigating Idiosyncrasies in LLM-Generated Text through Expert Edits: The LAMP Corpus

The paper "Can AI Writing Be Salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits" provides a robust framework for understanding and improving the quality of text generated by LLMs. Through a systematic, expert-driven approach, the authors aim to bridge the gap between LLM-generated text and human writing standards.

Overview of the Study

The paper explores several key dimensions:

  1. Identifying Idiosyncrasies in LLM-Generated Text: Establishes a seven-category taxonomy of writing issues based on consensus from professional writers.
  2. Creating the LAMP Corpus: Compiles 1,057 LLM-generated paragraphs edited by professional writers, annotated with 8,035 fine-grained edits.
  3. Comparative Analysis of LLMs: Evaluates writing quality across GPT-4o, Claude-3.5 Sonnet, and Llama-3.1-70B, finding no significant performance differences.
  4. Developing and Testing Automated Methods: Proposes and evaluates automated detection and rewriting techniques to improve LLM-generated text.

Key Findings

Expert Consensus on Writing Issues

In the formative study, professional writers converged on seven core categories of issues frequently observed in LLM-generated text:

  • Clichés
  • Unnecessary/Redundant Exposition
  • Purple Prose
  • Poor Sentence Structure
  • Lack of Specificity and Detail
  • Awkward Word Choice and Phrasing
  • Tense Inconsistency

These categories were derived from an analysis of 50 initial free-form categories, refined through iterative discussion, and consolidated into the final taxonomy. The taxonomy provides a structured way to identify and rectify common pitfalls in AI writing.
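
For readers who want to build on the taxonomy, the seven categories map naturally onto a small data type. The sketch below is illustrative only: the category names come from the paper, but the enum itself is an assumption, not code released by the authors.

```python
from enum import Enum

class WritingIssue(Enum):
    """The paper's seven-category taxonomy of idiosyncrasies in
    LLM-generated text. Category names follow the paper; this enum
    is an illustrative assumption, not the authors' released code."""
    CLICHE = "cliche"
    UNNECESSARY_EXPOSITION = "unnecessary_redundant_exposition"
    PURPLE_PROSE = "purple_prose"
    POOR_SENTENCE_STRUCTURE = "poor_sentence_structure"
    LACK_OF_SPECIFICITY = "lack_of_specificity_and_detail"
    AWKWARD_PHRASING = "awkward_word_choice_and_phrasing"
    TENSE_INCONSISTENCY = "tense_inconsistency"
```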

The LAMP Corpus

The LAMP (LLM Authored, Manually Polished) corpus, released by the authors, consists of 1,057 LLM-generated paragraphs edited by 18 professional writers. Analysis of these edits reveals several key insights:

  • Edit Operations: Primarily replacements (74%), followed by deletions (18%) and insertions (8%); a tallying sketch follows this list.
  • Editing Variability: Significant variation exists across different writers in terms of the type and quantity of edits.
  • Model Comparisons: Contrary to expectations, none of the three LLMs produced significantly higher-quality text than the others, pointing to common limitations across model families.
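
As a concrete illustration of the edit-operation breakdown above, the sketch below shows one plausible way LAMP-style edit records could be represented and tallied. The `Edit` fields and classification rule are hypothetical assumptions; the corpus's actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Edit:
    """One fine-grained edit to an LLM-generated paragraph.
    Field names are hypothetical, not the corpus's actual schema."""
    category: str        # one of the seven taxonomy categories
    original_span: str   # text before the edit ("" for an insertion)
    edited_span: str     # text after the edit ("" for a deletion)

def operation(edit: Edit) -> str:
    """Classify an edit as a replacement, deletion, or insertion."""
    if edit.original_span and edit.edited_span:
        return "replacement"
    return "deletion" if edit.original_span else "insertion"

def operation_distribution(edits: list[Edit]) -> dict[str, float]:
    """Fraction of edits per operation type (cf. the 74/18/8% split)."""
    counts = Counter(operation(e) for e in edits)
    return {op: n / len(edits) for op, n in counts.items()}

# Hypothetical example edits, purely for illustration.
edits = [
    Edit("cliche", "a whirlwind of emotions", "a dull ache she could not name"),
    Edit("unnecessary_redundant_exposition", "Needless to say, ", ""),
]
print(operation_distribution(edits))  # {'replacement': 0.5, 'deletion': 0.5}
```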

Automated Methods for Detection and Rewriting

To tackle the scalability issue of expert edits, the paper proposes automated methods for detecting and rewriting problematic spans in LLM-generated text:

  • Detection: Framed as a multi-span categorical extraction problem and evaluated with precision metrics. A few-shot prompting approach achieved the highest overall precision (0.46), below human agreement levels but promising; a span-precision sketch follows this list.
  • Rewriting: Few-shot prompting techniques tailored for each edit category were used to rewrite problematic spans. The integration of these methods into an automated editing pipeline showed encouraging results in manual evaluations.
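
To make the detection evaluation concrete, the sketch below computes span-level precision against writer annotations, counting a predicted span as correct if it overlaps a gold span of the same category. The overlap criterion is an assumption for illustration, not necessarily the paper's exact matching rule.

```python
def spans_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if character ranges [a0, a1) and [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def span_precision(predicted, gold) -> float:
    """Precision over (start, end, category) spans.

    A prediction counts as correct if it overlaps a gold span with
    the same category label. This matching rule is an assumption;
    the paper's exact criterion may differ.
    """
    if not predicted:
        return 0.0
    hits = sum(
        any(p[2] == g[2] and spans_overlap(p[:2], g[:2]) for g in gold)
        for p in predicted
    )
    return hits / len(predicted)

# Hypothetical example: one correct detection out of two predictions.
pred = [(0, 12, "cliche"), (30, 41, "purple_prose")]
gold = [(3, 12, "cliche"), (55, 70, "tense_inconsistency")]
print(span_precision(pred, gold))  # 0.5
```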

Experimental Insights

Preference Evaluation

In a preference annotation study involving 12 expert writers, the texts were consistently ranked in the following order:

  1. Writer-edited
  2. LLM-edited (both oracle and fully automated detection)
  3. Original LLM-generated

These results confirm that while automated edits may not match the quality of expert edits, they substantially improve on the original LLM-generated text.
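
As a small illustration of how such a ranking could be aggregated, the sketch below averages per-paragraph ranks across annotators (lower is better). The data values and format are hypothetical, not the paper's annotations.

```python
from statistics import mean

# Each annotation ranks the three conditions for one paragraph,
# from best (1) to worst (3). Values are illustrative only.
annotations = [
    {"writer_edited": 1, "llm_edited": 2, "original": 3},
    {"writer_edited": 1, "llm_edited": 3, "original": 2},
    {"writer_edited": 2, "llm_edited": 1, "original": 3},
]

mean_rank = {
    cond: mean(a[cond] for a in annotations)
    for cond in annotations[0]
}
# Sort conditions from most to least preferred (lowest mean rank first).
for cond, rank in sorted(mean_rank.items(), key=lambda kv: kv[1]):
    print(f"{cond}: {rank:.2f}")
```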

Implications and Future Directions

Practical Implications

The findings have several practical implications for the deployment of AI-driven writing tools:

  • Enhanced Productivity: Automated editing can substantially improve initial LLM drafts, making subsequent human editing more efficient.
  • Educational Applications: LLM-based tools can be developed to assist students with writing, providing instant feedback and improvement suggestions.

Theoretical Implications

From a computational linguistics perspective, this research advances our understanding of:

  • Human-AI Alignment: Highlighting the necessity of aligning AI-generated content with human standards and preferences.
  • Scalability of Expert Knowledge: Demonstrating the feasibility of encoding expert knowledge into automated systems.

Speculations on Future Developments

Future research could aim to:

  • Expand Corpus Domains: Including a broader range of genres beyond creative writing.
  • Improve Automated Methods: Incorporating more advanced models and training on larger datasets could yield better performance.
  • Longitudinal Studies: Investigating the long-term effects of LLM-generated and edited text on readers and writers.

Conclusion

The paper provides a detailed, empirical approach to understanding and mitigating the limitations of LLM-generated text. By combining expert knowledge with automated techniques, the work both improves the quality of AI writing and advances human-AI interaction in creative and technical writing. The release of the LAMP corpus offers a robust resource for future research on AI-assisted writing tools.