
Beyond the Chat: Executable and Verifiable Text-Editing with LLMs (2309.15337v1)

Published 27 Sep 2023 in cs.CL and cs.HC

Abstract: Conversational interfaces powered by LLMs have recently become a popular way to obtain feedback during document editing. However, standard chat-based conversational interfaces do not support transparency and verifiability of the editing changes that they suggest. To give the author more agency when editing with an LLM, we present InkSync, an editing interface that suggests executable edits directly within the document being edited. Because LLMs are known to introduce factual errors, InkSync also supports a 3-stage approach to mitigate this risk: Warn authors when a suggested edit introduces new information, help authors Verify the new information's accuracy through external search, and allow an auditor to perform an a-posteriori verification by Auditing the document via a trace of all auto-generated content. Two usability studies confirm the effectiveness of InkSync's components when compared to standard LLM-based chat interfaces, leading to more accurate, more efficient editing, and improved user experience.


Summary

  • The paper introduces the InkSync interface, enabling executable text edits via LLMs for enhanced interactivity and verifiability.
  • The methodology combines Chat, Comment, Markers, and Brainstorm components to improve language quality and user control.
  • Usability studies reveal that the Warn, Verify, and Audit framework nearly doubles the rate at which factual inaccuracies are avoided.

Executable and Verifiable Text Editing with LLMs

The development of LLMs has considerably enhanced the capabilities of automated text editing. The paper "Beyond the Chat: Executable and Verifiable Text-Editing with LLMs" introduces InkSync, a novel interface that leverages LLMs to make text editing more interactive, transparent, and verifiable. This summary provides a detailed overview of the key innovations, findings, and potential implications highlighted in the paper.

Innovative System Components

InkSync introduces several components designed to enhance user control and transparency in text editing tasks. The system enables users to interact with features such as Chat, Comment, Markers, and Brainstorm to suggest executable edits within a document. These components aim to mitigate challenges associated with traditional conversational LLM interfaces, such as a lack of agency and difficulty in managing factual accuracy.

Figure 1: Responses by 64 surveyed participants on document editing habits and LLM usage in text editing tasks.

InkSync Interface Overview

InkSync facilitates real-time document editing by integrating suggestions directly into the text, visually differentiated by underlines and colors that indicate the source component. Users can view, accept, or dismiss these suggestions, enabling a high degree of interaction and customization. Figure 2 illustrates the interface layout, highlighting the component panels and the editing workspace.

Figure 2: The InkSync text editing interface layout.
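To make the notion of an "executable edit" concrete, each suggestion can be thought of as a span replacement plus metadata that the author can preview, accept, or dismiss. The TypeScript sketch below is purely illustrative; the type and field names are assumptions for exposition, not InkSync's actual data model.

```typescript
// Hypothetical representation of an executable edit: a span replacement
// plus metadata. All names here are illustrative assumptions.
type EditSource = "chat" | "comment" | "marker" | "brainstorm";

interface ExecutableEdit {
  source: EditSource;         // which component suggested it (drives highlight color)
  start: number;              // character offset where the edit begins
  end: number;                // character offset where the edit ends (exclusive)
  replacement: string;        // text to substitute for document.slice(start, end)
  introducesNewInfo: boolean; // would trigger a Warn-style cue if true
}

// Accepting an edit is a plain string splice on the document text.
function applyEdit(doc: string, edit: ExecutableEdit): string {
  return doc.slice(0, edit.start) + edit.replacement + doc.slice(edit.end);
}

// Example: replace "verbose" (offsets 10..17) with "succinct".
const example: ExecutableEdit = {
  source: "marker",
  start: 10,
  end: 17,
  replacement: "succinct",
  introducesNewInfo: false,
};
console.log(applyEdit("This is a verbose sentence.", example));
// -> "This is a succinct sentence."
```

Because an edit is fully specified by its span and replacement text, the interface can execute it with one click rather than asking the author to copy text out of a chat transcript.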

Warn, Verify, and Audit System

Given LLMs' tendency to hallucinate or introduce inaccuracies, InkSync's Warn, Verify, and Audit framework allows users to identify, verify, and trace factual content in their documents. This approach involves a three-stage process:

  1. Warn: Alerts users to edits introducing new information via visual cues.
  2. Verify: Provides search queries to facilitate fact-checking.
  3. Audit: Allows for the retrospective tracing of system-generated content.

Figure 3: Overview of the Warn and Verify components and the Audit interface in the InkSync system.
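As a rough illustration of how these three stages might compose in code, consider the minimal TypeScript sketch below. All function and field names are hypothetical and do not reflect the paper's implementation.

```typescript
// Hypothetical sketch of the Warn -> Verify -> Audit pipeline.
interface SuggestedEdit {
  replacement: string;
  newFacts: string[]; // information spans not present in the source document
}

interface AuditRecord {
  edit: SuggestedEdit;
  verified: boolean;
  timestamp: number;
}

const auditTrail: AuditRecord[] = [];

// Warn: surface a visual cue whenever an edit carries unverified new information.
function shouldWarn(edit: SuggestedEdit): boolean {
  return edit.newFacts.length > 0;
}

// Verify: turn each new fact into a search query the author can run externally.
function verifyQueries(edit: SuggestedEdit): string[] {
  return edit.newFacts.map((fact) => `"${fact}" fact check`);
}

// Audit: record every accepted auto-generated edit so a reviewer
// can trace and re-check it after the editing session ends.
function acceptEdit(edit: SuggestedEdit, verified: boolean): void {
  auditTrail.push({ edit, verified, timestamp: Date.now() });
}
```

The key design choice the paper emphasizes is that verification need not block editing: Warn and Verify operate during writing, while the audit trail supports a-posteriori review by a separate auditor.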

Usability Studies and Results

The paper describes two usability studies aimed at evaluating InkSync's effectiveness compared to traditional LLM interfaces.

Study 1: Interaction Style Evaluation

This study assessed users' ability to achieve editing goals using various InkSync components. Results demonstrated that the inclusion of Markers led to significant improvements in language quality, while the Chat and Comment components enhanced document customization. However, creative response diversity was higher under manual editing conditions, indicating that while LLMs improve efficiency, they may also reduce creative diversity.

Study 2: Verification Framework Evaluation

The second study evaluated the Warn, Verify, and Audit components in preventing and detecting inaccuracies. Results showed that enabling these features nearly doubled the rate of avoided inaccuracies compared to interfaces without such support. Additionally, the post-editing audit process caught further factual errors, demonstrating the framework's efficacy.

Implementation and Future Work

InkSync’s open-source nature allows it to be adapted to different LLMs, although the paper primarily utilized GPT-4 due to its advanced capabilities. Future research directions could explore optimizing LLM parameters for increased content diversity and integrating InkSync into collaborative writing platforms to accommodate multi-author dynamics.
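Since the system is described as open source and adaptable to different LLMs, swapping the underlying model plausibly reduces to implementing a small backend interface. The sketch below is an assumption about what such an abstraction could look like, not InkSync's actual code.

```typescript
// Hypothetical model-agnostic backend; names are assumptions for illustration.
interface EditModel {
  suggestEdits(document: string, instruction: string): Promise<string[]>;
}

// Any chat-completion model can sit behind the same interface, so moving
// from GPT-4 to another LLM becomes a configuration change rather than a rewrite.
class ChatCompletionModel implements EditModel {
  constructor(
    private modelName: string,
    private callApi: (model: string, prompt: string) => Promise<string>,
  ) {}

  async suggestEdits(document: string, instruction: string): Promise<string[]> {
    const prompt =
      `Document:\n${document}\n\nInstruction: ${instruction}\n` +
      `Return one suggested edit per line.`;
    const response = await this.callApi(this.modelName, prompt);
    return response.split("\n").filter((line) => line.trim().length > 0);
  }
}
```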

Conclusion

The InkSync interface exemplifies a significant step forward in human-AI text editing interactions. By facilitating executable and verifiable edits, InkSync empowers users with better control and accuracy in document editing, fulfilling critical needs in professional writing environments. As LLM applications continue to expand, the principles and findings presented in this paper will likely influence future developments in AI-driven editing tools.
