Uniform Complexity for Text Generation (2204.05185v3)

Published 11 Apr 2022 in cs.CL and cs.LG

Abstract: LLMs have shown promising results in a wide array of generative NLP tasks, such as summarization and machine translation. In the context of narrative generation, however, existing models still do not capture the factors that contribute to producing consistent text. For instance, it is logical that a piece of text or a story should be uniformly readable throughout and that this form of complexity should be controllable. As such, if the complexity of an input text prompt is rated at a first-grade reading level by the Flesch Reading Ease test, then the generated text continuing the plot should also fall within this range of complexity. With this in mind, we introduce Uniform Complexity for Text Generation (UCTG), a new benchmark that challenges generative models to observe uniform linguistic properties with respect to their prompts. We experiment with over 150 linguistically and cognitively motivated features for evaluating text complexity in humans and generative models. From our results, we find that models such as GPT-2 struggle to preserve the complexity of the input prompts used in their generations, even when finetuned with professionally written texts.

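To make the abstract's "uniform complexity" criterion concrete, the sketch below scores a prompt and a generated continuation with the Flesch Reading Ease formula and reports the gap between the two. This is not the paper's evaluation code: the helper names, the example texts, and the naive vowel-group syllable counter are assumptions made for illustration only.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels (heuristic only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease (Flesch, 1948):
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def complexity_gap(prompt: str, continuation: str) -> float:
    """Absolute readability difference between a prompt and its continuation.
    Under the uniform-complexity idea, a well-behaved generator keeps this gap small."""
    return abs(flesch_reading_ease(prompt) - flesch_reading_ease(continuation))


if __name__ == "__main__":
    prompt = "The cat sat on the mat. It was a warm and sunny day."
    continuation = ("Notwithstanding the meteorological circumstances, "
                    "the feline remained resolutely stationary.")
    print(f"Prompt FRE:       {flesch_reading_ease(prompt):.1f}")
    print(f"Continuation FRE: {flesch_reading_ease(continuation):.1f}")
    print(f"Complexity gap:   {complexity_gap(prompt, continuation):.1f}")
```

Running the example yields a much lower readability score for the wordier continuation than for the simple prompt, i.e., a large complexity gap, which is exactly the kind of mismatch the benchmark is meant to penalize.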