
ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution (2401.11356v3)

Published 21 Jan 2024 in cs.CL

Abstract: Lexical Substitution discovers appropriate substitutes for a given target word in a context sentence. However, the task fails to consider substitutes that are of equal or higher proficiency than the target, an aspect that could be beneficial for language learners looking to improve their writing. To bridge this gap, we propose a new task, language proficiency-oriented lexical substitution. We also introduce ProLex, a novel benchmark designed to assess systems' ability to generate not only appropriate substitutes but also substitutes that demonstrate better language proficiency. Besides the benchmark, we propose models that can automatically perform the new task. We show that our best model, a Llama2-13B model fine-tuned with task-specific synthetic data, outperforms ChatGPT by an average of 3.2% in F-score and achieves comparable results with GPT-4 on ProLex.


Summary

  • The paper introduces ProLex, a benchmark that evaluates proficiency-oriented lexical substitution to enhance vocabulary diversity among L2 learners.
  • It leverages a human-annotated dataset from TOEFL-11 essays and candidate substitutes generated by GPT-4 to ensure contextual and grammatical accuracy.
  • Models like Llama2-13B, fine-tuned with synthetic data, outperformed larger LLMs, demonstrating effective proficiency-based lexical substitutions.

Introduction

Among automatic English learning tools, grammar correction systems have received considerable attention, but helping learners diversify their vocabulary through apt lexical choices is equally important. Researchers have long observed that English second-language (L2) learners tend to rely on a limited vocabulary set, which hampers their expressive writing. Existing lexical substitution systems help learners identify appropriate word alternatives within a given context, promoting vocabulary expansion, but prior work largely disregards the proficiency level of the substitutes relative to the target word.
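The proficiency constraint can be illustrated with a minimal sketch: given contextually appropriate candidate substitutes, keep only those at an equal or higher proficiency level than the target word. The CEFR labels and word list below are hypothetical examples for illustration, not data from ProLex itself.

```python
# Illustrative sketch of the proficiency constraint in
# proficiency-oriented lexical substitution. The CEFR labels below
# are hypothetical, not taken from the ProLex benchmark.

CEFR_ORDER = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}

# Hypothetical CEFR proficiency labels for a handful of words.
CEFR_LEVELS = {
    "big": "A1",
    "large": "A1",
    "substantial": "C1",
    "considerable": "B2",
}

def proficiency_filter(target: str, candidates: list[str]) -> list[str]:
    """Keep only substitutes at an equal or higher CEFR level than the target."""
    target_rank = CEFR_ORDER[CEFR_LEVELS[target]]
    return [c for c in candidates if CEFR_ORDER[CEFR_LEVELS[c]] >= target_rank]

# Candidates are assumed to already be contextually and grammatically
# appropriate; the filter only enforces the proficiency constraint.
result = proficiency_filter("considerable", ["large", "substantial"])
print(result)  # → ['substantial']
```

In practice the appropriateness check (semantics, collocation, grammar) is the hard part; the proficiency filter is the additional requirement this task introduces on top of it.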

ProLex Benchmark

To address this gap, the paper presents ProLex, a benchmark for evaluating language proficiency-oriented lexical substitution, advancing beyond the current paradigm that prioritizes contextual suitability alone. ProLex draws its target words from the TOEFL-11 essay corpus, which reflects typical L2 English learner usage patterns, ensuring that the benchmark aligns with the lexicon of learners at lower proficiency levels. A salient feature of ProLex is its human-annotated dataset: human experts vet candidate substitutes generated by GPT-4, following a comprehensive annotation scheme covering semantic integrity, collocation accuracy, lexical variation, and grammatical correctness.
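Systems evaluated on ProLex are scored with an F-score over predicted versus gold substitute sets, as reported in the abstract. The sketch below shows plain set-based precision/recall/F1; the exact weighting ProLex applies may differ, and the word sets are illustrative.

```python
# Minimal sketch of set-based F-score evaluation for lexical
# substitution, in the spirit of the F-score reported on ProLex.
# The benchmark's exact scoring details may differ.

def f_score(predicted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over substitute sets."""
    if not predicted or not gold:
        return 0.0
    hits = len(predicted & gold)          # substitutes the system got right
    precision = hits / len(predicted)
    recall = hits / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted substitutes for one target word.
gold = {"substantial", "considerable", "significant"}
predicted = {"substantial", "significant", "huge"}
score = f_score(predicted, gold)
print(round(score, 3))  # → 0.667
```

Averaging this per-target score over the benchmark yields the aggregate figures quoted in the abstract (e.g., the fine-tuned Llama2-13B beating ChatGPT by 3.2% on average).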

Methodology and Model Performance

To facilitate automated assessment of this task, the authors develop models and benchmark them against ProLex. The strongest is a Llama2-13B model fine-tuned with task-specific synthetic data, which outperformed larger contemporary LLMs on the benchmark's metrics. GPT-4's strong performance in zero-shot and in-context learning settings further illustrates the feasibility of LLMs for semantically demanding tasks such as proficiency-oriented lexical substitution.
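A zero-shot setup of the kind evaluated here reduces to constructing an instruction that states both requirements (contextual fit and higher proficiency) and parsing the model's reply. The prompt wording below is illustrative only; it is not the paper's actual template.

```python
# Hedged sketch of a zero-shot prompt for proficiency-oriented lexical
# substitution. The wording is illustrative, not the template used in
# the paper's experiments.

def build_prompt(sentence: str, target: str) -> str:
    """Assemble an instruction asking an LLM for proficiency-aware substitutes."""
    return (
        "Suggest substitutes for the target word in the sentence below. "
        "Each substitute must fit the context grammatically and demonstrate "
        "equal or higher language proficiency than the target word.\n"
        f"Sentence: {sentence}\n"
        f"Target word: {target}\n"
        "Substitutes:"
    )

prompt = build_prompt("The project had a big impact on the community.", "big")
print(prompt)
```

The same instruction format can carry in-context examples (prepended demonstration pairs) or serve as the input side of the synthetic fine-tuning data for a model like Llama2-13B.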

Conclusions and Prospects

In summary, ProLex paves the way for advances in computational English language learning, particularly in broadening vocabulary and improving writing quality among L2 learners. The benchmark enables systems to recommend substitutes that are both lexically diverse and proficiency-appropriate, supporting educational progress. Moving forward, the authors plan to expand the corpus, refining its representativeness and fostering further system development in L2 instructional technology.
