Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature (2210.09932v2)

Published 18 Oct 2022 in cs.CL

Abstract: Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming to rectify these issues, we present two novel lay summarisation datasets, PLOS (large-scale) and eLife (medium-scale), each of which contains biomedical journal articles alongside expert-written lay summaries. We provide a thorough characterisation of our lay summaries, highlighting differing levels of readability and abstractiveness between datasets that can be leveraged to support the needs of different applications. Finally, we benchmark our datasets using mainstream summarisation approaches and perform a manual evaluation with domain experts, demonstrating their utility and casting light on the key challenges of this task.
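The abstract notes that the two datasets differ in readability, which can be exploited for different applications. As an illustration only (the paper's exact metric tooling is not specified here), readability is commonly quantified with formulas such as Flesch Reading Ease, which penalises long sentences and polysyllabic words. A minimal self-contained sketch, using a heuristic syllable counter rather than a dictionary:

```python
import re

def count_syllables(word: str) -> int:
    """Heuristic syllable count: runs of vowels, minus a trailing silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1  # drop a likely-silent final 'e' (e.g. "facilitate")
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Under this metric, an expert-written lay summary should score noticeably higher (easier) than the dense technical prose of the article body, which is the kind of contrast the paper's dataset characterisation measures.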

Authors (4)
  1. Tomas Goldsack (10 papers)
  2. Zhihao Zhang (61 papers)
  3. Chenghua Lin (127 papers)
  4. Carolina Scarton (52 papers)
Citations (64)
