
FRACAS: A FRench Annotated Corpus of Attribution relations in newS (2309.10604v1)

Published 19 Sep 2023 in cs.CL

Abstract: Quotation extraction is a useful task from both a sociological and a Natural Language Processing perspective. However, very little data is available to study this task in languages other than English. In this paper, we present a manually annotated corpus of 1676 newswire texts in French for quotation extraction and source attribution. We first describe the composition of our corpus and the choices made in selecting the data. We then detail the annotation guidelines and annotation process, and report a few statistics on the final corpus and the resulting balance between quote types (direct, indirect and mixed, which are particularly challenging). We end by detailing the inter-annotator agreement among the 8 annotators who worked on manual labelling, which is high for such a difficult linguistic phenomenon.
