Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark
Abstract: Topic segmentation and outline generation aim to divide a document into coherent topic sections and generate corresponding subheadings, revealing the discourse topic structure of the document. Compared with sentence-level topic structure, paragraph-level topic structure gives a higher-level view from which the overall context of a document can be grasped quickly, benefiting many downstream tasks such as summarization, discourse parsing, and information retrieval. However, the lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has restrained related research and applications. To fill this gap, we build a Chinese paragraph-level topic representation, corpus, and benchmark in this paper. First, we propose a hierarchical paragraph-level topic structure representation with three layers to guide the corpus construction. Then, we employ a two-stage man-machine collaborative annotation method to construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), achieving high quality. We also build several strong baselines, including ChatGPT, to validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation), and we preliminarily verify its usefulness for a downstream task (discourse parsing).
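To make the two tasks concrete, the following minimal sketch shows one way the output of paragraph-level topic segmentation and outline generation could be combined: boundary labels group paragraphs into sections, and each section receives one subheading. The `TopicSection` class, the `boundaries_to_sections` helper, and the convention that label `1` marks the last paragraph of a section are all assumptions made for illustration; they are not the CPTS annotation format or the paper's three-layer representation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TopicSection:
    """One topic section: a subheading plus the paragraphs it covers."""
    subheading: str
    paragraphs: List[str]


def boundaries_to_sections(
    paragraphs: List[str],
    boundary_labels: List[int],
    subheadings: List[str],
) -> List[TopicSection]:
    """Group paragraphs into topic sections.

    boundary_labels[i] == 1 means paragraph i closes its section
    (a hypothetical convention chosen for this sketch).
    """
    if len(paragraphs) != len(boundary_labels):
        raise ValueError("need exactly one boundary label per paragraph")

    groups: List[List[str]] = []
    current: List[str] = []
    for paragraph, label in zip(paragraphs, boundary_labels):
        current.append(paragraph)
        if label == 1:
            groups.append(current)
            current = []
    if current:
        # Trailing paragraphs with no closing boundary form a final section.
        groups.append(current)

    if len(groups) != len(subheadings):
        raise ValueError("need exactly one subheading per section")
    return [TopicSection(h, g) for h, g in zip(subheadings, groups)]


# Toy document: five paragraphs split into two topic sections.
paragraphs = ["p1", "p2", "p3", "p4", "p5"]
boundary_labels = [0, 1, 0, 0, 1]
subheadings = ["Background", "Results"]
sections = boundaries_to_sections(paragraphs, boundary_labels, subheadings)
```

In this toy example the first section holds two paragraphs and the second holds three. A segmentation model would predict `boundary_labels`, and an outline generator would produce `subheadings`.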