A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction (2310.15790v2)
Abstract: A term in a corpus is said to be "bursty" (or overdispersed) when its occurrences are concentrated in few out of many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic for quantifying term burstiness that is inspired by statistical significance testing. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Chi-squared term burstiness scores are computed from the collection frequency statistic (i.e., the proportion of all term occurrences in a corpus that are occurrences of a specified term). They do not draw on the document frequency of a term (i.e., the proportion of documents in a corpus in which the term occurs), a statistic that certain other widely used term burstiness measures exploit. RICF addresses this shortcoming of the chi-squared test: its term burstiness scores systematically incorporate both the collection frequency and document frequency statistics. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. In terms of precision at k (P@k), RICF matched or outperformed the chi-squared test, with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF was competitive with other well-established measures of term burstiness. Based on these findings, we consider our contributions a promising starting point for future work on leveraging statistical significance testing in text analysis.
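The two corpus statistics and the evaluation metric named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's RICF formula, which the abstract does not give. The function names, the toy corpus, and the convention of dividing by k in precision at k are assumptions made for illustration.

```python
def collection_frequency(docs, term):
    """Proportion of all term occurrences in the corpus that are `term`."""
    total_tokens = sum(len(doc) for doc in docs)
    term_count = sum(doc.count(term) for doc in docs)
    return term_count / total_tokens if total_tokens else 0.0


def document_frequency(docs, term):
    """Proportion of documents in which `term` occurs at least once."""
    if not docs:
        return 0.0
    return sum(1 for doc in docs if term in doc) / len(docs)


def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked terms found in the gold-standard set
    (assumed convention: divide by k)."""
    top_k = ranked_terms[:k]
    return sum(1 for term in top_k if term in gold_terms) / k


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["il2", "gene", "il2", "il2"],
    ["the", "gene"],
    ["the", "cell"],
    ["the", "protein"],
]

# "il2" and "the" have the same collection frequency, but "il2" is
# concentrated in a single document while "the" is spread over three.
cf_il2, df_il2 = collection_frequency(docs, "il2"), document_frequency(docs, "il2")
cf_the, df_the = collection_frequency(docs, "the"), document_frequency(docs, "the")

# Hypothetical ranking and gold-standard term set for the metric.
p_at_2 = precision_at_k(["il2", "the", "gene"], {"il2", "gene"}, 2)
```

In this toy corpus, collection frequency alone cannot separate the bursty term "il2" from the evenly spread term "the", whereas document frequency can. This is the kind of information that, per the abstract, RICF incorporates alongside collection frequency.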