
A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction (2310.15790v2)

Published 24 Oct 2023 in cs.IR

Abstract: A term in a corpus is said to be "bursty" (or overdispersed) when its occurrences are concentrated in a few of the many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic for quantifying term burstiness that is inspired by statistical significance testing. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Chi-squared test term burstiness scores are computed from the collection frequency statistic (i.e., the proportion of all term occurrences in a corpus that a specified term accounts for). However, certain other widely used term burstiness measures exploit the document frequency of a term (i.e., the proportion of documents in a corpus in which the term occurs), which the chi-squared test does not use. RICF addresses this shortcoming of the chi-squared test: its term burstiness scores systematically incorporate both the collection frequency and document frequency statistics. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. RICF matched or outperformed the chi-squared test in terms of precision at k, with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF performance was competitive with that of other well-established measures of term burstiness. Based on these findings, we consider our contributions in this paper a promising starting point for future exploration of statistical significance testing in text analysis.
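
The two corpus statistics defined in the abstract, and the precision at k metric used in the evaluation, can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not compute RICF, whose formula is not given in the abstract. The function names and the toy corpus are hypothetical, and documents are assumed to be pre-tokenized lists of terms.

```python
from collections import Counter


def collection_frequency(docs):
    """Proportion of all term occurrences in the corpus accounted for by each term."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def document_frequency(docs):
    """Proportion of documents in which each term occurs at least once."""
    counts = Counter(term for doc in docs for term in set(doc))
    return {term: n / len(docs) for term, n in counts.items()}


def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked terms that appear in the gold set."""
    top_k = ranked_terms[:k]
    return sum(term in gold_terms for term in top_k) / k


# Hypothetical toy corpus: "kinase" is concentrated in one document,
# while "cell" is spread across all four.
docs = [
    ["kinase", "kinase", "kinase", "cell"],
    ["cell", "protein"],
    ["cell", "binding"],
    ["protein", "cell"],
]
cf = collection_frequency(docs)
df = document_frequency(docs)

# Hypothetical ranking and gold-standard term set for the P@k illustration.
p_at_2 = precision_at_k(["kinase", "cell", "protein"], {"kinase", "protein"}, 2)
```

In this toy corpus, "kinase" and "cell" have similar collection frequencies but very different document frequencies, which is the kind of contrast a burstiness measure using both statistics can pick up.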

